LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 net-next 2/3] net: dpaa: Replace in_irq() usage.
From: Sebastian Andrzej Siewior @ 2020-11-01 23:22 UTC (permalink / raw)
  To: netdev
  Cc: Aymen Sghaier, Herbert Xu, Horia Geantă, Madalin Bucur,
	Sebastian Andrzej Siewior, Li Yang, linux-crypto, Jakub Kicinski,
	Thomas Gleixner, linuxppc-dev, David S. Miller, linux-arm-kernel
In-Reply-To: <20201101232257.3028508-1-bigeasy@linutronix.de>

The driver uses in_irq() + in_serving_softirq() magic to decide if NAPI
scheduling is required or packet processing.

The usage of in_*() in drivers is phased out and Linus clearly requested
that code which changes behaviour depending on context should either be
separated or the context be conveyed in an argument passed by the caller,
which usually knows the context.

Use the `sched_napi' argument passed by the callback. It is set true if
called from the interrupt handler and NAPI should be scheduled.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Horia Geantă" <horia.geanta@nxp.com>
Cc: Aymen Sghaier <aymen.sghaier@nxp.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Madalin Bucur <madalin.bucur@nxp.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Li Yang <leoyang.li@nxp.com>
Cc: linux-crypto@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-kernel@lists.infradead.org
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 98ead77c673e5..39c996b64ccec 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2300,9 +2300,9 @@ static void dpaa_tx_conf(struct net_device *net_dev,
 }
 
 static inline int dpaa_eth_napi_schedule(struct dpaa_percpu_priv *percpu_priv,
-					 struct qman_portal *portal)
+					 struct qman_portal *portal, bool sched_napi)
 {
-	if (unlikely(in_irq() || !in_serving_softirq())) {
+	if (sched_napi) {
 		/* Disable QMan IRQ and invoke NAPI */
 		qman_p_irqsource_remove(portal, QM_PIRQ_DQRI);
 
@@ -2333,7 +2333,7 @@ static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
 
 	percpu_priv = this_cpu_ptr(priv->percpu_priv);
 
-	if (dpaa_eth_napi_schedule(percpu_priv, portal))
+	if (dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi))
 		return qman_cb_dqrr_stop;
 
 	dpaa_eth_refill_bpools(priv);
@@ -2377,7 +2377,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	percpu_priv = this_cpu_ptr(priv->percpu_priv);
 	percpu_stats = &percpu_priv->stats;
 
-	if (unlikely(dpaa_eth_napi_schedule(percpu_priv, portal)))
+	if (unlikely(dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi)))
 		return qman_cb_dqrr_stop;
 
 	/* Make sure we didn't run out of buffers */
@@ -2474,7 +2474,7 @@ static enum qman_cb_dqrr_result conf_error_dqrr(struct qman_portal *portal,
 
 	percpu_priv = this_cpu_ptr(priv->percpu_priv);
 
-	if (dpaa_eth_napi_schedule(percpu_priv, portal))
+	if (dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi))
 		return qman_cb_dqrr_stop;
 
 	dpaa_tx_error(net_dev, priv, percpu_priv, &dq->fd, fq->fqid);
@@ -2499,7 +2499,7 @@ static enum qman_cb_dqrr_result conf_dflt_dqrr(struct qman_portal *portal,
 
 	percpu_priv = this_cpu_ptr(priv->percpu_priv);
 
-	if (dpaa_eth_napi_schedule(percpu_priv, portal))
+	if (dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi))
 		return qman_cb_dqrr_stop;
 
 	dpaa_tx_conf(net_dev, priv, percpu_priv, &dq->fd, fq->fqid);
-- 
2.29.1


^ permalink raw reply related

* [PATCH v2 net-next 1/3] soc/fsl/qbman: Add an argument to signal if NAPI processing is required.
From: Sebastian Andrzej Siewior @ 2020-11-01 23:22 UTC (permalink / raw)
  To: netdev
  Cc: Aymen Sghaier, Herbert Xu, Horia Geantă, Madalin Bucur,
	Sebastian Andrzej Siewior, Li Yang, linux-crypto, Jakub Kicinski,
	Thomas Gleixner, linuxppc-dev, David S. Miller, linux-arm-kernel
In-Reply-To: <20201101232257.3028508-1-bigeasy@linutronix.de>

dpaa_eth_napi_schedule() and caam_qi_napi_schedule() schedule NAPI if
invoked from:

 - Hard interrupt context
 - Any context which is not serving soft interrupts

Any context which is not serving soft interrupts includes hard interrupts
so the in_irq() check is redundant. caam_qi_napi_schedule() has a comment
about this:

        /*
         * In case of threaded ISR, for RT kernels in_irq() does not return
         * appropriate value, so use in_serving_softirq to distinguish between
         * softirq and irq contexts.
         */
         if (in_irq() || !in_serving_softirq())

This has nothing to do with RT. Even on a non RT kernel force threaded
interrupts run obviously in thread context and therefore in_irq() returns
false when invoked from the handler.

The extension of the in_irq() check with !in_serving_softirq() was there
when the drivers were added, but in the out of tree FSL BSP the original
condition was in_irq() which got extended due to failures on RT.

The usage of in_xxx() in drivers is phased out and Linus clearly requested
that code which changes behaviour depending on context should either be
separated or the context be conveyed in an argument passed by the caller,
which usually knows the context. Right he is, the above construct is
clearly showing why.

The following callchains have been analyzed to end up in
dpaa_eth_napi_schedule():

qman_p_poll_dqrr()
  __poll_portal_fast()
    fq->cb.dqrr()
       dpaa_eth_napi_schedule()

portal_isr()
  __poll_portal_fast()
    fq->cb.dqrr()
       dpaa_eth_napi_schedule()

Both need to schedule NAPI.
The crypto part has another code path leading up to this:
  kill_fq()
     empty_retired_fq()
       qman_p_poll_dqrr()
         __poll_portal_fast()
            fq->cb.dqrr()
               dpaa_eth_napi_schedule()

kill_fq() is called from task context and ends up scheduling NAPI, but
that's pointless and an unintended side effect of the !in_serving_softirq()
check.

The code path:
  caam_qi_poll() -> qman_p_poll_dqrr()

is invoked from NAPI and I *assume* from crypto's NAPI device and not
from qbman's NAPI device. I *guess* it is okay to skip scheduling NAPI
(because this is what happens now) but could be changed if it is wrong
due to `budget' handling.

Add an argument to __poll_portal_fast() which is true if NAPI needs to be
scheduled. This requires propagating the value to the caller including
`qman_cb_dqrr' typedef which is used by the dpaa and the crypto driver.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Horia Geantă" <horia.geanta@nxp.com>
Cc: Aymen Sghaier <aymen.sghaier@nxp.com>
Cc: Herbert XS <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Madalin Bucur <madalin.bucur@nxp.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Li Yang <leoyang.li@nxp.com>
Cc: linux-crypto@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-kernel@lists.infradead.org
---
 drivers/crypto/caam/qi.c                       |  3 ++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 12 ++++++++----
 drivers/soc/fsl/qbman/qman.c                   | 12 ++++++------
 drivers/soc/fsl/qbman/qman_test_api.c          |  6 ++++--
 drivers/soc/fsl/qbman/qman_test_stash.c        |  6 ++++--
 include/soc/fsl/qman.h                         |  3 ++-
 6 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/crypto/caam/qi.c b/drivers/crypto/caam/qi.c
index ec53528d82058..57f6ab6bfb56a 100644
--- a/drivers/crypto/caam/qi.c
+++ b/drivers/crypto/caam/qi.c
@@ -564,7 +564,8 @@ static int caam_qi_napi_schedule(struct qman_portal *p, struct caam_napi *np)
 
 static enum qman_cb_dqrr_result caam_rsp_fq_dqrr_cb(struct qman_portal *p,
 						    struct qman_fq *rsp_fq,
-						    const struct qm_dqrr_entry *dqrr)
+						    const struct qm_dqrr_entry *dqrr,
+						    bool sched_napi)
 {
 	struct caam_napi *caam_napi = raw_cpu_ptr(&pcpu_qipriv.caam_napi);
 	struct caam_drv_req *drv_req;
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 06cc863f4dd63..98ead77c673e5 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2316,7 +2316,8 @@ static inline int dpaa_eth_napi_schedule(struct dpaa_percpu_priv *percpu_priv,
 
 static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
 					      struct qman_fq *fq,
-					      const struct qm_dqrr_entry *dq)
+					      const struct qm_dqrr_entry *dq,
+					      bool sched_napi)
 {
 	struct dpaa_fq *dpaa_fq = container_of(fq, struct dpaa_fq, fq_base);
 	struct dpaa_percpu_priv *percpu_priv;
@@ -2343,7 +2344,8 @@ static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
 
 static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 						struct qman_fq *fq,
-						const struct qm_dqrr_entry *dq)
+						const struct qm_dqrr_entry *dq,
+						bool sched_napi)
 {
 	struct skb_shared_hwtstamps *shhwtstamps;
 	struct rtnl_link_stats64 *percpu_stats;
@@ -2460,7 +2462,8 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 
 static enum qman_cb_dqrr_result conf_error_dqrr(struct qman_portal *portal,
 						struct qman_fq *fq,
-						const struct qm_dqrr_entry *dq)
+						const struct qm_dqrr_entry *dq,
+						bool sched_napi)
 {
 	struct dpaa_percpu_priv *percpu_priv;
 	struct net_device *net_dev;
@@ -2481,7 +2484,8 @@ static enum qman_cb_dqrr_result conf_error_dqrr(struct qman_portal *portal,
 
 static enum qman_cb_dqrr_result conf_dflt_dqrr(struct qman_portal *portal,
 					       struct qman_fq *fq,
-					       const struct qm_dqrr_entry *dq)
+					       const struct qm_dqrr_entry *dq,
+					       bool sched_napi)
 {
 	struct dpaa_percpu_priv *percpu_priv;
 	struct net_device *net_dev;
diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 9888a70618730..101def7dc73d4 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -1159,7 +1159,7 @@ static u32 fq_to_tag(struct qman_fq *fq)
 
 static u32 __poll_portal_slow(struct qman_portal *p, u32 is);
 static inline unsigned int __poll_portal_fast(struct qman_portal *p,
-					unsigned int poll_limit);
+					unsigned int poll_limit, bool sched_napi);
 static void qm_congestion_task(struct work_struct *work);
 static void qm_mr_process_task(struct work_struct *work);
 
@@ -1174,7 +1174,7 @@ static irqreturn_t portal_isr(int irq, void *ptr)
 
 	/* DQRR-handling if it's interrupt-driven */
 	if (is & QM_PIRQ_DQRI) {
-		__poll_portal_fast(p, QMAN_POLL_LIMIT);
+		__poll_portal_fast(p, QMAN_POLL_LIMIT, true);
 		clear = QM_DQAVAIL_MASK | QM_PIRQ_DQRI;
 	}
 	/* Handling of anything else that's interrupt-driven */
@@ -1602,7 +1602,7 @@ static noinline void clear_vdqcr(struct qman_portal *p, struct qman_fq *fq)
  * user callbacks to call into any QMan API.
  */
 static inline unsigned int __poll_portal_fast(struct qman_portal *p,
-					unsigned int poll_limit)
+					unsigned int poll_limit, bool sched_napi)
 {
 	const struct qm_dqrr_entry *dq;
 	struct qman_fq *fq;
@@ -1636,7 +1636,7 @@ static inline unsigned int __poll_portal_fast(struct qman_portal *p,
 			 * and we don't want multiple if()s in the critical
 			 * path (SDQCR).
 			 */
-			res = fq->cb.dqrr(p, fq, dq);
+			res = fq->cb.dqrr(p, fq, dq, sched_napi);
 			if (res == qman_cb_dqrr_stop)
 				break;
 			/* Check for VDQCR completion */
@@ -1646,7 +1646,7 @@ static inline unsigned int __poll_portal_fast(struct qman_portal *p,
 			/* SDQCR: context_b points to the FQ */
 			fq = tag_to_fq(be32_to_cpu(dq->context_b));
 			/* Now let the callback do its stuff */
-			res = fq->cb.dqrr(p, fq, dq);
+			res = fq->cb.dqrr(p, fq, dq, sched_napi);
 			/*
 			 * The callback can request that we exit without
 			 * consuming this entry nor advancing;
@@ -1753,7 +1753,7 @@ EXPORT_SYMBOL(qman_start_using_portal);
 
 int qman_p_poll_dqrr(struct qman_portal *p, unsigned int limit)
 {
-	return __poll_portal_fast(p, limit);
+	return __poll_portal_fast(p, limit, false);
 }
 EXPORT_SYMBOL(qman_p_poll_dqrr);
 
diff --git a/drivers/soc/fsl/qbman/qman_test_api.c b/drivers/soc/fsl/qbman/qman_test_api.c
index 7066b2f1467cc..28fbddc3c2048 100644
--- a/drivers/soc/fsl/qbman/qman_test_api.c
+++ b/drivers/soc/fsl/qbman/qman_test_api.c
@@ -45,7 +45,8 @@
 
 static enum qman_cb_dqrr_result cb_dqrr(struct qman_portal *,
 					struct qman_fq *,
-					const struct qm_dqrr_entry *);
+					const struct qm_dqrr_entry *,
+					bool sched_napi);
 static void cb_ern(struct qman_portal *, struct qman_fq *,
 		   const union qm_mr_entry *);
 static void cb_fqs(struct qman_portal *, struct qman_fq *,
@@ -208,7 +209,8 @@ int qman_test_api(void)
 
 static enum qman_cb_dqrr_result cb_dqrr(struct qman_portal *p,
 					struct qman_fq *fq,
-					const struct qm_dqrr_entry *dq)
+					const struct qm_dqrr_entry *dq,
+					bool sched_napi)
 {
 	if (WARN_ON(fd_neq(&fd_dq, &dq->fd))) {
 		pr_err("BADNESS: dequeued frame doesn't match;\n");
diff --git a/drivers/soc/fsl/qbman/qman_test_stash.c b/drivers/soc/fsl/qbman/qman_test_stash.c
index e87b65403b672..b7e8e5ec884c6 100644
--- a/drivers/soc/fsl/qbman/qman_test_stash.c
+++ b/drivers/soc/fsl/qbman/qman_test_stash.c
@@ -275,7 +275,8 @@ static inline int process_frame_data(struct hp_handler *handler,
 
 static enum qman_cb_dqrr_result normal_dqrr(struct qman_portal *portal,
 					    struct qman_fq *fq,
-					    const struct qm_dqrr_entry *dqrr)
+					    const struct qm_dqrr_entry *dqrr,
+					    bool sched_napi)
 {
 	struct hp_handler *handler = (struct hp_handler *)fq;
 
@@ -293,7 +294,8 @@ static enum qman_cb_dqrr_result normal_dqrr(struct qman_portal *portal,
 
 static enum qman_cb_dqrr_result special_dqrr(struct qman_portal *portal,
 					     struct qman_fq *fq,
-					     const struct qm_dqrr_entry *dqrr)
+					     const struct qm_dqrr_entry *dqrr,
+					     bool sched_napi)
 {
 	struct hp_handler *handler = (struct hp_handler *)fq;
 
diff --git a/include/soc/fsl/qman.h b/include/soc/fsl/qman.h
index 9f484113cfda7..59eeba31c1928 100644
--- a/include/soc/fsl/qman.h
+++ b/include/soc/fsl/qman.h
@@ -689,7 +689,8 @@ enum qman_cb_dqrr_result {
 };
 typedef enum qman_cb_dqrr_result (*qman_cb_dqrr)(struct qman_portal *qm,
 					struct qman_fq *fq,
-					const struct qm_dqrr_entry *dqrr);
+					const struct qm_dqrr_entry *dqrr,
+					bool sched_napi);
 
 /*
  * This callback type is used when handling ERNs, FQRNs and FQRLs via MR. They
-- 
2.29.1


^ permalink raw reply related

* Re: [PATCH net-next 14/15] net: dpaa: Replace in_irq() usage.
From: Sebastian Andrzej Siewior @ 2020-11-01 23:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Aymen Sghaier, Madalin Bucur, Zhu Yanjun, Samuel Chessman,
	Ping-Ke Shih, Herbert Xu, Horia Geantă, linux-rdma,
	Rain River, Kalle Valo, Ulrich Kunitz, Jouni Malinen,
	linuxppc-dev, Thomas Gleixner, linux-arm-kernel, Leon Romanovsky,
	netdev, linux-wireless, Li Yang, linux-crypto, Jon Mason,
	Saeed Mahameed, David S. Miller
In-Reply-To: <20201031101215.38a13e51@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net>

On 2020-10-31 10:12:15 [-0700], Jakub Kicinski wrote:
> Nit: some networking drivers have a bool napi which means "are we
> running in napi context", the semantics here feel a little backwards,
> at least to me. But if I'm the only one thinking this, so be it.

I renamed it to `sched_napi'.

Sebastian

^ permalink raw reply

* [PATCH v3 4/4] arch, mm: make kernel_page_present() always available
From: Mike Rapoport @ 2020-11-01 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, Pavel Machek, H. Peter Anvin, sparclinux,
	Christoph Lameter, Will Deacon, linux-riscv, linux-s390, x86,
	Mike Rapoport, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Len Brown, Albert Ou, Vasily Gorbik, linux-pm,
	Heiko Carstens, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Kirill A. Shutemov, Thomas Gleixner,
	linux-arm-kernel, Rafael J. Wysocki, linux-kernel, Pekka Enberg,
	Palmer Dabbelt, Joonsoo Kim, Edgecombe, Rick P, linuxppc-dev,
	David S. Miller, Mike Rapoport
In-Reply-To: <20201101170815.9795-1-rppt@kernel.org>

From: Mike Rapoport <rppt@linux.ibm.com>

For architectures that enable ARCH_HAS_SET_MEMORY having the ability to
verify that a page is mapped in the kernel direct map can be useful
regardless of hibernation.

Add RISC-V implementation of kernel_page_present(), update its forward
declarations and stubs to be a part of set_memory API and remove ugly
ifdefery in inlcude/linux/mm.h around current declarations of
kernel_page_present().

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/pageattr.c            |  4 +---
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c            | 29 +++++++++++++++++++++++++++++
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c        |  4 +---
 include/linux/mm.h                  |  7 -------
 include/linux/set_memory.h          |  5 +++++
 8 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 9384fd8fc13c..45217f21f1fe 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -140,6 +140,7 @@ int set_memory_valid(unsigned long addr, int numpages, int enable);
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 #include <asm-generic/cacheflush.h>
 
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 439325532be1..92eccaf595c8 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -186,8 +186,8 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
 }
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
-#ifdef CONFIG_HIBERNATION
 /*
  * This function is used to determine if a linear map page has been marked as
  * not-valid. Walk the page table and check the PTE_VALID bit. This is based
@@ -234,5 +234,3 @@ bool kernel_page_present(struct page *page)
 	ptep = pte_offset_kernel(pmdp, addr);
 	return pte_valid(READ_ONCE(*ptep));
 }
-#endif /* CONFIG_HIBERNATION */
-#endif /* CONFIG_DEBUG_PAGEALLOC */
diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index 4c5bae7ca01c..d690b08dff2a 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -24,6 +24,7 @@ static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 321b09d2e2ea..87ba5a68bbb8 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -198,3 +198,32 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 			     __pgprot(0), __pgprot(_PAGE_PRESENT));
 }
 #endif
+
+bool kernel_page_present(struct page *page)
+{
+	unsigned long addr = (unsigned long)page_address(page);
+	pgd_t *pgd;
+	pud_t *pud;
+	p4d_t *p4d;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset_k(addr);
+	if (!pgd_present(*pgd))
+		return false;
+
+	p4d = p4d_offset(pgd, addr);
+	if (!p4d_present(*p4d))
+		return false;
+
+	pud = pud_offset(p4d, addr);
+	if (!pud_present(*pud))
+		return false;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		return false;
+
+	pte = pte_offset_kernel(pmd, addr);
+	return pte_present(*pte);
+}
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 5948218f35c5..4352f08bfbb5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -82,6 +82,7 @@ int set_pages_rw(struct page *page, int numpages);
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 extern int kernel_set_to_readonly;
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bc9be96b777f..16f878c26667 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2226,8 +2226,8 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 
 	arch_flush_lazy_mmu_mode();
 }
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
-#ifdef CONFIG_HIBERNATION
 bool kernel_page_present(struct page *page)
 {
 	unsigned int level;
@@ -2239,8 +2239,6 @@ bool kernel_page_present(struct page *page)
 	pte = lookup_address((unsigned long)page_address(page), &level);
 	return (pte_val(*pte) & _PAGE_PRESENT);
 }
-#endif /* CONFIG_HIBERNATION */
-#endif /* CONFIG_DEBUG_PAGEALLOC */
 
 int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 				   unsigned numpages, unsigned long page_flags)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ab0ef6bd351d..44b82f22e76a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2937,16 +2937,9 @@ static inline void debug_pagealloc_map_pages(struct page *page,
 	if (debug_pagealloc_enabled_static())
 		__kernel_map_pages(page, numpages, enable);
 }
-
-#ifdef CONFIG_HIBERNATION
-extern bool kernel_page_present(struct page *page);
-#endif	/* CONFIG_HIBERNATION */
 #else	/* CONFIG_DEBUG_PAGEALLOC */
 static inline void debug_pagealloc_map_pages(struct page *page,
 					     int numpages, int enable) {}
-#ifdef CONFIG_HIBERNATION
-static inline bool kernel_page_present(struct page *page) { return true; }
-#endif	/* CONFIG_HIBERNATION */
 #endif	/* CONFIG_DEBUG_PAGEALLOC */
 
 #ifdef __HAVE_ARCH_GATE_AREA
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 860e0f843c12..fe1aa4e54680 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -23,6 +23,11 @@ static inline int set_direct_map_default_noflush(struct page *page)
 {
 	return 0;
 }
+
+static inline bool kernel_page_present(struct page *page)
+{
+	return true;
+}
 #endif
 
 #ifndef set_mce_nospec
-- 
2.28.0


^ permalink raw reply related

* [PATCH v3 3/4] arch, mm: restore dependency of __kernel_map_pages() of DEBUG_PAGEALLOC
From: Mike Rapoport @ 2020-11-01 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, Pavel Machek, H. Peter Anvin, sparclinux,
	Christoph Lameter, Will Deacon, linux-riscv, linux-s390, x86,
	Mike Rapoport, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Len Brown, Albert Ou, Vasily Gorbik, linux-pm,
	Heiko Carstens, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Kirill A. Shutemov, Thomas Gleixner,
	linux-arm-kernel, Rafael J. Wysocki, linux-kernel, Pekka Enberg,
	Palmer Dabbelt, Joonsoo Kim, Edgecombe, Rick P, linuxppc-dev,
	David S. Miller, Mike Rapoport
In-Reply-To: <20201101170815.9795-1-rppt@kernel.org>

From: Mike Rapoport <rppt@linux.ibm.com>

The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must never
fail. With this assumption is wouldn't be safe to allow general usage of
this function.

Moreover, some architectures that implement __kernel_map_pages() have this
function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
pages when page allocation debugging is disabled at runtime.

As all the users of __kernel_map_pages() were converted to use
debug_pagealloc_map_pages() it is safe to make it available only when
DEBUG_PAGEALLOC is set.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/Kconfig                     |  3 +++
 arch/arm64/Kconfig               |  4 +---
 arch/arm64/mm/pageattr.c         |  8 ++++++--
 arch/powerpc/Kconfig             |  5 +----
 arch/riscv/Kconfig               |  4 +---
 arch/riscv/include/asm/pgtable.h |  2 --
 arch/riscv/mm/pageattr.c         |  2 ++
 arch/s390/Kconfig                |  4 +---
 arch/sparc/Kconfig               |  4 +---
 arch/x86/Kconfig                 |  4 +---
 arch/x86/mm/pat/set_memory.c     |  2 ++
 include/linux/mm.h               | 10 +++++++---
 12 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..56d4752b6db6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1028,6 +1028,9 @@ config HAVE_STATIC_CALL_INLINE
 	bool
 	depends on HAVE_STATIC_CALL
 
+config ARCH_SUPPORTS_DEBUG_PAGEALLOC
+	bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f858c352f72a..5a01dfb77b93 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -71,6 +71,7 @@ config ARM64
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_USE_SYM_ANNOTATIONS
+	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 	select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK
 	select ARCH_SUPPORTS_ATOMIC_RMW
@@ -1005,9 +1006,6 @@ config HOLES_IN_ZONE
 
 source "kernel/Kconfig.hz"
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-	def_bool y
-
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	select SPARSEMEM_VMEMMAP_ENABLE
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 1b94f5b82654..439325532be1 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -155,7 +155,7 @@ int set_direct_map_invalid_noflush(struct page *page)
 		.clear_mask = __pgprot(PTE_VALID),
 	};
 
-	if (!rodata_full)
+	if (!debug_pagealloc_enabled() && !rodata_full)
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -170,7 +170,7 @@ int set_direct_map_default_noflush(struct page *page)
 		.clear_mask = __pgprot(PTE_RDONLY),
 	};
 
-	if (!rodata_full)
+	if (!debug_pagealloc_enabled() && !rodata_full)
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -178,6 +178,7 @@ int set_direct_map_default_noflush(struct page *page)
 				   PAGE_SIZE, change_page_range, &data);
 }
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	if (!debug_pagealloc_enabled() && !rodata_full)
@@ -186,6 +187,7 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
 }
 
+#ifdef CONFIG_HIBERNATION
 /*
  * This function is used to determine if a linear map page has been marked as
  * not-valid. Walk the page table and check the PTE_VALID bit. This is based
@@ -232,3 +234,5 @@ bool kernel_page_present(struct page *page)
 	ptep = pte_offset_kernel(pmdp, addr);
 	return pte_valid(READ_ONCE(*ptep));
 }
+#endif /* CONFIG_HIBERNATION */
+#endif /* CONFIG_DEBUG_PAGEALLOC */
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e9f13fe08492..ad8a83f3ddca 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -146,6 +146,7 @@ config PPC
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select ARCH_OPTIONAL_KERNEL_RWX		if ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_SUPPORTS_ATOMIC_RMW
+	select ARCH_SUPPORTS_DEBUG_PAGEALLOC	if PPC32 || PPC_BOOK3S_64
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
@@ -355,10 +356,6 @@ config PPC_OF_PLATFORM_PCI
 	depends on PCI
 	depends on PPC64 # not supported on 32 bits yet
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-	depends on PPC32 || PPC_BOOK3S_64
-	def_bool y
-
 config ARCH_SUPPORTS_UPROBES
 	def_bool y
 
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 44377fd7860e..9283c6f9ae2a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -14,6 +14,7 @@ config RISCV
 	def_bool y
 	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_SUPPORTS_ATOMIC_RMW
+	select ARCH_SUPPORTS_DEBUG_PAGEALLOC if MMU
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
@@ -153,9 +154,6 @@ config ARCH_SELECT_MEMORY_MODEL
 config ARCH_WANT_GENERAL_HUGETLB
 	def_bool y
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-	def_bool y
-
 config SYS_SUPPORTS_HUGETLBFS
 	depends on MMU
 	def_bool y
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 183f1f4b2ae6..41a72861987c 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -461,8 +461,6 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 #define VMALLOC_START		0
 #define VMALLOC_END		TASK_SIZE
 
-static inline void __kernel_map_pages(struct page *page, int numpages, int enable) {}
-
 #endif /* !CONFIG_MMU */
 
 #define kern_addr_valid(addr)   (1) /* FIXME */
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 19fecb362d81..321b09d2e2ea 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -184,6 +184,7 @@ int set_direct_map_default_noflush(struct page *page)
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	if (!debug_pagealloc_enabled())
@@ -196,3 +197,4 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 		__set_memory((unsigned long)page_address(page), numpages,
 			     __pgprot(0), __pgprot(_PAGE_PRESENT));
 }
+#endif
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 4a2a12be04c9..991a850a6c0b 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -35,9 +35,6 @@ config GENERIC_LOCKBREAK
 config PGSTE
 	def_bool y if KVM
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-	def_bool y
-
 config AUDIT_ARCH
 	def_bool y
 
@@ -106,6 +103,7 @@ config S390
 	select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
 	select ARCH_STACKWALK
 	select ARCH_SUPPORTS_ATOMIC_RMW
+	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index a6ca135442f9..2c729b8d097a 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -88,6 +88,7 @@ config SPARC64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_ARCH_AUDITSYSCALL
 	select ARCH_SUPPORTS_ATOMIC_RMW
+	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select HAVE_NMI
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select ARCH_USE_QUEUED_RWLOCKS
@@ -148,9 +149,6 @@ config GENERIC_ISA_DMA
 	bool
 	default y if SPARC32
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-	def_bool y if SPARC64
-
 config PGTABLE_LEVELS
 	default 4 if 64BIT
 	default 3
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..0db3fb1da70c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,6 +91,7 @@ config X86
 	select ARCH_STACKWALK
 	select ARCH_SUPPORTS_ACPI
 	select ARCH_SUPPORTS_ATOMIC_RMW
+	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
@@ -329,9 +330,6 @@ config ZONE_DMA32
 config AUDIT_ARCH
 	def_bool y if X86_64
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-	def_bool y
-
 config KASAN_SHADOW_OFFSET
 	hex
 	depends on KASAN
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 40baa90e74f4..bc9be96b777f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2194,6 +2194,7 @@ int set_direct_map_default_noflush(struct page *page)
 	return __set_pages_p(page, 1);
 }
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	if (PageHighMem(page))
@@ -2239,6 +2240,7 @@ bool kernel_page_present(struct page *page)
 	return (pte_val(*pte) & _PAGE_PRESENT);
 }
 #endif /* CONFIG_HIBERNATION */
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
 int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 				   unsigned numpages, unsigned long page_flags)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 14e397f3752c..ab0ef6bd351d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2924,7 +2924,11 @@ static inline bool debug_pagealloc_enabled_static(void)
 	return static_branch_unlikely(&_debug_pagealloc_enabled);
 }
 
-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
+#ifdef CONFIG_DEBUG_PAGEALLOC
+/*
+ * To support DEBUG_PAGEALLOC architecture must ensure that
+ * __kernel_map_pages() never fails
+ */
 extern void __kernel_map_pages(struct page *page, int numpages, int enable);
 
 static inline void debug_pagealloc_map_pages(struct page *page,
@@ -2937,13 +2941,13 @@ static inline void debug_pagealloc_map_pages(struct page *page,
 #ifdef CONFIG_HIBERNATION
 extern bool kernel_page_present(struct page *page);
 #endif	/* CONFIG_HIBERNATION */
-#else	/* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+#else	/* CONFIG_DEBUG_PAGEALLOC */
 static inline void debug_pagealloc_map_pages(struct page *page,
 					     int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
 static inline bool kernel_page_present(struct page *page) { return true; }
 #endif	/* CONFIG_HIBERNATION */
-#endif	/* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+#endif	/* CONFIG_DEBUG_PAGEALLOC */
 
 #ifdef __HAVE_ARCH_GATE_AREA
 extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
-- 
2.28.0


^ permalink raw reply related

* [PATCH v3 2/4] PM: hibernate: make direct map manipulations more explicit
From: Mike Rapoport @ 2020-11-01 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rafael J . Wysocki, David Hildenbrand, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, Pavel Machek,
	H. Peter Anvin, sparclinux, Christoph Lameter, Will Deacon,
	linux-riscv, linux-s390, x86, Mike Rapoport,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas, Len Brown,
	Albert Ou, Vasily Gorbik, linux-pm, Heiko Carstens,
	David Rientjes, Borislav Petkov, Andy Lutomirski, Paul Walmsley,
	Kirill A. Shutemov, Thomas Gleixner, linux-arm-kernel,
	Rafael J. Wysocki, linux-kernel, Pekka Enberg, Palmer Dabbelt,
	Joonsoo Kim, Edgecombe, Rick P, linuxppc-dev, David S. Miller,
	Mike Rapoport
In-Reply-To: <20201101170815.9795-1-rppt@kernel.org>

From: Mike Rapoport <rppt@linux.ibm.com>

When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled a page may be
not present in the direct map and has to be explicitly mapped before it
could be copied.

Introduce hibernate_map_page() that will explicitly use
set_direct_map_{default,invalid}_noflush() for ARCH_HAS_SET_DIRECT_MAP case
and debug_pagealloc_map_pages() for DEBUG_PAGEALLOC case.

The remapping of the pages in safe_copy_page() presumes that it only
changes protection bits in an existing PTE and so it is safe to ignore
return value of set_direct_map_{default,invalid}_noflush().

Still, add a WARN_ON() so that future changes in set_memory APIs will not
silently break hibernation.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/mm.h      | 12 ------------
 kernel/power/snapshot.c | 30 ++++++++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fc0609056dc..14e397f3752c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2927,16 +2927,6 @@ static inline bool debug_pagealloc_enabled_static(void)
 #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
 extern void __kernel_map_pages(struct page *page, int numpages, int enable);
 
-/*
- * When called in DEBUG_PAGEALLOC context, the call should most likely be
- * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
- */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable)
-{
-	__kernel_map_pages(page, numpages, enable);
-}
-
 static inline void debug_pagealloc_map_pages(struct page *page,
 					     int numpages, int enable)
 {
@@ -2948,8 +2938,6 @@ static inline void debug_pagealloc_map_pages(struct page *page,
 extern bool kernel_page_present(struct page *page);
 #endif	/* CONFIG_HIBERNATION */
 #else	/* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable) {}
 static inline void debug_pagealloc_map_pages(struct page *page,
 					     int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 46b1804c1ddf..054c8cce4236 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -76,6 +76,32 @@ static inline void hibernate_restore_protect_page(void *page_address) {}
 static inline void hibernate_restore_unprotect_page(void *page_address) {}
 #endif /* CONFIG_STRICT_KERNEL_RWX  && CONFIG_ARCH_HAS_SET_MEMORY */
 
+static inline void hibernate_map_page(struct page *page, int enable)
+{
+	if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
+		unsigned long addr = (unsigned long)page_address(page);
+		int ret;
+
+		/*
+		 * This should not fail because remapping a page here means
+		 * that we only update protection bits in an existing PTE.
+		 * It is still worth to have WARN_ON() here if something
+		 * changes and this will no longer be the case.
+		 */
+		if (enable)
+			ret = set_direct_map_default_noflush(page);
+		else
+			ret = set_direct_map_invalid_noflush(page);
+
+		if (WARN_ON(ret))
+			return;
+
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	} else {
+		debug_pagealloc_map_pages(page, 1, enable);
+	}
+}
+
 static int swsusp_page_is_free(struct page *);
 static void swsusp_set_page_forbidden(struct page *);
 static void swsusp_unset_page_forbidden(struct page *);
@@ -1355,9 +1381,9 @@ static void safe_copy_page(void *dst, struct page *s_page)
 	if (kernel_page_present(s_page)) {
 		do_copy_page(dst, page_address(s_page));
 	} else {
-		kernel_map_pages(s_page, 1, 1);
+		hibernate_map_page(s_page, 1);
 		do_copy_page(dst, page_address(s_page));
-		kernel_map_pages(s_page, 1, 0);
+		hibernate_map_page(s_page, 0);
 	}
 }
 
-- 
2.28.0


^ permalink raw reply related

* [PATCH v3 1/4] mm: introduce debug_pagealloc_map_pages() helper
From: Mike Rapoport @ 2020-11-01 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, Pavel Machek, H. Peter Anvin, sparclinux,
	Christoph Lameter, Will Deacon, linux-riscv, linux-s390, x86,
	Mike Rapoport, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Len Brown, Albert Ou, Vasily Gorbik, linux-pm,
	Heiko Carstens, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Kirill A. Shutemov, Thomas Gleixner,
	linux-arm-kernel, Rafael J. Wysocki, linux-kernel, Pekka Enberg,
	Palmer Dabbelt, Joonsoo Kim, Edgecombe, Rick P, linuxppc-dev,
	David S. Miller, Mike Rapoport
In-Reply-To: <20201101170815.9795-1-rppt@kernel.org>

From: Mike Rapoport <rppt@linux.ibm.com>

When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
direct mapping after free_pages(). The pages than need to be mapped back
before they could be used. Theese mapping operations use
__kernel_map_pages() guarded with with debug_pagealloc_enabled().

The only place that calls __kernel_map_pages() without checking whether
DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.
Still, on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is
not enabled but set_direct_map_invalid_noflush() may render some pages not
present in the direct map and hibernation code won't be able to save such
pages.

To make page allocation debugging and hibernation interaction more robust,
the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be made
more explicit.

Start with combining the guard condition and the call to
__kernel_map_pages() into a single debug_pagealloc_map_pages() function to
emphasize that __kernel_map_pages() should not be called without
DEBUG_PAGEALLOC and use this new function to map/unmap pages when page
allocation debug is enabled.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h  | 10 ++++++++++
 mm/memory_hotplug.c |  3 +--
 mm/page_alloc.c     |  6 ++----
 mm/slab.c           |  8 +++-----
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..1fc0609056dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2936,12 +2936,22 @@ kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	__kernel_map_pages(page, numpages, enable);
 }
+
+static inline void debug_pagealloc_map_pages(struct page *page,
+					     int numpages, int enable)
+{
+	if (debug_pagealloc_enabled_static())
+		__kernel_map_pages(page, numpages, enable);
+}
+
 #ifdef CONFIG_HIBERNATION
 extern bool kernel_page_present(struct page *page);
 #endif	/* CONFIG_HIBERNATION */
 #else	/* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable) {}
+static inline void debug_pagealloc_map_pages(struct page *page,
+					     int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
 static inline bool kernel_page_present(struct page *page) { return true; }
 #endif	/* CONFIG_HIBERNATION */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b44d4c7ba73b..e2b6043a4428 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -614,8 +614,7 @@ void generic_online_page(struct page *page, unsigned int order)
 	 * so we should map it first. This is better than introducing a special
 	 * case in page freeing fast path.
 	 */
-	if (debug_pagealloc_enabled_static())
-		kernel_map_pages(page, 1 << order, 1);
+	debug_pagealloc_map_pages(page, 1 << order, 1);
 	__free_pages_core(page, order);
 	totalram_pages_add(1UL << order);
 #ifdef CONFIG_HIGHMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..9a66a1ff9193 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1272,8 +1272,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	 */
 	arch_free_page(page, order);
 
-	if (debug_pagealloc_enabled_static())
-		kernel_map_pages(page, 1 << order, 0);
+	debug_pagealloc_map_pages(page, 1 << order, 0);
 
 	kasan_free_nondeferred_pages(page, order);
 
@@ -2270,8 +2269,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	set_page_refcounted(page);
 
 	arch_alloc_page(page, order);
-	if (debug_pagealloc_enabled_static())
-		kernel_map_pages(page, 1 << order, 1);
+	debug_pagealloc_map_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
 	kernel_poison_pages(page, 1 << order, 1);
 	set_page_owner(page, order, gfp_flags);
diff --git a/mm/slab.c b/mm/slab.c
index b1113561b98b..340db0ce74c4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1431,10 +1431,8 @@ static bool is_debug_pagealloc_cache(struct kmem_cache *cachep)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 static void slab_kernel_map(struct kmem_cache *cachep, void *objp, int map)
 {
-	if (!is_debug_pagealloc_cache(cachep))
-		return;
-
-	kernel_map_pages(virt_to_page(objp), cachep->size / PAGE_SIZE, map);
+	debug_pagealloc_map_pages(virt_to_page(objp),
+				  cachep->size / PAGE_SIZE, map);
 }
 
 #else
@@ -2062,7 +2060,7 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
 
 #if DEBUG
 	/*
-	 * If we're going to use the generic kernel_map_pages()
+	 * If we're going to use the generic debug_pagealloc_map_pages()
 	 * poisoning, then it's going to smash the contents of
 	 * the redzone and userword anyhow, so switch them off.
 	 */
-- 
2.28.0


^ permalink raw reply related

* [PATCH v3 0/4] arch, mm: improve robustness of direct map manipulation
From: Mike Rapoport @ 2020-11-01 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, Pavel Machek, H. Peter Anvin, sparclinux,
	Christoph Lameter, Will Deacon, linux-riscv, linux-s390, x86,
	Mike Rapoport, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Len Brown, Albert Ou, Vasily Gorbik, linux-pm,
	Heiko Carstens, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Kirill A. Shutemov, Thomas Gleixner,
	linux-arm-kernel, Rafael J. Wysocki, linux-kernel, Pekka Enberg,
	Palmer Dabbelt, Joonsoo Kim, Edgecombe, Rick P, linuxppc-dev,
	David S. Miller, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

During recent discussion about KVM protected memory, David raised a concern
about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1].

Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
possible that __kernel_map_pages() would fail, but since this function is
void, the failure will go unnoticed.

Moreover, there's lack of consistency of __kernel_map_pages() semantics
across architectures as some guard this function with
#ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page
allocation debugging is disabled at run time and some allow modifying the
direct map regardless of DEBUG_PAGEALLOC settings.

This set straightens this out by restoring dependency of
__kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
accordingly. 

Since currently the only user of __kernel_map_pages() outside
DEBUG_PAGEALLOC, it is updated to make direct map accesses there more
explicit.

[1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com

v3 changes:
* update arm64 changes to avoid regression, per Rick's comments
* fix bisectability

v2 changes:
* Rephrase patch 2 changelog to better describe the change intentions and
implications
* Move removal of kernel_map_pages() from patch 1 to patch 2, per David
https://lore.kernel.org/lkml/20201029161902.19272-1-rppt@kernel.org

v1:
https://lore.kernel.org/lkml/20201025101555.3057-1-rppt@kernel.org

Mike Rapoport (4):
  mm: introduce debug_pagealloc_map_pages() helper
  PM: hibernate: make direct map manipulations more explicit
  arch, mm: restore dependency of __kernel_map_pages() of DEBUG_PAGEALLOC
  arch, mm: make kernel_page_present() always available

 arch/Kconfig                        |  3 +++
 arch/arm64/Kconfig                  |  4 +---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/pageattr.c            |  6 +++--
 arch/powerpc/Kconfig                |  5 +----
 arch/riscv/Kconfig                  |  4 +---
 arch/riscv/include/asm/pgtable.h    |  2 --
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c            | 31 +++++++++++++++++++++++++
 arch/s390/Kconfig                   |  4 +---
 arch/sparc/Kconfig                  |  4 +---
 arch/x86/Kconfig                    |  4 +---
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c        |  4 ++--
 include/linux/mm.h                  | 35 +++++++++++++----------------
 include/linux/set_memory.h          |  5 +++++
 kernel/power/snapshot.c             | 30 +++++++++++++++++++++++--
 mm/memory_hotplug.c                 |  3 +--
 mm/page_alloc.c                     |  6 ++---
 mm/slab.c                           |  8 +++----
 20 files changed, 103 insertions(+), 58 deletions(-)

-- 
2.28.0

*** BLURB HERE ***

Mike Rapoport (4):
  mm: introduce debug_pagealloc_map_pages() helper
  PM: hibernate: make direct map manipulations more explicit
  arch, mm: restore dependency of __kernel_map_pages() of
    DEBUG_PAGEALLOC
  arch, mm: make kernel_page_present() always available

 arch/Kconfig                        |  3 +++
 arch/arm64/Kconfig                  |  4 +---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/pageattr.c            |  6 +++--
 arch/powerpc/Kconfig                |  5 +----
 arch/riscv/Kconfig                  |  4 +---
 arch/riscv/include/asm/pgtable.h    |  2 --
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c            | 31 +++++++++++++++++++++++++
 arch/s390/Kconfig                   |  4 +---
 arch/sparc/Kconfig                  |  4 +---
 arch/x86/Kconfig                    |  4 +---
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c        |  4 ++--
 include/linux/mm.h                  | 35 +++++++++++++----------------
 include/linux/set_memory.h          |  5 +++++
 kernel/power/snapshot.c             | 30 +++++++++++++++++++++++--
 mm/memory_hotplug.c                 |  3 +--
 mm/page_alloc.c                     |  6 ++---
 mm/slab.c                           |  8 +++----
 20 files changed, 103 insertions(+), 58 deletions(-)

-- 
2.28.0


^ permalink raw reply

* Re: [PATCH 2/4] PM: hibernate: improve robustness of mapping pages in the direct map
From: Mike Rapoport @ 2020-11-01 17:02 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david@redhat.com, peterz@infradead.org, catalin.marinas@arm.com,
	dave.hansen@linux.intel.com, linux-mm@kvack.org, paulus@samba.org,
	pavel@ucw.cz, hpa@zytor.com, sparclinux@vger.kernel.org,
	cl@linux.com, will@kernel.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, x86@kernel.org, rppt@linux.ibm.com,
	borntraeger@de.ibm.com, mingo@redhat.com, rientjes@google.com,
	Brown, Len, aou@eecs.berkeley.edu, gor@linux.ibm.com,
	linux-pm@vger.kernel.org, hca@linux.ibm.com, bp@alien8.de,
	luto@kernel.org, paul.walmsley@sifive.com, kirill@shutemov.name,
	tglx@linutronix.de, iamjoonsoo.kim@lge.com,
	linux-arm-kernel@lists.infradead.org, rjw@rjwysocki.net,
	linux-kernel@vger.kernel.org, penberg@kernel.org,
	palmer@dabbelt.com, akpm@linux-foundation.org,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <604554805defb03d158c09aba4b5cced3416a7fb.camel@intel.com>

On Thu, Oct 29, 2020 at 11:19:18PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2020-10-29 at 09:54 +0200, Mike Rapoport wrote:
> > __kernel_map_pages() on arm64 will also bail out if rodata_full is
> > false:
> > void __kernel_map_pages(struct page *page, int numpages, int enable)
> > {
> >         if (!debug_pagealloc_enabled() && !rodata_full)
> >                 return;
> > 
> >         set_memory_valid((unsigned long)page_address(page), numpages,
> > enable);
> > }
> > 
> > So using set_direct_map() to map back pages removed from the direct
> > map
> > with __kernel_map_pages() seems safe to me.
> 
> Heh, one of us must have some simple boolean error in our head. I hope
> its not me! :) I'll try on more time.

Well, then it's me :)
You are right, I misread this and I could not understand why
!rodata_full bothers you.

> __kernel_map_pages() will bail out if rodata_full is false **AND**
> debug page alloc is off. So it will only bail under conditions where
> there could be nothing unmapped on the direct map.
> 
> Equivalent logic would be:
> 	if (!(debug_pagealloc_enabled() || rodata_full))
> 		return;
> 
> Or:
> 	if (debug_pagealloc_enabled() || rodata_full)
> 		set_memory_valid(blah)
> 
> So if either is on, the existing code will try to re-map. But the
> set_direct_map_()'s will only work if rodata_full is on. So switching
> hibernate to set_direct_map() will cause the remap to be missed for the
> debug page alloc case, with !rodata_full.
> 
> It also breaks normal debug page alloc usage with !rodata_full for
> similar reasons after patch 3. The pages would never get unmapped.

I've updated the patches, there should be no regression now.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Arnd Bergmann @ 2020-10-31 21:33 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Juri Lelli, David Airlie, Sebastian Andrzej Siewior,
	open list:BROADCOM NVRAM DRIVER, Ben Segall, Max Filippov,
	Paul Mackerras, Matthew Wilcox, linux-sparc, Vincent Chen,
	Ard Biesheuvel, linux-arch, Vincent Guittot, Paul McKenney,
	the arch/x86 maintainers, Russell King, linux-csky,
	Christoph Hellwig, Peter Zijlstra, Daniel Bristot de Oliveira,
	open list:SYNOPSYS ARC ARCHITECTURE, Mel Gorman, Ingo Molnar,
	linux-xtensa, Arnd Bergmann, Steven Rostedt, Greentime Hu,
	Thomas Gleixner, Dietmar Eggemann, Linux ARM, Herbert Xu,
	Chris Zankel, Michal Simek, Thomas Bogendoerfer, Nick Hu,
	Linux-MM, Vineet Gupta, Linus Torvalds, LKML, Daniel Vetter,
	Guo Ren, Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <20201031160539.Horde.n5yNbG9LoUSWqtuPQW_h3w1@messagerie.c-s.fr>

On Sat, Oct 31, 2020 at 4:04 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
> > There are also some users on 10+ year old 32-bit netbooks or
> > business laptops, both x86 and Apple G4.
> > The longest-lived 32-bit embedded systems with large memory
> > (other than Arm) are probably NXP QorIQ P20xx/P40xx used in
> > military VME bus systems, and low-end embedded systems based
> > on Vortex86.
> > I'm less worried about all of these because upstream kernel
> > support for ppc32 and x86-32 is already bitrotting and they will
> > likely get stuck on the last working kernel before the
> > TI/Renesas/NXP Arm systems do.
> >
>
> Upstream kernel support for ppc32 is bitrotting, seriously ? What do
> you mean exactly ?

I was thinking more of the platform support: out of the twelve
32-bit platforms in arch/powerpc/platforms/, your 8xx is the only
one listed as 'maintained' or 'supported' in the maintainers list,
and that seems to accurately describe the current state.

Freescale seems to have practically stopped contributing to any of
their 32-bit platforms in 2016 after the NXP acquisition and no longer
employing either of the maintainers. Similarly, Ben seems to have
stopped working on powermac in 2016, which was ten years after
the last 32-bit hardware shipped for that platform.

> ppc32 is actively supported, with recent addition of support of
> hugepages, kasan, uaccess protection, VMAP stack, etc ...

That is good to hear, I didn't know about these additions.
What platforms are people using to develop these? Is this
mainly your 8xx work, or is there ongoing development for
platforms that need highmem?

         Arnd

^ permalink raw reply

* Re: [PATCH net-next 14/15] net: dpaa: Replace in_irq() usage.
From: Jakub Kicinski @ 2020-10-31 17:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Aymen Sghaier, Madalin Bucur, Zhu Yanjun, Samuel Chessman,
	Ping-Ke Shih, Herbert Xu, Horia Geantă, linux-rdma,
	Rain River, Kalle Valo, Ulrich Kunitz, Jouni Malinen,
	linuxppc-dev, Daniel Drake, Thomas Gleixner, linux-arm-kernel,
	Leon Romanovsky, netdev, linux-wireless, Li Yang, linux-crypto,
	Jon Mason, Saeed Mahameed, David S. Miller
In-Reply-To: <20201027225454.3492351-15-bigeasy@linutronix.de>

On Tue, 27 Oct 2020 23:54:53 +0100 Sebastian Andrzej Siewior wrote:
> The driver uses in_irq() + in_serving_softirq() magic to decide if NAPI
> scheduling is required or packet processing.
> 
> The usage of in_*() in drivers is phased out and Linus clearly requested
> that code which changes behaviour depending on context should either be
> seperated or the context be conveyed in an argument passed by the caller,
> which usually knows the context.
> 
> Use the `napi' argument passed by the callback. It is set true if
> called from the interrupt handler and NAPI should be scheduled.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: "Horia Geantă" <horia.geanta@nxp.com>
> Cc: Aymen Sghaier <aymen.sghaier@nxp.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Madalin Bucur <madalin.bucur@nxp.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Li Yang <leoyang.li@nxp.com>
> Cc: linux-crypto@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-arm-kernel@lists.infradead.org
> ---
>  drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> index 27835310b718e..2c949acd74c67 100644
> --- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> +++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> @@ -2300,9 +2300,9 @@ static void dpaa_tx_conf(struct net_device *net_dev,
>  }
>  
>  static inline int dpaa_eth_napi_schedule(struct dpaa_percpu_priv *percpu_priv,
> -					 struct qman_portal *portal)
> +					 struct qman_portal *portal, bool napi)
>  {
> -	if (unlikely(in_irq() || !in_serving_softirq())) {
> +	if (napi) {
>  		/* Disable QMan IRQ and invoke NAPI */
>  		qman_p_irqsource_remove(portal, QM_PIRQ_DQRI);
>  

Nit: some networking drivers have a bool napi which means "are we
running in napi context", the semantics here feel a little backwards,
at least to me. But if I'm the only one thinking this, so be it.

^ permalink raw reply

* Re: [PATCH net-next 00/15] in_interrupt() cleanup, part 2
From: Jakub Kicinski @ 2020-10-31 17:08 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Aymen Sghaier, Madalin Bucur, Zhu Yanjun, Samuel Chessman,
	Ping-Ke Shih, Herbert Xu, Horia Geantă, linux-rdma,
	Rain River, Kalle Valo, Ulrich Kunitz, Jouni Malinen,
	linuxppc-dev, Daniel Drake, Thomas Gleixner, linux-arm-kernel,
	Leon Romanovsky, netdev, linux-wireless, Li Yang, linux-crypto,
	Jon Mason, Saeed Mahameed, David S. Miller
In-Reply-To: <20201027225454.3492351-1-bigeasy@linutronix.de>

On Tue, 27 Oct 2020 23:54:39 +0100 Sebastian Andrzej Siewior wrote:
> Folks,
> 
> in the discussion about preempt count consistency across kernel configurations:
> 
>   https://lore.kernel.org/r/20200914204209.256266093@linutronix.de/
> 
> Linus clearly requested that code in drivers and libraries which changes
> behaviour based on execution context should either be split up so that
> e.g. task context invocations and BH invocations have different interfaces
> or if that's not possible the context information has to be provided by the
> caller which knows in which context it is executing.
> 
> This includes conditional locking, allocation mode (GFP_*) decisions and
> avoidance of code paths which might sleep.
> 
> In the long run, usage of 'preemptible, in_*irq etc.' should be banned from
> driver code completely.
> 
> This is part two addressing remaining drivers except for orinoco-usb.

Sebastian, thanks for there, I picked up only the Ethernet patches:

5ce7f3f46f6b net: neterion: s2io: Replace in_interrupt() for context detection
dc5e8bfcd12e net: forcedeth: Replace context and lock check with a lockdep_assert()
beca92820dc4 net: tlan: Replace in_irq() usage

Please repost the wireless stuff for Kalle to linux-wireless@vger, 
and repost the fsl stuff separately (our patchwork queue is huge, 
I can't keep this waiting for an ack any longer). 

^ permalink raw reply

* Re: [PATCH net-next 04/15] net: mlx5: Replace in_irq() usage.
From: Jakub Kicinski @ 2020-10-31 16:59 UTC (permalink / raw)
  To: Leon Romanovsky, Saeed Mahameed
  Cc: Aymen Sghaier, Madalin Bucur, Sebastian Andrzej Siewior,
	Zhu Yanjun, Samuel Chessman, Ping-Ke Shih, Herbert Xu,
	Horia Geantă, linux-rdma, Rain River, Kalle Valo,
	Ulrich Kunitz, Jouni Malinen, Daniel Drake, Thomas Gleixner,
	linux-arm-kernel, netdev, linux-wireless, Li Yang, linux-crypto,
	Jon Mason, linuxppc-dev, David S. Miller
In-Reply-To: <20201027225454.3492351-5-bigeasy@linutronix.de>

On Tue, 27 Oct 2020 23:54:43 +0100 Sebastian Andrzej Siewior wrote:
> mlx5_eq_async_int() uses in_irq() to decide whether eq::lock needs to be
> acquired and released with spin_[un]lock() or the irq saving/restoring
> variants.
> 
> The usage of in_*() in drivers is phased out and Linus clearly requested
> that code which changes behaviour depending on context should either be
> seperated or the context be conveyed in an argument passed by the caller,
> which usually knows the context.
> 
> mlx5_eq_async_int() knows the context via the action argument already so
> using it for the lock variant decision is a straight forward replacement
> for in_irq().
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: linux-rdma@vger.kernel.org

Saeed, please pick this up into your tree.

^ permalink raw reply

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Christophe Leroy @ 2020-10-31 15:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Juri Lelli, David Airlie, Sebastian Andrzej Siewior,
	open list:BROADCOM NVRAM DRIVER, Ben Segall, Max Filippov,
	Paul Mackerras, Matthew Wilcox, linux-sparc, Vincent Chen,
	Ard Biesheuvel, linux-arch, Vincent Guittot, Paul McKenney,
	the arch/x86 maintainers, Russell King, linux-csky,
	Christoph Hellwig, Peter Zijlstra, Daniel Bristot de Oliveira,
	open list:SYNOPSYS ARC ARCHITECTURE, Mel Gorman, Ingo Molnar,
	linux-xtensa, Arnd Bergmann, Steven Rostedt, Greentime Hu,
	Thomas Gleixner, Dietmar Eggemann, Linux ARM, Herbert Xu,
	Chris Zankel, Michal Simek, Thomas Bogendoerfer, Nick Hu,
	Linux-MM, Vineet Gupta, Linus Torvalds, LKML, Daniel Vetter,
	Guo Ren, Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <CAK8P3a3FyKTHDSAPCyP8e7UA0LN3OvAatNK_vQ3tnBsdbou4sA@mail.gmail.com>


> There are also some users on 10+ year old 32-bit netbooks or
> business laptops, both x86 and Apple G4.
> The longest-lived 32-bit embedded systems with large memory
> (other than Arm) are probably NXP QorIQ P20xx/P40xx used in
> military VME bus systems, and low-end embedded systems based
> on Vortex86.
> I'm less worried about all of these because upstream kernel
> support for ppc32 and x86-32 is already bitrotting and they will
> likely get stuck on the last working kernel before the
> TI/Renesas/NXP Arm systems do.
>

Upstream kernel support for ppc32 is bitrotting, seriously ? What do  
you mean exactly ?

ppc32 is actively supported, with recent addition of support of  
hugepages, kasan, uaccess protection, VMAP stack, etc ...

Christophe



^ permalink raw reply

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Arnd Bergmann @ 2020-10-31 13:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juri Lelli, linux-xtensa, Peter Zijlstra,
	Sebastian Andrzej Siewior, open list:BROADCOM NVRAM DRIVER,
	Ben Segall, Linux-MM, Guo Ren, Matthew Wilcox, linux-sparc,
	Vincent Chen, Ingo Molnar, linux-arch, Vincent Guittot,
	Herbert Xu, the arch/x86 maintainers, Russell King, linux-csky,
	Christoph Hellwig, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, Ard Biesheuvel,
	Paul McKenney, linuxppc-dev, Steven Rostedt, Greentime Hu,
	Thomas Gleixner, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Max Filippov,
	Vineet Gupta, LKML, Arnd Bergmann, Daniel Vetter, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <CAHk-=wjjO9BtTUAsLraqZqdzaPGJ-qvubZfwUsmRUX896eHcGw@mail.gmail.com>

On Fri, Oct 30, 2020 at 11:46 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > While at it I might have a look at that debug request from Willy in the
> > other end of this thread. Any comment on that?
> >
> >  https://lore.kernel.org/r/87k0v7mrrd.fsf@nanos.tec.linutronix.de
>
> I do think that it would be nice to have a debug mode, particularly
> since over the last few years we've really lost a lot of HIGHMEM
> coverage (to the point that I've wondered how worthwhile it really is
> to support at all any more - I think it's Arnd who argued that it's
> mainly some embedded ARM variants that will want it for the forseeable
> future).
>
> So I'm honestly somewhat torn. I think HIGHMEM is dying, and yes that
> argues for "non-HIGHMEM had better have some debug coverage", but at
> the same time I think it doesn't even really matter any more.

Shifting the testing of highmem code to the actual users of highmem
sounds reasonable to me. This means it will get broken more often,
but if it doesn't happen all the time, it might serve as a canary:
once a bug survives for long enough, we have a good indication that
users stopped caring ;-)

> At some
> point those embedded ARM platforms just aren't even interesting - they
> might as well use older kernels if they are the only thing really
> arguing for HIGHMEM at all.

Agreed, but it does need a little time to get there. My best guess is three
to five years from now we can remove it for good, provided a few things
happen first:

1. The remaining users of TI Keystone2, NXP i.MX6 Quad(Plus) and
  Renesas R8A7790/R8A7791/R8A7742/R8A7743 that use the
  largest memory configuration must have stopped requiring kernel
  version updates.
  These were all introduced at the peak of 32-bit Arm embedded
  systems around 2013, but they have long (15+ year) product
  life cycles and users pick these because they do promise kernel
  updates, unlike other SoC families that get stuck on old vendor
  kernels much earlier.

2. The plan to add a CONFIG_VMSPLIT_4G_4G option on arch/arm/
  must work out. We don't have all the code yet, and so far it looks
  like it's going to be a bit ugly and a bit slow but not nearly as ugly
  or slow as it was on x86 20 years ago.
  This would cover Cortex-A7/A15 systems from 2GB to 4GB,
  which are still fairly common and need to keep using mainline
  kernels for much longer than the system in point 1.
  It won't help on Cortex-A9 machines with 2GB, which I hope can
  migrate CONFIG_VMSPLIT_2G_OPT as a fallback.

3. No new systems with larger memory must appear. I noticed that
  e.g. the newly introduced Rockchips RV1109 and Allwinner
  A50/R328/V316 support LP-DDR4 on a dual Cortex-A7, but I
  hope that nobody will in practice combine a $2 SoC with a $15
  memory chip.
  Most other 32-bit chips use DDR3, which tends to prohibit
  configurations over 4GB in new designs, with the cheapest
  ones limited to 512MB (a single 256Mx16 chip) and the
  high end having already moved on to 64 bit.

Regarding 32-bit non-Arm systems, support for the MIPS-based
Baikal T1 was merged earlier this year and is used in Russian
PC style systems with up to 8GB.
There are also some users on 10+ year old 32-bit netbooks or
business laptops, both x86 and Apple G4.
The longest-lived 32-bit embedded systems with large memory
(other than Arm) are probably NXP QorIQ P20xx/P40xx used in
military VME bus systems, and low-end embedded systems based
on Vortex86.
I'm less worried about all of these because upstream kernel
support for ppc32 and x86-32 is already bitrotting and they will
likely get stuck on the last working kernel before the
TI/Renesas/NXP Arm systems do.

       Arnd

^ permalink raw reply

* RE: [PATCH v2 31/39] docs: ABI: cleanup several ABI documents
From: Peter Chen @ 2020-10-30 23:23 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Linux Doc Mailing List
  Cc: Andrew Lunn, Linus Walleij, Jerry Snitselaar,
	dri-devel@lists.freedesktop.org, Pavel Machek, Christian Gromm,
	ceph-devel@vger.kernel.org, Bart Van Assche,
	linux-acpi@vger.kernel.org, Danil Kipnis, Samuel Thibault,
	Guenter Roeck, Ohad Ben-Cohen, linux-pm@vger.kernel.org,
	Alexander Antonov, Dan Murphy, Thomas Gleixner, Stefan Achatz,
	Konstantin Khlebnikov, Jingoo Han, Rafael J. Wysocki,
	linux-kernel@vger.kernel.org, Greg Kroah-Hartman, Hans Verkuil,
	Wu Hao, Vineela Tummalapalli, Peter Meerwald-Stadler,
	Srinivas Kandagatla, Hanjun Guo, Oleh Kravchenko,
	Lars-Peter Clausen, Andy Shevchenko, Saravana Kannan,
	Anton Vorontsov, Marek Marczykowski-Górecki,
	linux-stm32@st-md-mailman.stormreply.com, Bartosz Golaszewski,
	Len Brown, Alexandre Torgue, Michael Hennerich, Suzuki K Poulose,
	coresight@lists.linaro.org, linux-media@vger.kernel.org,
	Frederic Barrat, Bjorn Helgaas, Jaegeuk Kim, Boris Ostrovsky,
	Mika Westerberg, linux-arm-kernel@lists.infradead.org,
	Oded Gabbay, Tony Luck, Mathieu Poirier, Boris Brezillon,
	PrasannaKumar Muralidharan, linux-gpio@vger.kernel.org,
	Dongsheng Yang, linux-f2fs-devel@lists.sourceforge.net,
	Jarkko Sakkinen, Maxime Coquelin, Vaibhav Jain, Pali Rohár,
	Jonathan Cameron, Heiner Kallweit, Gautham R. Shenoy,
	Cezary Rojewski, Mario Limonciello, Alexander Shishkin, Tom Rix,
	linux-fpga@vger.kernel.org, Rasmus Villemoes, Jonas Meurer,
	Daniel Thompson, Florian Fainelli, Mark Gross, Jonathan Corbet,
	Ilya Dryomov, Jack Wang, Kees Cook, Dan Williams, Kranthi Kuntala,
	Dmitry Torokhov, Sebastian Reichel, Colin Cross,
	Enric Balletbo i Serra, Roman Sudarikov, Roger Pau Monné,
	Peter Zijlstra (Intel), linux-remoteproc@vger.kernel.org,
	Bjorn Andersson, linux-i3c@lists.infradead.org, Lee Jones,
	Russell King, Marek Behún, Jason Gunthorpe, Pawan Gupta,
	Mike Leach, Andrew Donnellan, Kajol Jain, Chao Yu, Johan Hovold,
	Andreas Klinger, Jonathan Cameron, David Sterba, Jens Axboe,
	netdev@vger.kernel.org, linux-iio@vger.kernel.org, Asutosh Das,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <5bc78e5b68ed1e9e39135173857cb2e753be868f.1604042072.git.mchehab+huawei@kernel.org>

Acked-by: Peter Chen <peter.chen@nxp.com> 

For:
Documentation/ABI/testing/usb-charger-uevent

Peter

^ permalink raw reply

* Re: [PATCH v2 20/39] docs: ABI: testing: make the files compatible with ReST output
From: Frederic Barrat @ 2020-10-30 17:26 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Linux Doc Mailing List
  Cc: Gautham R. Shenoy, Jason A. Donenfeld, Heikki Krogerus,
	Peter Meerwald-Stadler, Petr Mladek, Mario Limonciello,
	Alexander Shishkin, Tom Rix, Jonathan Cameron, Alexandre Belloni,
	Mimi Zohar, Sebastian Reichel, linux-mm, Bruno Meneguele,
	Vishal Verma, Pavel Machek, Hanjun Guo, Guenter Roeck, netdev,
	Oleh Kravchenko, Dan Williams, Andrew Donnellan,
	Javier González, Andy Shevchenko, Mark Gross, linux-acpi,
	Jonathan Corbet, Chunyan Zhang, Nayna Jain, linux-stm32,
	Lakshmi Ramasubramanian, Ludovic Desroches, Pawan Gupta,
	linux-arm-kernel, Orson Zhai, Niklas Cassel, Len Brown,
	Juergen Gross, linuxppc-dev, Mika Westerberg, Alexandre Torgue,
	linux-pm, linux-kernel, Richard Cochran, Oded Gabbay, Baolin Wang,
	Lars-Peter Clausen, Dan Murphy, xen-devel, Philippe Bergheaud,
	Boris Ostrovsky, Fabrice Gasnier, Benson Leung,
	Konstantin Khlebnikov, Jens Axboe, Felipe Balbi, Kranthi Kuntala,
	Martin K. Petersen, Greg Kroah-Hartman, linux-usb,
	Rafael J. Wysocki, Nicolas Ferre, linux-iio, Thinh Nguyen,
	Sergey Senozhatsky, Stefano Stabellini, Thomas Gleixner,
	Leonid Maksymchuk, Maxime Coquelin, Johannes Thumshirn,
	Enric Balletbo i Serra, Vaibhav Jain, Vineela Tummalapalli,
	Peter Rosin, Jonathan Cameron, Mike Kravetz
In-Reply-To: <58cf3c2d611e0197fb215652719ebd82ca2658db.1604042072.git.mchehab+huawei@kernel.org>



Le 30/10/2020 à 08:40, Mauro Carvalho Chehab a écrit :
> Some files over there won't parse well by Sphinx.
> 
> Fix them.
> 
> Acked-by: Jonathan Cameron<Jonathan.Cameron@huawei.com>  # for IIO
> Signed-off-by: Mauro Carvalho Chehab<mchehab+huawei@kernel.org>
> ---
...
>   Documentation/ABI/testing/sysfs-class-cxl     |  15 +-
...
>   Documentation/ABI/testing/sysfs-class-ocxl    |   3 +


Patches 20, 28 and 31 look good for cxl and ocxl.
Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>

   Fred

^ permalink raw reply

* Re: [PATCH v2 31/39] docs: ABI: cleanup several ABI documents
From: Mathieu Poirier @ 2020-10-30 16:42 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Andrew Lunn, Peter Chen, Linus Walleij, Jerry Snitselaar,
	dri-devel, Pavel Machek, Christian Gromm, ceph-devel,
	Bart Van Assche, linux-acpi, Danil Kipnis, Samuel Thibault,
	Guenter Roeck, Ohad Ben-Cohen, linux-pm, Alexander Antonov,
	Dan Murphy, Thomas Gleixner, Stefan Achatz, Konstantin Khlebnikov,
	Jingoo Han, Rafael J. Wysocki, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Hans Verkuil, Wu Hao, Vineela Tummalapalli,
	Peter Meerwald-Stadler, Linux Doc Mailing List,
	Srinivas Kandagatla, Hanjun Guo, Oleh Kravchenko,
	Lars-Peter Clausen, Andy Shevchenko, Saravana Kannan,
	Anton Vorontsov, Marek Marczykowski-Górecki, linux-stm32,
	Bartosz Golaszewski, Len Brown, Alexandre Torgue,
	Michael Hennerich, Suzuki K Poulose, Coresight ML, linux-media,
	Frederic Barrat, Bjorn Helgaas, Jaegeuk Kim, Boris Ostrovsky,
	Mika Westerberg, linux-arm-kernel, Oded Gabbay, Tony Luck,
	Boris Brezillon, PrasannaKumar Muralidharan, linux-gpio,
	Dongsheng Yang, linux-f2fs-devel, Jarkko Sakkinen,
	Maxime Coquelin, Vaibhav Jain, Pali Rohár, Jonathan Cameron,
	Heiner Kallweit, Gautham R. Shenoy, Cezary Rojewski,
	Mario Limonciello, Alexander Shishkin, Tom Rix, linux-fpga,
	Rasmus Villemoes, Jonas Meurer, Daniel Thompson, Florian Fainelli,
	Mark Gross, Jonathan Corbet, Ilya Dryomov, Jack Wang, Kees Cook,
	Dan Williams, Kranthi Kuntala, Dmitry Torokhov, Sebastian Reichel,
	Colin Cross, Enric Balletbo i Serra, Roman Sudarikov,
	Roger Pau Monné, Peter Zijlstra (Intel), linux-remoteproc,
	Bjorn Andersson, linux-i3c, Lee Jones, Russell King,
	Marek Behún, Jason Gunthorpe, Pawan Gupta, Mike Leach,
	Andrew Donnellan, Kajol Jain, Chao Yu, Johan Hovold,
	Andreas Klinger, Jonathan Cameron, David Sterba, Jens Axboe,
	netdev, linux-iio, Asutosh Das, linuxppc-dev
In-Reply-To: <5bc78e5b68ed1e9e39135173857cb2e753be868f.1604042072.git.mchehab+huawei@kernel.org>

On Fri, 30 Oct 2020 at 01:41, Mauro Carvalho Chehab
<mchehab+huawei@kernel.org> wrote:
>
> There are some ABI documents that, while they don't generate
> any warnings, they have issues when parsed by get_abi.pl script
> on its output result.
>
> Address them, in order to provide a clean output.
>
> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> #for IIO
> Reviewed-by: Tom Rix <trix@redhat.com> # for fpga-manager
> Reviewed-By: Kajol Jain<kjain@linux.ibm.com> # for sysfs-bus-event_source-devices-hv_gpci and sysfs-bus-event_source-devices-hv_24x7
> Acked-by: Oded Gabbay <oded.gabbay@gmail.com> # for Habanalabs
> Acked-by: Vaibhav Jain <vaibhav@linux.ibm.com> # for sysfs-bus-papr-pmem
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---

>  .../testing/sysfs-bus-coresight-devices-etb10 |   5 +-

For the CoreSight part:

Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>

^ permalink raw reply

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-10-30 23:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juri Lelli, linux-xtensa, Peter Zijlstra,
	Sebastian Andrzej Siewior, linux-mips, Ben Segall, Linux-MM,
	Guo Ren, Matthew Wilcox, linux-sparc, Vincent Chen, Ingo Molnar,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Christoph Hellwig, David Airlie,
	Mel Gorman, open list:SYNOPSYS ARC ARCHITECTURE, Ard Biesheuvel,
	Paul McKenney, linuxppc-dev, Steven Rostedt, Greentime Hu,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Max Filippov, Vineet Gupta, LKML,
	Arnd Bergmann, Daniel Vetter, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <87pn4zl2ia.fsf@nanos.tec.linutronix.de>

On Sat, Oct 31 2020 at 00:26, Thomas Gleixner wrote:

> On Fri, Oct 30 2020 at 15:46, Linus Torvalds wrote:
>> On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> To me, your patch series has two big advantages:
>>
>>  - more common code
>>
>>  - kmap_local() becomes more of a no-op
>>
>> and the last thing we want is to expand on kmap.
>
> Happy to go with that.
>
> While trying to document the mess, I just stumbled over the abuse of
> kmap_atomic_prot() in
>
> drivers/gpu/drm/ttm/ttm_bo_util.c:      dst = kmap_atomic_prot(d, prot);
> drivers/gpu/drm/ttm/ttm_bo_util.c:      src = kmap_atomic_prot(s, prot);
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->dst_pages[dst_page],
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->src_pages[src_page],
>
> For !HIGHMEM kmap_atomic_prot() just ignores the 'prot' argument and
> returns the page address. 
>
> Moar patches to be written ... Sigh!

Or not. This is actually correct by some definition of correct. For
the non highmem case pgprot is set via the set_memory_*() functions and
this just handles the highmem case.

Comments are overrrated...

^ permalink raw reply

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-10-30 23:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juri Lelli, linux-xtensa, Peter Zijlstra,
	Sebastian Andrzej Siewior, linux-mips, Ben Segall, Linux-MM,
	Guo Ren, Matthew Wilcox, linux-sparc, Vincent Chen, Ingo Molnar,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Christoph Hellwig, David Airlie,
	Mel Gorman, open list:SYNOPSYS ARC ARCHITECTURE, Ard Biesheuvel,
	Paul McKenney, linuxppc-dev, Steven Rostedt, Greentime Hu,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Max Filippov, Vineet Gupta, LKML,
	Arnd Bergmann, Daniel Vetter, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <CAHk-=wjjO9BtTUAsLraqZqdzaPGJ-qvubZfwUsmRUX896eHcGw@mail.gmail.com>

On Fri, Oct 30 2020 at 15:46, Linus Torvalds wrote:
> On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> To me, your patch series has two big advantages:
>
>  - more common code
>
>  - kmap_local() becomes more of a no-op
>
> and the last thing we want is to expand on kmap.

Happy to go with that.

While trying to document the mess, I just stumbled over the abuse of
kmap_atomic_prot() in

drivers/gpu/drm/ttm/ttm_bo_util.c:      dst = kmap_atomic_prot(d, prot);
drivers/gpu/drm/ttm/ttm_bo_util.c:      src = kmap_atomic_prot(s, prot);
drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->dst_pages[dst_page],
drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->src_pages[src_page],

For !HIGHMEM kmap_atomic_prot() just ignores the 'prot' argument and
returns the page address. 

Moar patches to be written ... Sigh!

Thanks,

        tglx





^ permalink raw reply

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Linus Torvalds @ 2020-10-30 22:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, linux-xtensa, Peter Zijlstra,
	Sebastian Andrzej Siewior, linux-mips, Ben Segall, Linux-MM,
	Guo Ren, Matthew Wilcox, linux-sparc, Vincent Chen, Ingo Molnar,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Christoph Hellwig, David Airlie,
	Mel Gorman, open list:SYNOPSYS ARC ARCHITECTURE, Ard Biesheuvel,
	Paul McKenney, linuxppc-dev, Steven Rostedt, Greentime Hu,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Max Filippov, Vineet Gupta, LKML,
	Arnd Bergmann, Daniel Vetter, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <87sg9vl59i.fsf@nanos.tec.linutronix.de>

On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> While at it I might have a look at that debug request from Willy in the
> other end of this thread. Any comment on that?
>
>  https://lore.kernel.org/r/87k0v7mrrd.fsf@nanos.tec.linutronix.de

I do think that it would be nice to have a debug mode, particularly
since over the last few years we've really lost a lot of HIGHMEM
coverage (to the point that I've wondered how worthwhile it really is
to support at all any more - I think it's Arnd who argued that it's
mainly some embedded ARM variants that will want it for the forseeable
future).

So I'm honestly somewhat torn. I think HIGHMEM is dying, and yes that
argues for "non-HIGHMEM had better have some debug coverage", but at
the same time I think it doesn't even really matter any more. At some
point those embedded ARM platforms just aren't even interesting - they
might as well use older kernels if they are the only thing really
arguing for HIGHMEM at all.

This is one reason why I'd like the new kmap_local() to be a no-op,
and I'd prefer for it to have no other side effects - because I want
to be ready to remove it entirely some day. And if we end up having
some transition where people start rewriting "kmap_atomic()" to be
"kmap_local() + explicit preemption disable", then I think that would
be a good step on that whole "kmap will eventually go away" path.

But I do *not* believe that we need to add _so_ much debug support
that we'd catch Willy's "more than one page" case. And I absolutely do
not believe for a second that we should start caring about compound
pages. NO. kmap() is almost dead already, we're not making it worse.

To me, your patch series has two big advantages:

 - more common code

 - kmap_local() becomes more of a no-op

and the last thing we want is to expand on kmap.

           Linus

^ permalink raw reply

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-10-30 22:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juri Lelli, linux-xtensa, Peter Zijlstra,
	Sebastian Andrzej Siewior, linux-mips, Ben Segall, Linux-MM,
	Guo Ren, Matthew Wilcox, linux-sparc, Vincent Chen, Ingo Molnar,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Christoph Hellwig, David Airlie,
	Mel Gorman, open list:SYNOPSYS ARC ARCHITECTURE, Ard Biesheuvel,
	Paul McKenney, linuxppc-dev, Steven Rostedt, Greentime Hu,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Max Filippov, Vineet Gupta, LKML,
	Arnd Bergmann, Daniel Vetter, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <CAHk-=whsJv0bwWRVZHsLoSe48ykAea6T7Oi=G+r8ckLrZ0YUpg@mail.gmail.com>

On Fri, Oct 30 2020 at 13:28, Linus Torvalds wrote:
> On Fri, Oct 30, 2020 at 2:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> But then we really should not name it kmap_local. 'local' suggests
>> locality, think local_irq*, local_bh* ... kmap_task would be more
>> accurate then.
>
> So the main reason I'd like to see it is because I think on a
> non-highmem machine, the new kmap should be a complete no-op. IOW,
> we'd make sure that there are no costs, no need to increment any
> "restrict migration" counts etc.

Fair enough.

> It's been a bit of a pain to have kmap_atomic() have magical side
> semantics that people might then depend on.

kmap_atomic() will still have the side semantics :)

> I think "local" could still work as a name, because it would have to
> be thread-local (and maybe we'd want a debug mode where that gets
> verified, as actual HIGHMEM machines are getting rare).
>
> I'd avoid "task", because that implies (to me, at least) that it
> wouldn't be good for interrupts etc that don't have a task context.
>
> I think the main issue is that it has to be released in the same
> context as it was created (ie no passing those things around to other
> contexts). I think "local" is fine for that, but I could imagine other
> names. The ones that come to mind are somewhat cumbersome, though
> ("shortterm" or "local_ctx" or something along those lines).

Yeah, not really intuitive either.

Let's stick with _local and add proper function documentation which
clearly says, that the side effect of non-migratability applies only for
the 32bit highmem case in order to make it work at all.

So code which needs CPU locality cannot rely on it and we have enough
debug stuff to catch something like:

    kmap_local()
    this_cpu_write(....)
    kunmap_local()

Let me redo the pile.

While at it I might have a look at that debug request from Willy in the
other end of this thread. Any comment on that?

 https://lore.kernel.org/r/87k0v7mrrd.fsf@nanos.tec.linutronix.de

Thanks,

        tglx


      

^ permalink raw reply

* [PATCH 05/11 v2] kprobes/ftrace: Add recursion protection to the ftrace callback
From: Steven Rostedt @ 2020-10-30 21:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, James E.J. Bottomley, Guo Ren, linux-csky,
	H. Peter Anvin, Miroslav Benes, Ingo Molnar, linux-s390,
	Helge Deller, x86, Anil S Keshavamurthy, Christian Borntraeger,
	Naveen N. Rao, Petr Mladek, Vasily Gorbik, Heiko Carstens,
	Jiri Kosina, Borislav Petkov, Josh Poimboeuf, Thomas Gleixner,
	linux-parisc, Masami Hiramatsu, Paul Mackerras, Andrew Morton,
	linuxppc-dev, David S. Miller
In-Reply-To: <20201030213142.096102821@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.

The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless its ftrace_ops states
otherwise.

Link: https://lkml.kernel.org/r/20201028115613.140212174@goodmis.org

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 arch/csky/kernel/probes/ftrace.c     | 12 ++++++++++--
 arch/parisc/kernel/ftrace.c          | 13 +++++++++++--
 arch/powerpc/kernel/kprobes-ftrace.c | 11 ++++++++++-
 arch/s390/kernel/ftrace.c            | 13 +++++++++++--
 arch/x86/kernel/kprobes/ftrace.c     | 12 ++++++++++--
 5 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index 5264763d05be..5eb2604fdf71 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -13,16 +13,21 @@ int arch_check_ftrace_location(struct kprobe *p)
 void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 			   struct ftrace_ops *ops, struct pt_regs *regs)
 {
+	int bit;
 	bool lr_saver = false;
 	struct kprobe *p;
 	struct kprobe_ctlblk *kcb;
 
-	/* Preempt is disabled by ftrace */
+	bit = ftrace_test_recursion_trylock();
+	if (bit < 0)
+		return;
+
+	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)ip);
 	if (!p) {
 		p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
 		if (unlikely(!p) || kprobe_disabled(p))
-			return;
+			goto out;
 		lr_saver = true;
 	}
 
@@ -56,6 +61,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		 */
 		__this_cpu_write(current_kprobe, NULL);
 	}
+out:
+	preempt_enable_notrace();
+	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 63e3ecb9da81..4b1fdf15662c 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -208,13 +208,19 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 {
 	struct kprobe_ctlblk *kcb;
 	struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);
+	int bit;
 
-	if (unlikely(!p) || kprobe_disabled(p))
+	bit = ftrace_test_recursion_trylock();
+	if (bit < 0)
 		return;
 
+	preempt_disable_notrace();
+	if (unlikely(!p) || kprobe_disabled(p))
+		goto out;
+
 	if (kprobe_running()) {
 		kprobes_inc_nmissed_count(p);
-		return;
+		goto out;
 	}
 
 	__this_cpu_write(current_kprobe, p);
@@ -235,6 +241,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		}
 	}
 	__this_cpu_write(current_kprobe, NULL);
+out:
+	preempt_enable_notrace();
+	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 972cb28174b2..5df8d50c65ae 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -18,10 +18,16 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 {
 	struct kprobe *p;
 	struct kprobe_ctlblk *kcb;
+	int bit;
 
+	bit = ftrace_test_recursion_trylock();
+	if (bit < 0)
+		return;
+
+	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)nip);
 	if (unlikely(!p) || kprobe_disabled(p))
-		return;
+		goto out;
 
 	kcb = get_kprobe_ctlblk();
 	if (kprobe_running()) {
@@ -52,6 +58,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 		 */
 		__this_cpu_write(current_kprobe, NULL);
 	}
+out:
+	preempt_enable_notrace();
+	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index b388e87a08bf..88466d7fb6b2 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -202,13 +202,19 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 {
 	struct kprobe_ctlblk *kcb;
 	struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);
+	int bit;
 
-	if (unlikely(!p) || kprobe_disabled(p))
+	bit = ftrace_test_recursion_trylock();
+	if (bit < 0)
 		return;
 
+	preempt_disable_notrace();
+	if (unlikely(!p) || kprobe_disabled(p))
+		goto out;
+
 	if (kprobe_running()) {
 		kprobes_inc_nmissed_count(p);
-		return;
+		goto out;
 	}
 
 	__this_cpu_write(current_kprobe, p);
@@ -228,6 +234,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		}
 	}
 	__this_cpu_write(current_kprobe, NULL);
+out:
+	preempt_enable_notrace();
+	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 681a4b36e9bb..a40a6cdfcca3 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -18,11 +18,16 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 {
 	struct kprobe *p;
 	struct kprobe_ctlblk *kcb;
+	int bit;
 
-	/* Preempt is disabled by ftrace */
+	bit = ftrace_test_recursion_trylock();
+	if (bit < 0)
+		return;
+
+	preempt_disable_notrace();
 	p = get_kprobe((kprobe_opcode_t *)ip);
 	if (unlikely(!p) || kprobe_disabled(p))
-		return;
+		goto out;
 
 	kcb = get_kprobe_ctlblk();
 	if (kprobe_running()) {
@@ -52,6 +57,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		 */
 		__this_cpu_write(current_kprobe, NULL);
 	}
+out:
+	preempt_enable_notrace();
+	ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
-- 
2.28.0



^ permalink raw reply related

* [PATCH 11/11 v2] ftrace: Add recording of functions that caused recursion
From: Steven Rostedt @ 2020-10-30 21:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Anton Vorontsov, linux-doc, Peter Zijlstra,
	Sebastian Andrzej Siewior, Kamalesh Babulal, James E.J. Bottomley,
	Guo Ren, H. Peter Anvin, live-patching, Miroslav Benes,
	Ingo Molnar, linux-s390, Joe Lawrence, Jonathan Corbet,
	Mauro Carvalho Chehab, Helge Deller, x86, linux-csky,
	Christian Borntraeger, Petr Mladek, Kees Cook, Vasily Gorbik,
	Heiko Carstens, Jiri Kosina, Borislav Petkov, Josh Poimboeuf,
	Thomas Gleixner, Tony Luck, linux-parisc, Masami Hiramatsu,
	Colin Cross, Paul Mackerras, Andrew Morton, linuxppc-dev
In-Reply-To: <20201030213142.096102821@goodmis.org>

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

This adds CONFIG_FTRACE_RECORD_RECURSION that will record to a file
"recursed_functions" all the functions that caused recursion while a
callback to the function tracer was running.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: live-patching@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 Documentation/trace/ftrace-uses.rst   |   6 +-
 arch/csky/kernel/probes/ftrace.c      |   2 +-
 arch/parisc/kernel/ftrace.c           |   2 +-
 arch/powerpc/kernel/kprobes-ftrace.c  |   2 +-
 arch/s390/kernel/ftrace.c             |   2 +-
 arch/x86/kernel/kprobes/ftrace.c      |   2 +-
 fs/pstore/ftrace.c                    |   2 +-
 include/linux/trace_recursion.h       |  32 +++-
 kernel/livepatch/patch.c              |   2 +-
 kernel/trace/Kconfig                  |  25 +++
 kernel/trace/Makefile                 |   1 +
 kernel/trace/ftrace.c                 |   4 +-
 kernel/trace/trace_event_perf.c       |   2 +-
 kernel/trace/trace_functions.c        |   2 +-
 kernel/trace/trace_output.c           |   6 +-
 kernel/trace/trace_output.h           |   1 +
 kernel/trace/trace_recursion_record.c | 220 ++++++++++++++++++++++++++
 17 files changed, 293 insertions(+), 20 deletions(-)
 create mode 100644 kernel/trace/trace_recursion_record.c

diff --git a/Documentation/trace/ftrace-uses.rst b/Documentation/trace/ftrace-uses.rst
index 86cd14b8e126..5981d5691745 100644
--- a/Documentation/trace/ftrace-uses.rst
+++ b/Documentation/trace/ftrace-uses.rst
@@ -118,7 +118,7 @@ can help in this regard. If you start your code with:
 
 	int bit;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
@@ -130,7 +130,9 @@ The code in between will be safe to use, even if it ends up calling a
 function that the callback is tracing. Note, on success,
 ftrace_test_recursion_trylock() will disable preemption, and the
 ftrace_test_recursion_unlock() will enable it again (if it was previously
-enabled).
+enabled). The instruction pointer (ip) and its parent (parent_ip) is passed to
+ftrace_test_recursion_trylock() to record where the recursion happened
+(if CONFIG_FTRACE_RECORD_RECURSION is set).
 
 Alternatively, if the FTRACE_OPS_FL_RECURSION flag is set on the ftrace_ops
 (as explained below), then a helper trampoline will be used to test
diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index 5eb2604fdf71..f30b179924ef 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -18,7 +18,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 	struct kprobe *p;
 	struct kprobe_ctlblk *kcb;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 4b1fdf15662c..8b0ed7c5a4ab 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -210,7 +210,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 	struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);
 	int bit;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 5df8d50c65ae..fdfee39938ea 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -20,7 +20,7 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 	struct kprobe_ctlblk *kcb;
 	int bit;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(nip, parent_nip);
 	if (bit < 0)
 		return;
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index 88466d7fb6b2..a1556333d481 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -204,7 +204,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 	struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);
 	int bit;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index a40a6cdfcca3..954d930a7127 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -20,7 +20,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 	struct kprobe_ctlblk *kcb;
 	int bit;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
diff --git a/fs/pstore/ftrace.c b/fs/pstore/ftrace.c
index 816210fc5d3a..adb0935eb062 100644
--- a/fs/pstore/ftrace.c
+++ b/fs/pstore/ftrace.c
@@ -41,7 +41,7 @@ static void notrace pstore_ftrace_call(unsigned long ip,
 	if (unlikely(oops_in_progress))
 		return;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index ac3d73484cb2..1cba5fe8777a 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -142,7 +142,28 @@ static __always_inline int trace_get_context_bit(void)
 			pc & HARDIRQ_MASK ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
 }
 
-static __always_inline int trace_test_and_set_recursion(int start, int max)
+#ifdef CONFIG_FTRACE_RECORD_RECURSION
+extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
+/*
+* The paranoid_test check can cause dropped reports (unlikely), but
+* if the recursion is common, it will likely still be recorded later.
+* But the paranoid_test is needed to make sure we don't crash.
+*/
+# define do_ftrace_record_recursion(ip, pip)				\
+	do {								\
+		static atomic_t paranoid_test;				\
+		if (!atomic_read(&paranoid_test)) {			\
+			atomic_inc(&paranoid_test);			\
+			ftrace_record_recursion(ip, pip);		\
+			atomic_dec(&paranoid_test);			\
+		}							\
+	} while (0)
+#else
+# define do_ftrace_record_recursion(ip, pip)	do { } while (0)
+#endif
+
+static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
+							int start, int max)
 {
 	unsigned int val = current->trace_recursion;
 	int bit;
@@ -158,8 +179,10 @@ static __always_inline int trace_test_and_set_recursion(int start, int max)
 		 * a switch between contexts. Allow for a single recursion.
 		 */
 		bit = TRACE_TRANSITION_BIT;
-		if (trace_recursion_test(bit))
+		if (trace_recursion_test(bit)) {
+			do_ftrace_record_recursion(ip, pip);
 			return -1;
+		}
 		trace_recursion_set(bit);
 		barrier();
 		return bit + 1;
@@ -199,9 +222,10 @@ static __always_inline void trace_clear_recursion(int bit)
  * Returns: -1 if a recursion happened.
  *           >= 0 if no recursion
  */
-static __always_inline int ftrace_test_recursion_trylock(void)
+static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
+							 unsigned long parent_ip)
 {
-	return trace_test_and_set_recursion(TRACE_FTRACE_START, TRACE_FTRACE_MAX);
+	return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START, TRACE_FTRACE_MAX);
 }
 
 /**
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 15480bf3ce88..875c5dbbdd33 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -49,7 +49,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 
 	ops = container_of(fops, struct klp_ops, fops);
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (WARN_ON_ONCE(bit < 0))
 		return;
 	/*
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a4020c0b4508..9b11c096d139 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -727,6 +727,31 @@ config TRACE_EVAL_MAP_FILE
 
 	If unsure, say N.
 
+config FTRACE_RECORD_RECURSION
+	bool "Record functions that recurse in function tracing"
+	depends on FUNCTION_TRACER
+	help
+	  All callbacks that attach to the function tracing have some sort
+	  of protection against recursion. Even though the protection exists,
+	  it adds overhead. This option will create a file in the tracefs
+	  file system called "recursed_functions" that will list the functions
+	  that triggered a recursion.
+
+	  This will add more overhead to cases that have recursion.
+
+	  If unsure, say N
+
+config FTRACE_RECORD_RECURSION_SIZE
+	int "Max number of recursed functions to record"
+	default	128
+	depends on FTRACE_RECORD_RECURSION
+	help
+	  This defines the limit of number of functions that can be
+	  listed in the "recursed_functions" file, that lists all
+	  the functions that caused a recursion to happen.
+	  This file can be reset, but the limit can not change in
+	  size at runtime.
+
 config GCOV_PROFILE_FTRACE
 	bool "Enable GCOV profiling on ftrace subsystem"
 	depends on GCOV_KERNEL
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index e153be351548..7e44cea89fdc 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o
 obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o
 obj-$(CONFIG_BOOTTIME_TRACING) += trace_boot.o
+obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 39f2bba89b76..03aad2b5cd5e 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6918,7 +6918,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 	struct ftrace_ops *op;
 	int bit;
 
-	bit = trace_test_and_set_recursion(TRACE_LIST_START, TRACE_LIST_MAX);
+	bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START, TRACE_LIST_MAX);
 	if (bit < 0)
 		return;
 
@@ -6993,7 +6993,7 @@ static void ftrace_ops_assist_func(unsigned long ip, unsigned long parent_ip,
 {
 	int bit;
 
-	bit = trace_test_and_set_recursion(TRACE_LIST_START, TRACE_LIST_MAX);
+	bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START, TRACE_LIST_MAX);
 	if (bit < 0)
 		return;
 
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a2b9fddb8148..1b202e28dfaa 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -447,7 +447,7 @@ perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
 	if ((unsigned long)ops->private != smp_processor_id())
 		return;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 89c414ce1388..646eda6c44a5 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -141,7 +141,7 @@ function_trace_call(unsigned long ip, unsigned long parent_ip,
 	if (unlikely(!tr->function_enabled))
 		return;
 
-	bit = ftrace_test_recursion_trylock();
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
 	if (bit < 0)
 		return;
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 000e9dc224c6..92b1575ae0ca 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -353,8 +353,8 @@ static inline const char *kretprobed(const char *name)
 }
 #endif /* CONFIG_KRETPROBES */
 
-static void
-seq_print_sym(struct trace_seq *s, unsigned long address, bool offset)
+void
+trace_seq_print_sym(struct trace_seq *s, unsigned long address, bool offset)
 {
 #ifdef CONFIG_KALLSYMS
 	char str[KSYM_SYMBOL_LEN];
@@ -420,7 +420,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
 		goto out;
 	}
 
-	seq_print_sym(s, ip, sym_flags & TRACE_ITER_SYM_OFFSET);
+	trace_seq_print_sym(s, ip, sym_flags & TRACE_ITER_SYM_OFFSET);
 
 	if (sym_flags & TRACE_ITER_SYM_ADDR)
 		trace_seq_printf(s, " <" IP_FMT ">", ip);
diff --git a/kernel/trace/trace_output.h b/kernel/trace/trace_output.h
index 2f742b74e7e6..4c954636caf0 100644
--- a/kernel/trace/trace_output.h
+++ b/kernel/trace/trace_output.h
@@ -16,6 +16,7 @@ extern int
 seq_print_ip_sym(struct trace_seq *s, unsigned long ip,
 		unsigned long sym_flags);
 
+extern void trace_seq_print_sym(struct trace_seq *s, unsigned long address, bool offset);
 extern int trace_print_context(struct trace_iterator *iter);
 extern int trace_print_lat_context(struct trace_iterator *iter);
 
diff --git a/kernel/trace/trace_recursion_record.c b/kernel/trace/trace_recursion_record.c
new file mode 100644
index 000000000000..0523071ca97c
--- /dev/null
+++ b/kernel/trace/trace_recursion_record.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/ftrace.h>
+#include <linux/fs.h>
+
+#include "trace_output.h"
+
+struct recursed_functions {
+	unsigned long		ip;
+	unsigned long		parent_ip;
+};
+
+static struct recursed_functions recursed_functions[CONFIG_FTRACE_RECORD_RECURSION_SIZE];
+static atomic_t nr_records;
+
+/*
+ * Cache the last found function. Yes, updates to this is racey, but
+ * so is memory cache ;-)
+ */
+static unsigned long cached_function;
+
+void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip)
+{
+	int index;
+	int i = 0;
+	unsigned long old;
+
+ again:
+	/* First check the last one recorded */
+	if (ip == cached_function)
+		return;
+
+	index = atomic_read(&nr_records);
+	/* nr_records is -1 when clearing records */
+	smp_mb__after_atomic();
+	if (index < 0)
+		return;
+
+	/* See below */
+	if (i > index)
+		index = i;
+	if (index >= CONFIG_FTRACE_RECORD_RECURSION_SIZE)
+		return;
+
+	for (i = index - 1; i >= 0; i--) {
+		if (recursed_functions[i].ip == ip) {
+			cached_function = ip;
+			return;
+		}
+	}
+
+	cached_function = ip;
+
+	/*
+	 * We only want to add a function if it hasn't been added before.
+	 * Add to the current location before incrementing the count.
+	 * If it fails to add, then increment the index (save in i)
+	 * and try again.
+	 */
+	old = cmpxchg(&recursed_functions[index].ip, 0, ip);
+	if (old != 0) {
+		/* Did something else already added this for us? */
+		if (old == ip)
+			return;
+		/* Try the next location (use i for the next index) */
+		i = index + 1;
+		goto again;
+	}
+
+	recursed_functions[index].parent_ip = parent_ip;
+
+	/*
+	 * It's still possible that we could race with the clearing
+	 *    CPU0                                    CPU1
+	 *    ----                                    ----
+	 *                                       ip = func
+	 *  nr_records = -1;
+	 *  recursed_functions[0] = 0;
+	 *                                       i = -1
+	 *                                       if (i < 0)
+	 *  nr_records = 0;
+	 *  (new recursion detected)
+	 *      recursed_functions[0] = func
+	 *                                            cmpxchg(recursed_functions[0],
+	 *                                                    func, 0)
+	 *
+	 * But the worse that could happen is that we get a zero in
+	 * the recursed_functions array, and it's likely that "func" will
+	 * be recorded again.
+	 */
+	i = atomic_read(&nr_records);
+	smp_mb__after_atomic();
+	if (i < 0)
+		cmpxchg(&recursed_functions[index].ip, ip, 0);
+	else if (i <= index)
+		atomic_cmpxchg(&nr_records, i, index + 1);
+}
+
+static DEFINE_MUTEX(recursed_function_lock);
+static struct trace_seq *tseq;
+
+static void *recursed_function_seq_start(struct seq_file *m, loff_t *pos)
+{
+	void *ret = NULL;
+	int index;
+
+	mutex_lock(&recursed_function_lock);
+	index = atomic_read(&nr_records);
+	if (*pos < index) {
+		ret = &recursed_functions[*pos];
+	}
+
+	tseq = kzalloc(sizeof(*tseq), GFP_KERNEL);
+	if (!tseq)
+		return ERR_PTR(-ENOMEM);
+
+	trace_seq_init(tseq);
+
+	return ret;
+}
+
+static void *recursed_function_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	int index;
+	int p;
+
+	index = atomic_read(&nr_records);
+	p = ++(*pos);
+
+	return p < index ? &recursed_functions[p] : NULL;
+}
+
+static void recursed_function_seq_stop(struct seq_file *m, void *v)
+{
+	kfree(tseq);
+	mutex_unlock(&recursed_function_lock);
+}
+
+static int recursed_function_seq_show(struct seq_file *m, void *v)
+{
+	struct recursed_functions *record = v;
+	int ret = 0;
+
+	if (record) {
+		trace_seq_print_sym(tseq, record->parent_ip, true);
+		trace_seq_puts(tseq, ":\t");
+		trace_seq_print_sym(tseq, record->ip, true);
+		trace_seq_putc(tseq, '\n');
+		ret = trace_print_seq(m, tseq);
+	}
+
+	return ret;
+}
+
+static const struct seq_operations recursed_function_seq_ops = {
+	.start  = recursed_function_seq_start,
+	.next   = recursed_function_seq_next,
+	.stop   = recursed_function_seq_stop,
+	.show   = recursed_function_seq_show
+};
+
+static int recursed_function_open(struct inode *inode, struct file *file)
+{
+	int ret = 0;
+
+	mutex_lock(&recursed_function_lock);
+	/* If this file was opened for write, then erase contents */
+	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
+		/* disable updating records */
+		atomic_set(&nr_records, -1);
+		smp_mb__after_atomic();
+		memset(recursed_functions, 0, sizeof(recursed_functions));
+		smp_wmb();
+		/* enable them again */
+		atomic_set(&nr_records, 0);
+	}
+	if (file->f_mode & FMODE_READ)
+		ret = seq_open(file, &recursed_function_seq_ops);
+	mutex_unlock(&recursed_function_lock);
+
+	return ret;
+}
+
+static ssize_t recursed_function_write(struct file *file,
+				       const char __user *buffer,
+				       size_t count, loff_t *ppos)
+{
+	return count;
+}
+
+static int recursed_function_release(struct inode *inode, struct file *file)
+{
+	if (file->f_mode & FMODE_READ)
+		seq_release(inode, file);
+	return 0;
+}
+
+static const struct file_operations recursed_functions_fops = {
+	.open           = recursed_function_open,
+	.write		= recursed_function_write,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = recursed_function_release,
+};
+
+__init static int create_recursed_functions(void)
+{
+	struct dentry *dentry;
+
+	dentry = trace_create_file("recursed_functions", 0644, NULL, NULL,
+				   &recursed_functions_fops);
+	if (!dentry)
+		pr_warn("WARNING: Failed to create recursed_functions\n");
+	return 0;
+}
+
+fs_initcall(create_recursed_functions);
-- 
2.28.0



^ permalink raw reply related

* Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends
From: Linus Torvalds @ 2020-10-30 20:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, linux-xtensa, Peter Zijlstra,
	Sebastian Andrzej Siewior, linux-mips, Ben Segall, Linux-MM,
	Guo Ren, linux-sparc, Vincent Chen, Ingo Molnar, linux-arch,
	Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Christoph Hellwig, David Airlie,
	Mel Gorman, open list:SYNOPSYS ARC ARCHITECTURE, Ard Biesheuvel,
	Paul McKenney, linuxppc-dev, Steven Rostedt, Greentime Hu,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Max Filippov, Vineet Gupta, LKML,
	Arnd Bergmann, Daniel Vetter, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <87blgknjcw.fsf@nanos.tec.linutronix.de>

On Fri, Oct 30, 2020 at 2:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> But then we really should not name it kmap_local. 'local' suggests
> locality, think local_irq*, local_bh* ... kmap_task would be more
> accurate then.

So the main reason I'd like to see it is because I think on a
non-highmem machine, the new kmap should be a complete no-op. IOW,
we'd make sure that there are no costs, no need to increment any
"restrict migration" counts etc.

It's been a bit of a pain to have kmap_atomic() have magical side
semantics that people might then depend on.

I think "local" could still work as a name, because it would have to
be thread-local (and maybe we'd want a debug mode where that gets
verified, as actual HIGHMEM machines are getting rare).

I'd avoid "task", because that implies (to me, at least) that it
wouldn't be good for interrupts etc that don't have a task context.

I think the main issue is that it has to be released in the same
context as it was created (ie no passing those things around to other
contexts). I think "local" is fine for that, but I could imagine other
names. The ones that come to mind are somewhat cumbersome, though
("shortterm" or "local_ctx" or something along those lines).

I dunno.

              Linus

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox