[PATCH v5 0/7] RISCV device tree mapping

* [PATCH v5 0/7]  RISCV device tree mapping
@ 2024-08-21 16:06 Oleksii Kurochko
  2024-08-21 16:06 ` [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic() Oleksii Kurochko
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

This patch series introduces device tree mapping for RISC-V
along with the prerequisites for it:
- Fixmap mapping
- pmap
- Xen page table processing

---
Changes in v5:
 - The following patch was merged to staging:
     [PATCH v3 3/9] xen/riscv: enable CONFIG_HAS_DEVICE_TREE
 - Drop dependency on "RISCV basic exception handling implementation" as
   it was merged to staging branch.
 - All other changes are patch specific so please look at the patch.
---
Changes in v4:
 - Drop dependency on common device tree patch series as it was merged to
   staging.
 - Update the cover letter message.
 - All other changes are patch specific so please look at the patch.
---
Changes in v3:
 - Introduce SBI RFENCE extension support.
 - Introduce and initialize pcpu_info[] and __cpuid_to_hartid_map[] and functionality
   to work with these arrays.
 - Make page table handling arch specific instead of trying to make it generic.
 - All other changes are patch specific so please look at the patch.
---
Changes in v2:
 - Update the cover letter message
 - introduce fixmap mapping
 - introduce pmap
 - introduce CONFIG_GENERIC_PT
 - update to use early_fdt_map() after MMU is enabled.
---

Oleksii Kurochko (7):
  xen/riscv: use {read,write}{b,w,l,q}_cpu() to define
    {read,write}_atomic()
  xen/riscv: set up fixmap mappings
  xen/riscv: introduce asm/pmap.h header
  xen/riscv: introduce functionality to work with CPU info
  xen/riscv: introduce and initialize SBI RFENCE extension
  xen/riscv: page table handling
  xen/riscv: introduce early_fdt_map()

 xen/arch/riscv/Kconfig                      |   1 +
 xen/arch/riscv/Makefile                     |   3 +
 xen/arch/riscv/include/asm/atomic.h         |  25 +-
 xen/arch/riscv/include/asm/config.h         |  15 +-
 xen/arch/riscv/include/asm/fixmap.h         |  46 +++
 xen/arch/riscv/include/asm/flushtlb.h       |  18 +
 xen/arch/riscv/include/asm/mm.h             |   6 +
 xen/arch/riscv/include/asm/page.h           |  76 ++++
 xen/arch/riscv/include/asm/pmap.h           |  36 ++
 xen/arch/riscv/include/asm/processor.h      |  29 +-
 xen/arch/riscv/include/asm/riscv_encoding.h |   1 +
 xen/arch/riscv/include/asm/sbi.h            |  63 +++
 xen/arch/riscv/include/asm/smp.h            |  11 +
 xen/arch/riscv/mm.c                         | 101 ++++-
 xen/arch/riscv/pt.c                         | 420 ++++++++++++++++++++
 xen/arch/riscv/riscv64/head.S               |   4 +
 xen/arch/riscv/sbi.c                        | 273 ++++++++++++-
 xen/arch/riscv/setup.c                      |  17 +
 xen/arch/riscv/smp.c                        |  21 +
 xen/arch/riscv/smpboot.c                    |   8 +
 xen/arch/riscv/xen.lds.S                    |   2 +-
 21 files changed, 1150 insertions(+), 26 deletions(-)
 create mode 100644 xen/arch/riscv/include/asm/fixmap.h
 create mode 100644 xen/arch/riscv/include/asm/pmap.h
 create mode 100644 xen/arch/riscv/pt.c
 create mode 100644 xen/arch/riscv/smp.c
 create mode 100644 xen/arch/riscv/smpboot.c

-- 
2.46.0




* [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic()
  2024-08-21 16:06 [PATCH v5 0/7] RISCV device tree mapping Oleksii Kurochko
@ 2024-08-21 16:06 ` Oleksii Kurochko
  2024-08-27 10:06   ` Jan Beulich
  2024-08-21 16:06 ` [PATCH v5 2/7] xen/riscv: set up fixmap mappings Oleksii Kurochko
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

In Xen, memory-ordered atomic operations are not necessary,
based on {read,write}_atomic() implementations for other architectures.
Therefore, {read,write}{b,w,l,q}_cpu() can be used instead of
{read,write}{b,w,l,q}(), allowing the caller to decide if additional
fences should be applied before or after {read,write}_atomic().

Change the declaration of _write_atomic() to accept a 'volatile void *'
type for the 'x' argument instead of 'unsigned long'.
This prevents compilation errors such as:
1. "discards 'volatile' qualifier from pointer target type," which occurs
   due to the initialization of a volatile pointer,
   e.g., `volatile uint8_t *ptr = p;` in _add_sized().
2. "incompatible type for argument 2 of '_write_atomic'," which can occur
   when calling write_pte(), where 'x' is of type pte_t rather than
   unsigned long.
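
A minimal sketch of the by-pointer approach described above, written as a
host-side illustration rather than the code from this patch. The names
demo_pte_t, write_sized() and write_atomic_sketch() are hypothetical, and
plain volatile stores stand in for write{b,w,l,q}_cpu(). It shows why passing
the value by address lets a struct-typed value (such as a PTE wrapper) go
through the same size-dispatched helper as an integer:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a struct-wrapped page table entry. */
typedef struct { uint64_t pte; } demo_pte_t;

/*
 * Store 'size' bytes read from *x into *p with one plain store.
 * Taking the value by address means struct types can be passed too,
 * which a by-value 'unsigned long' parameter would reject.
 */
static inline void write_sized(volatile void *p, const volatile void *x,
                               unsigned int size)
{
    switch ( size )
    {
    case 1: *(volatile uint8_t *)p = *(const volatile uint8_t *)x; break;
    case 2: *(volatile uint16_t *)p = *(const volatile uint16_t *)x; break;
    case 4: *(volatile uint32_t *)p = *(const volatile uint32_t *)x; break;
    case 8: *(volatile uint64_t *)p = *(const volatile uint64_t *)x; break;
    default: assert(0 && "unsupported access size"); break;
    }
}

/* Copy the value into a local of the destination type, then pass its address. */
#define write_atomic_sketch(p, x) do {          \
    __typeof__(*(p)) x_ = (x);                  \
    write_sized(p, &x_, sizeof(*(p)));          \
} while ( 0 )
```

Any ordering fences remain the caller's responsibility in this sketch, in
line with the reasoning given in the commit message.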

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v5:
 - new patch.
---
 xen/arch/riscv/include/asm/atomic.h | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/xen/arch/riscv/include/asm/atomic.h b/xen/arch/riscv/include/asm/atomic.h
index 31b91a79c8..446c8c7928 100644
--- a/xen/arch/riscv/include/asm/atomic.h
+++ b/xen/arch/riscv/include/asm/atomic.h
@@ -31,21 +31,17 @@
 
 void __bad_atomic_size(void);
 
-/*
- * Legacy from Linux kernel. For some reason they wanted to have ordered
- * read/write access. Thereby read* is used instead of read*_cpu()
- */
 static always_inline void read_atomic_size(const volatile void *p,
                                            void *res,
                                            unsigned int size)
 {
     switch ( size )
     {
-    case 1: *(uint8_t *)res = readb(p); break;
-    case 2: *(uint16_t *)res = readw(p); break;
-    case 4: *(uint32_t *)res = readl(p); break;
+    case 1: *(uint8_t *)res = readb_cpu(p); break;
+    case 2: *(uint16_t *)res = readw_cpu(p); break;
+    case 4: *(uint32_t *)res = readl_cpu(p); break;
 #ifndef CONFIG_RISCV_32
-    case 8: *(uint32_t *)res = readq(p); break;
+    case 8: *(uint64_t *)res = readq_cpu(p); break;
 #endif
     default: __bad_atomic_size(); break;
     }
@@ -58,15 +54,16 @@ static always_inline void read_atomic_size(const volatile void *p,
 })
 
 static always_inline void _write_atomic(volatile void *p,
-                                       unsigned long x, unsigned int size)
+                                        volatile void *x,
+                                        unsigned int size)
 {
     switch ( size )
     {
-    case 1: writeb(x, p); break;
-    case 2: writew(x, p); break;
-    case 4: writel(x, p); break;
+    case 1: writeb_cpu(*(uint8_t *)x, p); break;
+    case 2: writew_cpu(*(uint16_t *)x, p); break;
+    case 4: writel_cpu(*(uint32_t *)x, p); break;
 #ifndef CONFIG_RISCV_32
-    case 8: writeq(x, p); break;
+    case 8: writeq_cpu(*(uint64_t *)x, p); break;
 #endif
     default: __bad_atomic_size(); break;
     }
@@ -75,7 +72,7 @@ static always_inline void _write_atomic(volatile void *p,
 #define write_atomic(p, x)                              \
 ({                                                      \
     typeof(*(p)) x_ = (x);                              \
-    _write_atomic(p, x_, sizeof(*(p)));                 \
+    _write_atomic(p, &x_, sizeof(*(p)));                \
 })
 
 static always_inline void _add_sized(volatile void *p,
-- 
2.46.0




* [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-21 16:06 [PATCH v5 0/7] RISCV device tree mapping Oleksii Kurochko
  2024-08-21 16:06 ` [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic() Oleksii Kurochko
@ 2024-08-21 16:06 ` Oleksii Kurochko
  2024-08-27 10:29   ` Jan Beulich
  2024-08-21 16:06 ` [PATCH v5 3/7] xen/riscv: introduce asm/pmap.h header Oleksii Kurochko
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Set up fixmap mappings and the L0 page table for fixmap support.

Define new macros in riscv/config.h for calculating
the FIXMAP_BASE address, including BOOT_FDT_VIRT_{START, SIZE},
XEN_VIRT_SIZE, and XEN_VIRT_END.

Update the check for Xen size in riscv/xen.lds.S to use
XEN_VIRT_SIZE instead of a hardcoded constant.
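
The address arithmetic behind these macros can be checked with a small
host-side sketch. It assumes XEN_VIRT_START is 0xffffffffc0000000 (the Xen
start address in the layout table in config.h) and 4K pages; MB() and
virt_to_fix_sketch() are defined locally for illustration and are not the
Xen definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Local helpers for the sketch; assumed, not taken from Xen headers. */
#define MB(x)                ((uint64_t)(x) << 20)
#define PAGE_SHIFT           12
#define PAGE_SIZE            (1ULL << PAGE_SHIFT)

/* Assumed from the layout table: Xen starts at 0xffffffffc0000000. */
#define XEN_VIRT_START       0xffffffffc0000000ULL

#define GAP_SIZE             MB(2)
#define XEN_VIRT_SIZE        MB(2)
#define BOOT_FDT_VIRT_START  (XEN_VIRT_START + XEN_VIRT_SIZE + GAP_SIZE)
#define BOOT_FDT_VIRT_SIZE   MB(4)
#define FIXMAP_BASE          (BOOT_FDT_VIRT_START + BOOT_FDT_VIRT_SIZE + GAP_SIZE)
#define FIXMAP_ADDR(n)       (FIXMAP_BASE + (uint64_t)(n) * PAGE_SIZE)

/* Inverse of FIXMAP_ADDR(): map a fixmap virtual address back to its slot. */
static inline unsigned int virt_to_fix_sketch(uint64_t vaddr)
{
    assert(vaddr >= FIXMAP_ADDR(0));
    return (unsigned int)((vaddr - FIXMAP_ADDR(0)) >> PAGE_SHIFT);
}
```

With these values the FDT region spans 0xffffffffc0400000 to
0xffffffffc0800000 and the fixmap starts at 0xffffffffc0a00000, matching the
updated layout table.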

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
 - move definition of FIXMAP_ADDR() to asm/fixmap.h
 - add gap size equal to 2 MB ( 512 * 4K one page table entry in L1 page table )
   between Xen, FDT and Fixmap.
 - drop the comment for FIX_LAST.
 - move +1 from FIX_LAST definition to FIXADDR_TOP to be aligned with Arm.
   ( probably everything below FIX_LAST will be moved to a separate header in asm/generic.h )
 - correct the "changes in V4": s/'fence r,r'/'fence rw,rw'
 - use write_atomic() in set_pte().
 - introduce read_pte().
---
Changes in V4:
 - move definitions of XEN_VIRT_SIZE, BOOT_FDT_VIRT_{START,SIZE}, FIXMAP_{BASE,ADDR}
   below XEN_VIRT_START to have definitions appear in order.
 - define FIX_LAST as (FIX_MISC + 1) to have a guard slot at the end.
 - s/enumerated/numbered in the comment
 - update the loop which looks for the L1 page table in setup_fixmap_mapping_function() and
   the comment above it.
 - drop fences inside write_pte() and put 'fence rw,rw' in setup_fixmap() before sfence_vma().
 - update the commit message
 - drop printk message inside setup_fixmap().
---
Changes in V3:
 - s/XEN_SIZE/XEN_VIRT_SIZE
 - drop usage of XEN_VIRT_END.
 - sort newly introduced defines in config.h by address
 - code style fixes
 - drop the runtime check that the pte is valid as it is already checked by BUG_ON() in the L1 page table lookup loop.
 - update implementation of write_pte() with FENCE rw, rw.
 - add BUILD_BUG_ON() to check that the number of fixmap entries isn't bigger than the number of entries in a page table.
 - drop set_fixmap, clear_fixmap declarations as they aren't used and defined now
 - update the commit message.
 - s/__ASM_FIXMAP_H/ASM_FIXMAP_H
 - add SPDX-License-Identifier: GPL-2.0 
---
 xen/arch/riscv/include/asm/config.h | 15 ++++++++--
 xen/arch/riscv/include/asm/fixmap.h | 46 +++++++++++++++++++++++++++++
 xen/arch/riscv/include/asm/mm.h     |  2 ++
 xen/arch/riscv/include/asm/page.h   | 13 ++++++++
 xen/arch/riscv/mm.c                 | 43 +++++++++++++++++++++++++++
 xen/arch/riscv/setup.c              |  2 ++
 xen/arch/riscv/xen.lds.S            |  2 +-
 7 files changed, 120 insertions(+), 3 deletions(-)
 create mode 100644 xen/arch/riscv/include/asm/fixmap.h

diff --git a/xen/arch/riscv/include/asm/config.h b/xen/arch/riscv/include/asm/config.h
index 50583aafdc..f55d6c45da 100644
--- a/xen/arch/riscv/include/asm/config.h
+++ b/xen/arch/riscv/include/asm/config.h
@@ -41,8 +41,10 @@
  * Start addr          | End addr         | Slot       | area description
  * ============================================================================
  *                   .....                 L2 511          Unused
- *  0xffffffffc0600000  0xffffffffc0800000 L2 511          Fixmap
- *  0xffffffffc0200000  0xffffffffc0600000 L2 511          FDT
+ *  0xffffffffc0A00000  0xffffffffc0C00000 L2 511          Fixmap
+ *                   ..... ( 2 MB gap )
+ *  0xffffffffc0400000  0xffffffffc0800000 L2 511          FDT
+ *                   ..... ( 2 MB gap )
  *  0xffffffffc0000000  0xffffffffc0200000 L2 511          Xen
  *                   .....                 L2 510          Unused
  *  0x3200000000        0x7f40000000       L2 200-509      Direct map
@@ -74,6 +76,15 @@
 #error "unsupported RV_STAGE1_MODE"
 #endif
 
+#define GAP_SIZE                MB(2)
+
+#define XEN_VIRT_SIZE           MB(2)
+
+#define BOOT_FDT_VIRT_START     (XEN_VIRT_START + XEN_VIRT_SIZE + GAP_SIZE)
+#define BOOT_FDT_VIRT_SIZE      MB(4)
+
+#define FIXMAP_BASE             (BOOT_FDT_VIRT_START + BOOT_FDT_VIRT_SIZE + GAP_SIZE)
+
 #define DIRECTMAP_SLOT_END      509
 #define DIRECTMAP_SLOT_START    200
 #define DIRECTMAP_VIRT_START    SLOTN(DIRECTMAP_SLOT_START)
diff --git a/xen/arch/riscv/include/asm/fixmap.h b/xen/arch/riscv/include/asm/fixmap.h
new file mode 100644
index 0000000000..63732df36c
--- /dev/null
+++ b/xen/arch/riscv/include/asm/fixmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * fixmap.h: compile-time virtual memory allocation
+ */
+#ifndef ASM_FIXMAP_H
+#define ASM_FIXMAP_H
+
+#include <xen/bug.h>
+#include <xen/page-size.h>
+#include <xen/pmap.h>
+
+#include <asm/page.h>
+
+#define FIXMAP_ADDR(n) (FIXMAP_BASE + (n) * PAGE_SIZE)
+
+/* Fixmap slots */
+#define FIX_PMAP_BEGIN (0) /* Start of PMAP */
+#define FIX_PMAP_END (FIX_PMAP_BEGIN + NUM_FIX_PMAP - 1) /* End of PMAP */
+#define FIX_MISC (FIX_PMAP_END + 1)  /* Ephemeral mappings of hardware */
+
+#define FIX_LAST FIX_MISC
+
+#define FIXADDR_START FIXMAP_ADDR(0)
+#define FIXADDR_TOP FIXMAP_ADDR(FIX_LAST + 1)
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Direct access to xen_fixmap[] should only happen when {set,
+ * clear}_fixmap() is unusable (e.g. where we would end up to
+ * recursively call the helpers).
+ */
+extern pte_t xen_fixmap[];
+
+#define fix_to_virt(slot) ((void *)FIXMAP_ADDR(slot))
+
+static inline unsigned int virt_to_fix(vaddr_t vaddr)
+{
+    BUG_ON(vaddr >= FIXADDR_TOP || vaddr < FIXADDR_START);
+
+    return ((vaddr - FIXADDR_START) >> PAGE_SHIFT);
+}
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* ASM_FIXMAP_H */
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 25af9e1aaa..a0bdc2bc3a 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -255,4 +255,6 @@ static inline unsigned int arch_get_dma_bitsize(void)
     return 32; /* TODO */
 }
 
+void setup_fixmap_mappings(void);
+
 #endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index c831e16417..a7419b93b2 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -9,6 +9,7 @@
 #include <xen/bug.h>
 #include <xen/types.h>
 
+#include <asm/atomic.h>
 #include <asm/mm.h>
 #include <asm/page-bits.h>
 
@@ -81,6 +82,18 @@ static inline void flush_page_to_ram(unsigned long mfn, bool sync_icache)
     BUG_ON("unimplemented");
 }
 
+/* Write a pagetable entry. */
+static inline void write_pte(pte_t *p, pte_t pte)
+{
+    write_atomic(p, pte);
+}
+
+/* Read a pagetable entry. */
+static inline pte_t read_pte(pte_t *p)
+{
+    return read_atomic(p);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_RISCV_PAGE_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 7d09e781bf..b8ff91cf4e 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -12,6 +12,7 @@
 #include <asm/early_printk.h>
 #include <asm/csr.h>
 #include <asm/current.h>
+#include <asm/fixmap.h>
 #include <asm/page.h>
 #include <asm/processor.h>
 
@@ -49,6 +50,9 @@ stage1_pgtbl_root[PAGETABLE_ENTRIES];
 pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
 stage1_pgtbl_nonroot[PGTBL_INITIAL_COUNT * PAGETABLE_ENTRIES];
 
+pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
+xen_fixmap[PAGETABLE_ENTRIES];
+
 #define HANDLE_PGTBL(curr_lvl_num)                                          \
     index = pt_index(curr_lvl_num, page_addr);                              \
     if ( pte_is_valid(pgtbl[index]) )                                       \
@@ -191,6 +195,45 @@ static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
     return is_mode_supported;
 }
 
+void __init setup_fixmap_mappings(void)
+{
+    pte_t *pte, tmp;
+    unsigned int i;
+
+    BUILD_BUG_ON(FIX_LAST >= PAGETABLE_ENTRIES);
+
+    pte = &stage1_pgtbl_root[pt_index(HYP_PT_ROOT_LEVEL, FIXMAP_ADDR(0))];
+
+    /*
+     * In RISC-V page table levels are numbered from Lx to L0 where x is the
+     * highest page table level for the current MMU mode ( for example, Sv39
+     * has 3 page table levels, so x = 2 (L2 -> L1 -> L0) ).
+     *
+     * In this loop we want to find the L1 page table because xen_fixmap[]
+     * will be used as the L0 page table.
+     */
+    for ( i = HYP_PT_ROOT_LEVEL; i-- > 1; )
+    {
+        BUG_ON(!pte_is_valid(*pte));
+
+        pte = (pte_t *)LOAD_TO_LINK(pte_to_paddr(*pte));
+        pte = &pte[pt_index(i, FIXMAP_ADDR(0))];
+    }
+
+    BUG_ON(pte_is_valid(*pte));
+
+    tmp = paddr_to_pte(LINK_TO_LOAD((unsigned long)&xen_fixmap), PTE_TABLE);
+    write_pte(pte, tmp);
+
+    RISCV_FENCE(rw, rw);
+    sfence_vma();
+
+    /*
+     * We only need the zeroth table allocated, but not the PTEs set, because
+     * set_fixmap() will set them on the fly.
+     */
+}
+
 /*
  * setup_initial_pagetables:
  *
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 4defad68f4..13f0e8c77d 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -46,6 +46,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
     test_macros_from_bug_h();
 #endif
 
+    setup_fixmap_mappings();
+
     printk("All set up\n");
 
     for ( ;; )
diff --git a/xen/arch/riscv/xen.lds.S b/xen/arch/riscv/xen.lds.S
index 070b19d915..7a683f6065 100644
--- a/xen/arch/riscv/xen.lds.S
+++ b/xen/arch/riscv/xen.lds.S
@@ -181,6 +181,6 @@ ASSERT(!SIZEOF(.got.plt),  ".got.plt non-empty")
  * Changing the size of Xen binary can require an update of
  * PGTBL_INITIAL_COUNT.
  */
-ASSERT(_end - _start <= MB(2), "Xen too large for early-boot assumptions")
+ASSERT(_end - _start <= XEN_VIRT_SIZE, "Xen too large for early-boot assumptions")
 
 ASSERT(_ident_end - _ident_start <= IDENT_AREA_SIZE, "identity region is too big");
-- 
2.46.0




* [PATCH v5 3/7] xen/riscv: introduce asm/pmap.h header
  2024-08-21 16:06 [PATCH v5 0/7] RISCV device tree mapping Oleksii Kurochko
  2024-08-21 16:06 ` [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic() Oleksii Kurochko
  2024-08-21 16:06 ` [PATCH v5 2/7] xen/riscv: set up fixmap mappings Oleksii Kurochko
@ 2024-08-21 16:06 ` Oleksii Kurochko
  2024-08-21 16:06 ` [PATCH v5 4/7] xen/riscv: introduce functionality to work with CPU info Oleksii Kurochko
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce arch_pmap_{un}map functions and select HAS_PMAP for CONFIG_RISCV.

Add pte_from_mfn() for use in arch_pmap_map().

Introduce flush_tlb_one_local() and use it in arch_pmap_{un}map().
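
As a rough illustration of what pte_from_mfn() computes, the sketch below
encodes a frame number and permission bits into a PTE value and decodes them
again. It assumes the standard RISC-V Sv39 PTE layout (PPN starting at bit 10
and 44 bits wide, V/R/W in bits 0..2). The *_sketch names and the flag macros
are hypothetical and do not reproduce Xen's PAGE_HYPERVISOR_RW definition:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PPN_SHIFT     10               /* PPN starts at bit 10 of a PTE */
#define PTE_PPN_BITS      44               /* Sv39: PPN occupies bits 53:10 */

/* Hypothetical flag names for the sketch (bit positions per the RISC-V spec). */
#define PTE_V_SKETCH      (1U << 0)        /* valid */
#define PTE_R_SKETCH      (1U << 1)        /* readable */
#define PTE_W_SKETCH      (1U << 2)        /* writable */

typedef struct { uint64_t pte; } pte_sketch_t;

/* Build a leaf PTE from a machine frame number and permission flags. */
static inline pte_sketch_t pte_from_mfn_sketch(uint64_t mfn, unsigned int flags)
{
    return (pte_sketch_t){ .pte = (mfn << PTE_PPN_SHIFT) | flags };
}

/* Recover the frame number, masking off any bits above the PPN field. */
static inline uint64_t pte_to_mfn_sketch(pte_sketch_t pte)
{
    return (pte.pte >> PTE_PPN_SHIFT) & ((1ULL << PTE_PPN_BITS) - 1);
}

static inline int pte_is_valid_sketch(pte_sketch_t pte)
{
    return (pte.pte & PTE_V_SKETCH) != 0;
}
```

An all-zero entry, as written by arch_pmap_unmap(), has the valid bit clear
and is therefore treated as unmapped.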

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
 - Add Reviewed-by: Jan Beulich <jbeulich@suse.com>.
 - Fix a typo in "Changes in V4":
   - "drop flush_xen_tlb_range_va_local() as it isn't used in this patch" ->
     "drop flush_xen_tlb_range_va() as it isn't used in this patch"
   - "s/flush_xen_tlb_range_va_local/flush_tlb_range_va_local" ->
     "s/flush_xen_tlb_one_local/flush_tlb_one_local"
---
Changes in V4:
 - mark arch_pmap_{un}map() as __init: documentation purpose and
   a necessary (but not sufficient) condition here, to validly
   use local TLB flushes only.
 - add flush_xen_tlb_one_local() to arch_pmap_map() as absence of
   "negative" TLB entries will be guaranteed only in the case
   when Svvptc extension is present.
 - s/mfn_from_pte/pte_from_mfn
 - drop mfn_to_xen_entry() as pte_from_mfn() does the same thing
 - add flags argument to pte_from_mfn().
 - update the commit message.
 - drop flush_xen_tlb_range_va() as it isn't used in this patch
 - s/flush_xen_tlb_one_local/flush_tlb_one_local
---
Changes in V3:
 - rename argument of function mfn_to_xen_entry(..., attr -> flags ).
 - update the code of mfn_to_xen_entry() to use flags argument.
 - add blank in mfn_from_pte() in return line.
 - introduce flush_xen_tlb_range_va_local() and use it inside arch_pmap_{un}map().
 - s/__ASM_PMAP_H__/ASM_PMAP_H
 - add SPDX-License-Identifier: GPL-2.0 
---
 xen/arch/riscv/Kconfig                |  1 +
 xen/arch/riscv/include/asm/flushtlb.h |  6 +++++
 xen/arch/riscv/include/asm/page.h     |  6 +++++
 xen/arch/riscv/include/asm/pmap.h     | 36 +++++++++++++++++++++++++++
 4 files changed, 49 insertions(+)
 create mode 100644 xen/arch/riscv/include/asm/pmap.h

diff --git a/xen/arch/riscv/Kconfig b/xen/arch/riscv/Kconfig
index 259eea8d3b..0112aa8778 100644
--- a/xen/arch/riscv/Kconfig
+++ b/xen/arch/riscv/Kconfig
@@ -3,6 +3,7 @@ config RISCV
 	select FUNCTION_ALIGNMENT_16B
 	select GENERIC_BUG_FRAME
 	select HAS_DEVICE_TREE
+	select HAS_PMAP
 
 config RISCV_64
 	def_bool y
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index 7ce32bea0b..f4a735fd6c 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -5,6 +5,12 @@
 #include <xen/bug.h>
 #include <xen/cpumask.h>
 
+/* Flush TLB of local processor for address va. */
+static inline void flush_tlb_one_local(vaddr_t va)
+{
+    asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
+}
+
 /*
  * Filter the given set of CPUs, removing those that definitely flushed their
  * TLB since @page_timestamp.
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index a7419b93b2..55916eaa92 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -94,6 +94,12 @@ static inline pte_t read_pte(pte_t *p)
     return read_atomic(p);
 }
 
+static inline pte_t pte_from_mfn(mfn_t mfn, unsigned int flags)
+{
+    unsigned long pte = (mfn_x(mfn) << PTE_PPN_SHIFT) | flags;
+    return (pte_t){ .pte = pte };
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_RISCV_PAGE_H */
diff --git a/xen/arch/riscv/include/asm/pmap.h b/xen/arch/riscv/include/asm/pmap.h
new file mode 100644
index 0000000000..60065c996f
--- /dev/null
+++ b/xen/arch/riscv/include/asm/pmap.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_PMAP_H
+#define ASM_PMAP_H
+
+#include <xen/bug.h>
+#include <xen/init.h>
+#include <xen/mm.h>
+#include <xen/page-size.h>
+
+#include <asm/fixmap.h>
+#include <asm/flushtlb.h>
+#include <asm/system.h>
+
+static inline void __init arch_pmap_map(unsigned int slot, mfn_t mfn)
+{
+    pte_t *entry = &xen_fixmap[slot];
+    pte_t pte;
+
+    ASSERT(!pte_is_valid(*entry));
+
+    pte = pte_from_mfn(mfn, PAGE_HYPERVISOR_RW);
+    write_pte(entry, pte);
+
+    flush_tlb_one_local(FIXMAP_ADDR(slot));
+}
+
+static inline void __init arch_pmap_unmap(unsigned int slot)
+{
+    pte_t pte = {};
+
+    write_pte(&xen_fixmap[slot], pte);
+
+    flush_tlb_one_local(FIXMAP_ADDR(slot));
+}
+
+#endif /* ASM_PMAP_H */
-- 
2.46.0




* [PATCH v5 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-21 16:06 [PATCH v5 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (2 preceding siblings ...)
  2024-08-21 16:06 ` [PATCH v5 3/7] xen/riscv: introduce asm/pmap.h header Oleksii Kurochko
@ 2024-08-21 16:06 ` Oleksii Kurochko
  2024-08-27 13:44   ` Jan Beulich
  2024-08-21 16:06 ` [PATCH v5 5/7] xen/riscv: introduce and initialize SBI RFENCE extension Oleksii Kurochko
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce struct pcpu_info to store pCPU-related information.
Initially, it includes only processor_id and hart id, but it
will be extended to include guest CPU information and
temporary variables for saving/restoring vCPU registers.

Add set_processor_id() and get_processor_id() functions to set
and retrieve the processor_id stored in pcpu_info.

Introduce cpuid_to_hartid() to convert Xen logical CPUs to
hart IDs (physical CPU IDs).

Define smp_processor_id() to provide accurate information,
replacing the previous "dummy" value of 0.

Initialize the tp register to point to pcpu_info[0].
Set processor_id to 0 for logical CPU 0 and store the physical
CPU ID in pcpu_info[0].
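
The boot-CPU flow described above can be sketched on a host as follows. An
ordinary global pointer (tp_sketch) stands in for the tp register, and every
*_sketch name, as well as NR_CPUS_SKETCH, is hypothetical. This is an
illustration under those assumptions, not the code added by the patch:

```c
#include <assert.h>
#include <limits.h>

#define NR_CPUS_SKETCH        4
#define INVALID_HARTID_SKETCH ULONG_MAX

struct pcpu_info_sketch {
    unsigned int processor_id;  /* Xen logical CPU id */
    unsigned long hart_id;      /* physical CPU (hart) id */
};

static struct pcpu_info_sketch pcpu_info_sketch[NR_CPUS_SKETCH];

/* Stand-in for the tp register: points at the current CPU's entry. */
static struct pcpu_info_sketch *tp_sketch;

static void setup_tp_sketch(unsigned int cpuid)
{
    assert(cpuid < NR_CPUS_SKETCH);
    tp_sketch = &pcpu_info_sketch[cpuid];
}

/* Boot CPU: logical id 0, hart id as passed in by the boot loader/firmware. */
static void boot_cpu_init_sketch(unsigned long boot_hartid)
{
    unsigned int i;

    for ( i = 0; i < NR_CPUS_SKETCH; i++ )
        pcpu_info_sketch[i].hart_id = INVALID_HARTID_SKETCH;

    setup_tp_sketch(0);
    tp_sketch->processor_id = 0;
    pcpu_info_sketch[0].hart_id = boot_hartid;
}

static unsigned int smp_processor_id_sketch(void)
{
    unsigned int id = tp_sketch->processor_id;

    assert(id < NR_CPUS_SKETCH);  /* valid ids are 0 .. NR_CPUS - 1 */
    return id;
}
```

The logical id is always 0 for the boot CPU, while its hart id is whatever
value firmware passed in, which is why the two are stored separately.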

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
 - add hart_id to pcpu_info;
 - add comments to pcpu_info members.
 - define INVALID_HARTID as ULONG_MAX as the mhartid register is MXLEN bits
   wide, i.e. 32 bits for RV32 and 64 bits for RV64.
 - add hart_id to pcpu_info structure.
 - drop cpuid_to_hartid_map[] and use pcpu_info[] for the same purpose.
 - introduce new function setup_tp(cpuid).
 - add the FIXME comment on top of pcpu_info[].
 - setup TP register before start_xen() is called.
 - update the commit message.
 - change "commit message" to "comment" in "Changes in V4" in "update the comment
   above the code of TP..."
---
Changes in V4:
 - wrap id with () inside set_processor_id().
 - code style fixes
 - update BUG_ON(id > NR_CPUS) in smp_processor_id() and drop the comment
   above BUG_ON().
 - s/__cpuid_to_hartid_map/cpuid_to_hartid_map
 - s/cpuid_to_hartid_map/cpuid_to_hartid ( here cpuid_to_hartid_map is the name
   of the macro ).
 - update the comment above the code of TP register initialization in
   start_xen().
 - s/smp_setup_processor_id/smp_setup_bootcpu_id
 - update the commit message.
 - cleanup headers which are included in <asm/processor.h>
---
Changes in V3:
 - new patch.
---
 xen/arch/riscv/Makefile                |  2 ++
 xen/arch/riscv/include/asm/processor.h | 29 ++++++++++++++++++++++++--
 xen/arch/riscv/include/asm/smp.h       | 11 ++++++++++
 xen/arch/riscv/riscv64/head.S          |  4 ++++
 xen/arch/riscv/setup.c                 |  5 +++++
 xen/arch/riscv/smp.c                   | 21 +++++++++++++++++++
 xen/arch/riscv/smpboot.c               |  8 +++++++
 7 files changed, 78 insertions(+), 2 deletions(-)
 create mode 100644 xen/arch/riscv/smp.c
 create mode 100644 xen/arch/riscv/smpboot.c

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 81b77b13d6..334fd24547 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -4,6 +4,8 @@ obj-y += mm.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
+obj-y += smp.o
+obj-y += smpboot.o
 obj-y += stubs.o
 obj-y += traps.o
 obj-y += vm_event.o
diff --git a/xen/arch/riscv/include/asm/processor.h b/xen/arch/riscv/include/asm/processor.h
index 3ae164c265..98c45afb6c 100644
--- a/xen/arch/riscv/include/asm/processor.h
+++ b/xen/arch/riscv/include/asm/processor.h
@@ -12,8 +12,33 @@
 
 #ifndef __ASSEMBLY__
 
-/* TODO: need to be implemeted */
-#define smp_processor_id() 0
+#include <xen/bug.h>
+
+register struct pcpu_info *tp asm ("tp");
+
+struct pcpu_info {
+    unsigned int processor_id; /* Xen CPU id */
+    unsigned long hart_id; /* physical CPU id */
+};
+
+/* tp points to one of these */
+extern struct pcpu_info pcpu_info[NR_CPUS];
+
+#define get_processor_id()      (tp->processor_id)
+#define set_processor_id(id)    do { \
+    tp->processor_id = (id);         \
+} while (0)
+
+static inline unsigned int smp_processor_id(void)
+{
+    unsigned int id;
+
+    id = get_processor_id();
+
+    BUG_ON(id >= NR_CPUS);
+
+    return id;
+}
 
 /* On stack VCPU state */
 struct cpu_user_regs
diff --git a/xen/arch/riscv/include/asm/smp.h b/xen/arch/riscv/include/asm/smp.h
index b1ea91b1eb..2b719616ee 100644
--- a/xen/arch/riscv/include/asm/smp.h
+++ b/xen/arch/riscv/include/asm/smp.h
@@ -5,6 +5,10 @@
 #include <xen/cpumask.h>
 #include <xen/percpu.h>
 
+#include <asm/processor.h>
+
+#define INVALID_HARTID ULONG_MAX
+
 DECLARE_PER_CPU(cpumask_var_t, cpu_sibling_mask);
 DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
 
@@ -14,6 +18,13 @@ DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
  */
 #define park_offline_cpus false
 
+void smp_set_bootcpu_id(unsigned long boot_cpu_hartid);
+
+/*
+ * Mapping between Xen logical cpu index and hartid.
+ */
+#define cpuid_to_hartid(cpu) pcpu_info[cpu].hart_id
+
 #endif
 
 /*
diff --git a/xen/arch/riscv/riscv64/head.S b/xen/arch/riscv/riscv64/head.S
index 3261e9fce8..9e5b9a0708 100644
--- a/xen/arch/riscv/riscv64/head.S
+++ b/xen/arch/riscv/riscv64/head.S
@@ -55,6 +55,10 @@ FUNC(start)
          */
         jal     reset_stack
 
+        /* Xen's boot cpu id is equal to 0 so setup TP register for it */
+        mv      a0, x0
+        jal     setup_tp
+
         /* restore hart_id ( bootcpu_id ) and dtb address */
         mv      a0, s0
         mv      a1, s1
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 13f0e8c77d..e15f34509c 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -8,6 +8,7 @@
 #include <public/version.h>
 
 #include <asm/early_printk.h>
+#include <asm/smp.h>
 #include <asm/traps.h>
 
 void arch_get_xen_caps(xen_capabilities_info_t *info)
@@ -40,6 +41,10 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 {
     remove_identity_mapping();
 
+    set_processor_id(0);
+
+    smp_set_bootcpu_id(bootcpu_id);
+
     trap_init();
 
 #ifdef CONFIG_SELF_TESTS
diff --git a/xen/arch/riscv/smp.c b/xen/arch/riscv/smp.c
new file mode 100644
index 0000000000..478ea5aeab
--- /dev/null
+++ b/xen/arch/riscv/smp.c
@@ -0,0 +1,21 @@
+#include <xen/smp.h>
+
+/*
+ * FIXME: make pcpu_info[] dynamically allocated when necessary
+ *        functionality will be ready
+ */
+/* tp points to one of these per cpu */
+struct pcpu_info pcpu_info[NR_CPUS] = { { 0, INVALID_HARTID } };
+
+void setup_tp(unsigned int cpuid)
+{
+    /*
+     * tp register contains an address of physical cpu information.
+     * So write physical CPU info of cpuid to tp register.
+     * It will be used later by get_processor_id() ( look at
+     * <asm/processor.h> ):
+     *   #define get_processor_id()    (tp->processor_id)
+     */
+    asm volatile ( "mv tp, %0"
+                   :: "r" ((unsigned long)&pcpu_info[cpuid]) : "memory" );
+}
diff --git a/xen/arch/riscv/smpboot.c b/xen/arch/riscv/smpboot.c
new file mode 100644
index 0000000000..34319f8875
--- /dev/null
+++ b/xen/arch/riscv/smpboot.c
@@ -0,0 +1,8 @@
+#include <xen/init.h>
+#include <xen/sections.h>
+#include <xen/smp.h>
+
+void __init smp_set_bootcpu_id(unsigned long boot_cpu_hartid)
+{
+    cpuid_to_hartid(0) = boot_cpu_hartid;
+}
-- 
2.46.0




* [PATCH v5 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-21 16:06 [PATCH v5 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (3 preceding siblings ...)
  2024-08-21 16:06 ` [PATCH v5 4/7] xen/riscv: introduce functionality to work with CPU info Oleksii Kurochko
@ 2024-08-21 16:06 ` Oleksii Kurochko
  2024-08-27 14:19   ` Jan Beulich
  2024-08-21 16:06 ` [PATCH v5 6/7] xen/riscv: page table handling Oleksii Kurochko
  2024-08-21 16:06 ` [PATCH v5 7/7] xen/riscv: introduce early_fdt_map() Oleksii Kurochko
  6 siblings, 1 reply; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce functions to work with the SBI RFENCE extension for issuing
various fence operations to remote CPUs.

Add the sbi_init() function along with auxiliary functions and macro
definitions for proper initialization and checking the availability of
SBI extensions. Currently, this is implemented only for RFENCE.
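
As an illustration of the version handling done during initialization, the
snippet below decodes an SBI specification version word into its major and
minor parts using the two masks from this patch. It is only a sketch:
mask_extr(), spec_major() and spec_minor() are hypothetical stand-ins for
MASK_EXTR(), sbi_major_version() and sbi_minor_version().

```c
#include <assert.h>

/* Masks as defined in this patch's asm/sbi.h. */
#define SBI_SPEC_VERSION_MAJOR_MASK 0x7F000000UL
#define SBI_SPEC_VERSION_MINOR_MASK 0x00FFFFFFUL

/*
 * Hypothetical stand-in for MASK_EXTR(): isolate the field selected by
 * mask m and shift it down by the position of the mask's lowest set bit.
 */
static unsigned long mask_extr(unsigned long v, unsigned long m)
{
    assert(m != 0);

    /* (m & -m) is the lowest set bit of m, so the division is a shift. */
    return (v & m) / (m & -m);
}

static unsigned long spec_major(unsigned long spec_version)
{
    return mask_extr(spec_version, SBI_SPEC_VERSION_MAJOR_MASK);
}

static unsigned long spec_minor(unsigned long spec_version)
{
    return mask_extr(spec_version, SBI_SPEC_VERSION_MINOR_MASK);
}
```

With these masks, a version word of 0x01000000 decodes as v1.0 and a value
of 0x2 decodes as v0.2.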

Introduce sbi_remote_sfence_vma() to request execution of SFENCE.VMA
instructions on a set of target HARTs. This will support the
implementation of flush_tlb_range_va().

Integrate __sbi_rfence_v02 from Linux kernel 6.6.0-rc4 with minimal
modifications:
 - Adapt to Xen code style.
 - Use cpuid_to_hartid() instead of cpuid_to_hartid_map[].
 - Update BIT(...) to BIT(..., UL).
 - Rename __sbi_rfence_v02_call to sbi_rfence_v02_real and
   remove the unused arg5.
 - Handle NULL cpu_mask to execute rfence on all CPUs by calling
   sbi_rfence_v02_real(..., 0UL, -1UL,...) instead of creating hmask.
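
The batching of hart IDs into (hmask, hbase) pairs can be tried out in
isolation. The sketch below mirrors the loop of sbi_rfence_v02() but, instead
of issuing SBI calls, records each pair it would pass on. It assumes a 64-bit
hart mask (as on RV64); struct rfence_call, emit() and batch_harts() are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HART_BITS 64 /* width of hmask; BITS_PER_LONG on RV64 (assumption) */

/* One recorded call: a hart mask relative to a base hart ID. */
struct rfence_call {
    uint64_t hmask;
    uint64_t hbase;
};

/* Store call number n if it fits into out[]; return the new call count. */
static size_t emit(struct rfence_call *out, size_t max, size_t n,
                   uint64_t hmask, uint64_t hbase)
{
    if ( n < max )
    {
        out[n].hmask = hmask;
        out[n].hbase = hbase;
    }

    return n + 1;
}

/*
 * Group hart IDs into windows of at most HART_BITS consecutive IDs.
 * Returns the number of calls needed, which may exceed max; only the
 * first max calls are stored in out[].
 */
static size_t batch_harts(const uint64_t *hartids, size_t count,
                          struct rfence_call *out, size_t max)
{
    uint64_t hmask = 0, hbase = 0, htop = 0;
    size_t ncalls = 0;

    for ( size_t i = 0; i < count; i++ )
    {
        uint64_t hartid = hartids[i];

        if ( hmask )
        {
            if ( hartid + HART_BITS <= htop ||
                 hbase + HART_BITS <= hartid )
            {
                /* hartid does not fit into the window: flush it. */
                ncalls = emit(out, max, ncalls, hmask, hbase);
                hmask = 0;
            }
            else if ( hartid < hbase )
            {
                /* Shift the mask so that the lower hartid becomes bit 0. */
                hmask <<= hbase - hartid;
                hbase = hartid;
            }
        }

        if ( !hmask )
        {
            hbase = hartid;
            htop = hartid;
        }
        else if ( hartid > htop )
            htop = hartid;

        assert(hartid - hbase < HART_BITS);
        hmask |= UINT64_C(1) << (hartid - hbase);
    }

    if ( hmask )
        ncalls = emit(out, max, ncalls, hmask, hbase);

    return ncalls;
}
```

For the hart IDs 0, 1, 3, 65, 66, 69 this produces two calls: hmask=0b1011
with hbase=0, and hmask=0b10011 with hbase=65.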

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
 - update the comment for sbi_has_rfence().
 - update the comment for sbi_remote_sfence_vma().
 - update the prototype of sbi_remote_sfence_vma() and declare cpu_mask
   argument as pointer to const.
 - use MASK_EXTR() for sbi_{major, minor}_version().
 - redefine SBI_SPEC_VERSION_MAJOR_MASK as 0x7F000000
 - drop SBI_SPEC_VERSION_MAJOR_SHIFT as unneeded.
 - add BUG_ON(ret.value < 0) inside sbi_ext_base_func() to be sure that
   ret.value is always >= 0 as the SBI spec doesn't explicitly say that.
 - s/__sbi_rfence_v02_real/sbi_rfence_v02_real
 - s/__sbi_rfence_v02/sbi_rfence_v02
 - s/__sbi_rfence/sbi_rfence
 - fold cases inside sbi_rfence_v02_real()
 - mark sbi_rfence_v02 with cf_check.
 - code style fixes in sbi_rfence_v02().
 - add the comment with explanation of algorithm used in sbi_rfence_v02().
 - use __ro_after_init for sbi_rfence variable.
 - add ASSERT(sbi_rfence) inside sbi_remote_sfence_vma to be sure that it
   is not NULL.
 - drop local variable ret inside sbi_init() and init sbi_spec_version
   directly by return value of sbi_get_spec_version() as this function
   must always succeed.
 - add the comment above sbi_get_spec_version().
 - add BUG_ON for sbi_fw_id and sbi_fw_version() to be sure that they
   have correct values.
 - make sbi_fw_id, sbi_fw_version as local because they are used only once
   for printk().
 - s/veriosn/version
 - drop BUG_ON("At the moment flush_xen_tlb_range_va() uses SBI rfence...")
   as now we have ASSERT() in the place where sbi_rfence is actually used.
 - update the commit message.
 - s/BUG_ON("Ooops. SBI spec version 0.1 detected. Need to add support")/panic("Ooops. SBI ...");
---
Changes in V4:
 - update the commit message.
 - code style fixes
 - update return type of sbi_has_rfence() from int to bool and drop
   conditional operator inside implementation.
 - Update mapping of SBI_ERR_FAILURE in sbi_err_map_xen_errno().
 - Update return type of sbi_spec_is_0_1() and drop conditional operator
   inside implementation.
 - s/0x%lx/%#lx
 - update the comment above declaration of sbi_remote_sfence_vma() with
   more detailed explanation what the function does.
 - update prototype of sbi_remote_sfence_vma(). Now it receives cpumask_t
   and returns int.
 - refactor __sbi_rfence_v02() to follow the Linux kernel version, as it
   takes into account the case where hart IDs need different hbase values.
   For example, the case when hart IDs are the following: 0, 3, 65, 2. Or
   the case when hart IDs are unsorted: 0 3 1 2.
 - drop sbi_cpumask_to_hartmask() as it is not needed anymore
 - Update the prototype of sbi_remote_sfence_vma() and its implementation
   to reflect the fact that it returns 'int'.
 - s/flush_xen_tlb_one_local/flush_tlb_one_local
---
Changes in V3:
 - new patch.
---
 xen/arch/riscv/include/asm/sbi.h |  63 +++++++
 xen/arch/riscv/sbi.c             | 273 ++++++++++++++++++++++++++++++-
 xen/arch/riscv/setup.c           |   3 +
 3 files changed, 338 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/include/asm/sbi.h b/xen/arch/riscv/include/asm/sbi.h
index 0e6820a4ed..76921d4cd1 100644
--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -12,8 +12,41 @@
 #ifndef __ASM_RISCV_SBI_H__
 #define __ASM_RISCV_SBI_H__
 
+#include <xen/cpumask.h>
+
 #define SBI_EXT_0_1_CONSOLE_PUTCHAR		0x1
 
+#define SBI_EXT_BASE                    0x10
+#define SBI_EXT_RFENCE                  0x52464E43
+
+/* SBI function IDs for BASE extension */
+#define SBI_EXT_BASE_GET_SPEC_VERSION   0x0
+#define SBI_EXT_BASE_GET_IMP_ID         0x1
+#define SBI_EXT_BASE_GET_IMP_VERSION    0x2
+#define SBI_EXT_BASE_PROBE_EXT          0x3
+
+/* SBI function IDs for RFENCE extension */
+#define SBI_EXT_RFENCE_REMOTE_FENCE_I           0x0
+#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA        0x1
+#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID   0x2
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA       0x3
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID  0x4
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA       0x5
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID  0x6
+
+#define SBI_SPEC_VERSION_MAJOR_MASK     0x7F000000
+#define SBI_SPEC_VERSION_MINOR_MASK     0xffffff
+
+/* SBI return error codes */
+#define SBI_SUCCESS             0
+#define SBI_ERR_FAILURE         (-1)
+#define SBI_ERR_NOT_SUPPORTED   (-2)
+#define SBI_ERR_INVALID_PARAM   (-3)
+#define SBI_ERR_DENIED          (-4)
+#define SBI_ERR_INVALID_ADDRESS (-5)
+
+#define SBI_SPEC_VERSION_DEFAULT 0x1
+
 struct sbiret {
     long error;
     long value;
@@ -31,4 +64,34 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
  */
 void sbi_console_putchar(int ch);
 
+/*
+ * Check whether the underlying SBI implementation has RFENCE
+ *
+ * @return true if supported, false otherwise
+ */
+bool sbi_has_rfence(void);
+
+/*
+ * Instructs the remote harts to execute one or more SFENCE.VMA
+ * instructions, covering the range of virtual addresses between
+ * [start_addr, start_addr + size).
+ *
+ * Returns 0 if IPI was sent to all the targeted harts successfully
+ * or negative value if start_addr or size is not valid.
+ *
+ * @param cpu_mask cpu mask of the target harts (NULL means all harts)
+ * @param start_addr virtual address start
+ * @param size virtual address range size
+ */
+int sbi_remote_sfence_vma(const cpumask_t *cpu_mask,
+                          unsigned long start_addr,
+                          unsigned long size);
+
+/*
+ * Initialize SBI library
+ *
+ * @return 0 on success, otherwise negative errno on failure
+ */
+int sbi_init(void);
+
 #endif /* __ASM_RISCV_SBI_H__ */
diff --git a/xen/arch/riscv/sbi.c b/xen/arch/riscv/sbi.c
index 0ae166c861..c4036c8e4b 100644
--- a/xen/arch/riscv/sbi.c
+++ b/xen/arch/riscv/sbi.c
@@ -5,13 +5,26 @@
  * (anup.patel@wdc.com).
  *
  * Modified by Bobby Eshleman (bobby.eshleman@gmail.com).
+ * Modified by Oleksii Kurochko (oleksii.kurochko@gmail.com).
  *
  * Copyright (c) 2019 Western Digital Corporation or its affiliates.
- * Copyright (c) 2021-2023 Vates SAS.
+ * Copyright (c) 2021-2024 Vates SAS.
  */
 
+#include <xen/compiler.h>
+#include <xen/const.h>
+#include <xen/cpumask.h>
+#include <xen/errno.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sections.h>
+#include <xen/smp.h>
+
+#include <asm/processor.h>
 #include <asm/sbi.h>
 
+static unsigned long __ro_after_init sbi_spec_version = SBI_SPEC_VERSION_DEFAULT;
+
 struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
                         unsigned long arg0, unsigned long arg1,
                         unsigned long arg2, unsigned long arg3,
@@ -38,7 +51,265 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
     return ret;
 }
 
+static int sbi_err_map_xen_errno(int err)
+{
+    switch ( err )
+    {
+    case SBI_SUCCESS:
+        return 0;
+    case SBI_ERR_DENIED:
+        return -EACCES;
+    case SBI_ERR_INVALID_PARAM:
+        return -EINVAL;
+    case SBI_ERR_INVALID_ADDRESS:
+        return -EFAULT;
+    case SBI_ERR_NOT_SUPPORTED:
+        return -EOPNOTSUPP;
+    case SBI_ERR_FAILURE:
+        fallthrough;
+    default:
+        return -ENOSYS;
+    };
+}
+
 void sbi_console_putchar(int ch)
 {
     sbi_ecall(SBI_EXT_0_1_CONSOLE_PUTCHAR, 0, ch, 0, 0, 0, 0, 0);
 }
+
+static unsigned long sbi_major_version(void)
+{
+    return MASK_EXTR(sbi_spec_version, SBI_SPEC_VERSION_MAJOR_MASK);
+}
+
+static unsigned long sbi_minor_version(void)
+{
+    return MASK_EXTR(sbi_spec_version, SBI_SPEC_VERSION_MINOR_MASK);
+}
+
+static long sbi_ext_base_func(long fid)
+{
+    struct sbiret ret;
+
+    ret = sbi_ecall(SBI_EXT_BASE, fid, 0, 0, 0, 0, 0, 0);
+
+    /*
+     * I wasn't able to find a case in the SBI spec where sbiret.value
+     * could be negative.
+     *
+     * Unfortunately, the spec does not specify the possible values of
+     * sbiret.value, but based on the description of the SBI function,
+     * ret.value >= 0 when sbiret.error = 0. The SBI spec specifies only the
+     * possible values for sbiret.error (<= 0, where 0 is SBI_SUCCESS).
+     *
+     * BUG_ON() is added just to be sure that SBI base extension functions
+     * won't one day start to return a negative value for sbiret.value,
+     * whatever the value of sbiret.error is.
+     */
+    BUG_ON(ret.value < 0);
+
+    if ( !ret.error )
+        return ret.value;
+    else
+        return ret.error;
+}
+
+static int sbi_rfence_v02_real(unsigned long fid,
+                               unsigned long hmask, unsigned long hbase,
+                               unsigned long start, unsigned long size,
+                               unsigned long arg4)
+{
+    struct sbiret ret = {0};
+    int result = 0;
+
+    switch ( fid )
+    {
+    case SBI_EXT_RFENCE_REMOTE_FENCE_I:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                        0, 0, 0, 0);
+        break;
+
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
+    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                        start, size, 0, 0);
+        break;
+
+    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID:
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                        start, size, arg4, 0);
+        break;
+
+    default:
+        printk("%s: unknown function ID [%lu]\n",
+               __func__, fid);
+        result = -EINVAL;
+        break;
+    };
+
+    if ( ret.error )
+    {
+        result = sbi_err_map_xen_errno(ret.error);
+        printk("%s: hbase=%lu hmask=%#lx failed (error %d)\n",
+               __func__, hbase, hmask, result);
+    }
+
+    return result;
+}
+
+static int cf_check sbi_rfence_v02(unsigned long fid,
+                                   const cpumask_t *cpu_mask,
+                                   unsigned long start, unsigned long size,
+                                   unsigned long arg4, unsigned long arg5)
+{
+    unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
+    int result;
+
+    /*
+     * hart_mask_base can be set to -1 to indicate that hart_mask can be
+     * ignored and all available harts must be considered.
+     */
+    if ( !cpu_mask )
+        return sbi_rfence_v02_real(fid, 0UL, -1UL, start, size, arg4);
+
+    for_each_cpu ( cpuid, cpu_mask )
+    {
+        /*
+         * Hart IDs might not necessarily be numbered contiguously in
+         * a multiprocessor system, but at least one hart must have a
+         * hart ID of zero.
+         *
+         * This means the hart ID mapping may look like:
+         *  0, 1, 3, 65, 66, 69
+         * In such cases, more than one call to sbi_rfence_v02_real() will be
+         * needed, as a single hmask can only cover BITS_PER_LONG harts:
+         *  1. sbi_rfence_v02_real(hmask=0b1011, hbase=0)
+         *  2. sbi_rfence_v02_real(hmask=0b10011, hbase=65)
+         *
+         * The algorithm below tries to batch as many harts as possible before
+         * making an SBI call. However, batching may not always be possible.
+         * For example, consider the hart ID mapping:
+         *   0, 64, 1, 65, 2, 66
+         */
+        hartid = cpuid_to_hartid(cpuid);
+        if ( hmask )
+        {
+            if ( hartid + BITS_PER_LONG <= htop ||
+                 hbase + BITS_PER_LONG <= hartid )
+            {
+                result = sbi_rfence_v02_real(fid, hmask, hbase,
+                                             start, size, arg4);
+                if ( result )
+                    return result;
+                hmask = 0;
+            }
+            else if ( hartid < hbase )
+            {
+                /* shift the mask to fit lower hartid */
+                hmask <<= hbase - hartid;
+                hbase = hartid;
+            }
+        }
+
+        if ( !hmask )
+        {
+            hbase = hartid;
+            htop = hartid;
+        }
+        else if ( hartid > htop )
+            htop = hartid;
+
+        hmask |= BIT(hartid - hbase, UL);
+    }
+
+    if ( hmask )
+    {
+        result = sbi_rfence_v02_real(fid, hmask, hbase,
+                                     start, size, arg4);
+        if ( result )
+            return result;
+    }
+
+    return 0;
+}
+
+static int (* __ro_after_init sbi_rfence)(unsigned long fid,
+                                          const cpumask_t *cpu_mask,
+                                          unsigned long start,
+                                          unsigned long size,
+                                          unsigned long arg4,
+                                          unsigned long arg5);
+
+int sbi_remote_sfence_vma(const cpumask_t *cpu_mask,
+                          unsigned long start_addr,
+                          unsigned long size)
+{
+    ASSERT(sbi_rfence);
+
+    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA,
+                      cpu_mask, start_addr, size, 0, 0);
+}
+
+/* This function must always succeed. */
+#define sbi_get_spec_version()  \
+    sbi_ext_base_func(SBI_EXT_BASE_GET_SPEC_VERSION)
+
+#define sbi_get_firmware_id()   \
+    sbi_ext_base_func(SBI_EXT_BASE_GET_IMP_ID)
+
+#define sbi_get_firmware_version()  \
+    sbi_ext_base_func(SBI_EXT_BASE_GET_IMP_VERSION)
+
+int sbi_probe_extension(long extid)
+{
+    struct sbiret ret;
+
+    ret = sbi_ecall(SBI_EXT_BASE, SBI_EXT_BASE_PROBE_EXT, extid,
+                    0, 0, 0, 0, 0);
+    if ( !ret.error && ret.value )
+        return ret.value;
+
+    return -EOPNOTSUPP;
+}
+
+static bool sbi_spec_is_0_1(void)
+{
+    return (sbi_spec_version == SBI_SPEC_VERSION_DEFAULT);
+}
+
+bool sbi_has_rfence(void)
+{
+    return (sbi_rfence != NULL);
+}
+
+int __init sbi_init(void)
+{
+    sbi_spec_version = sbi_get_spec_version();
+
+    printk("SBI specification v%lu.%lu detected\n",
+           sbi_major_version(), sbi_minor_version());
+
+    if ( !sbi_spec_is_0_1() )
+    {
+        long sbi_fw_id = sbi_get_firmware_id();
+        long sbi_fw_version = sbi_get_firmware_version();
+
+        BUG_ON((sbi_fw_id < 0) || (sbi_fw_version < 0));
+
+        printk("SBI implementation ID=%#lx Version=%#lx\n",
+               sbi_fw_id, sbi_fw_version);
+
+        if ( sbi_probe_extension(SBI_EXT_RFENCE) > 0 )
+        {
+            sbi_rfence = sbi_rfence_v02;
+            printk("SBI v0.2 RFENCE extension detected\n");
+        }
+    }
+    else
+        panic("Ooops. SBI spec version 0.1 detected. Need to add support");
+
+    return 0;
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index e15f34509c..f147ba672f 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -8,6 +8,7 @@
 #include <public/version.h>
 
 #include <asm/early_printk.h>
+#include <asm/sbi.h>
 #include <asm/smp.h>
 #include <asm/traps.h>
 
@@ -47,6 +48,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
     trap_init();
 
+    sbi_init();
+
 #ifdef CONFIG_SELF_TESTS
     test_macros_from_bug_h();
 #endif
-- 
2.46.0




* [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-21 16:06 [PATCH v5 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (4 preceding siblings ...)
  2024-08-21 16:06 ` [PATCH v5 5/7] xen/riscv: introduce and initialize SBI RFENCE extension Oleksii Kurochko
@ 2024-08-21 16:06 ` Oleksii Kurochko
  2024-08-27 15:00   ` Jan Beulich
  2024-08-21 16:06 ` [PATCH v5 7/7] xen/riscv: introduce early_fdt_map() Oleksii Kurochko
  6 siblings, 1 reply; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Implement map_pages_to_xen() which requires several
functions to manage page tables and entries:
- pt_update()
- pt_mapping_level()
- pt_update_entry()
- pt_next_level()
- pt_check_entry()
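
Walking the page tables relies on the table index that a virtual address
selects at each level. A minimal sketch for Sv39, assuming 4 KiB pages and
9 index bits per level, is shown below; level_shift(), level_index() and
level_size() are illustrative names rather than the macros used by Xen.

```c
#include <assert.h>
#include <stdint.h>

/* Sv39 geometry (assumed here): 4 KiB pages, 9 VPN bits, 3 levels. */
#define PAGE_SHIFT       12
#define PAGETABLE_ORDER  9
#define PT_LEVELS        3
#define VPN_MASK         ((UINT64_C(1) << PAGETABLE_ORDER) - 1)

/* Bit position at which the index for level lvl starts in a VA. */
static unsigned int level_shift(unsigned int lvl)
{
    assert(lvl < PT_LEVELS);

    return PAGE_SHIFT + lvl * PAGETABLE_ORDER;
}

/* Index into the page table at level lvl selected by va. */
static unsigned int level_index(unsigned int lvl, uint64_t va)
{
    return (unsigned int)((va >> level_shift(lvl)) & VPN_MASK);
}

/* Number of bytes covered by a single leaf entry at level lvl. */
static uint64_t level_size(unsigned int lvl)
{
    return UINT64_C(1) << level_shift(lvl);
}
```

Under these assumptions a leaf entry covers 4 KiB at level 0, 2 MiB at
level 1 and 1 GiB at level 2, which is why a request for small pages has to
be resolved down to level 0.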

To support these operations, add functions for creating,
mapping, and unmapping Xen tables:
- create_table()
- map_table()
- unmap_table()

Introduce internal macros starting with PTE_* for convenience.
These macros closely resemble PTE bits, with the exception of
PTE_SMALL, which indicates that a 4KB mapping is needed.
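
To make the relation between these flags and the hardware PTE bits concrete,
the sketch below classifies a raw PTE value as a pointer to a next-level
table or as a leaf mapping, following the RISC-V rule that a valid entry with
R, W and X all clear points to another table. The helpers is_table() and
is_leaf() are hypothetical and operate on a plain 64-bit value rather than
on Xen's pte_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low PTE bits, matching the layout V=0, R=1, W=2, X=3. */
#define PTE_VALID       (UINT64_C(1) << 0)
#define PTE_READABLE    (UINT64_C(1) << 1)
#define PTE_WRITABLE    (UINT64_C(1) << 2)
#define PTE_EXECUTABLE  (UINT64_C(1) << 3)
#define PTE_RWX         (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)

static_assert(PTE_RWX == 0xE, "R/W/X occupy bits 1..3");

/* Valid, with no access permission set: points to a next-level table. */
static bool is_table(uint64_t pte)
{
    return (pte & (PTE_VALID | PTE_RWX)) == PTE_VALID;
}

/* Valid, with at least one access permission set: a leaf mapping. */
static bool is_leaf(uint64_t pte)
{
    return (pte & PTE_VALID) && (pte & PTE_RWX);
}
```

Note that under this rule a read-only entry (V and R set) is a leaf, and an
entry with V clear is neither a table nor a leaf.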

In addition introduce flush_tlb_range_va() for TLB flushing across
CPUs after updating the PTE for the requested mapping.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
 - s/xen_{un}map/{un}map
 - introduce PTE_SMALL instead of PTE_BLOCK.
 - update the comment above definition of PTE_4K_PAGES.
 - code style fixes.
 - s/RV_STAGE1_MODE > SATP_MODE_SV48/RV_STAGE1_MODE > SATP_MODE_SV39 around
   DECLARE_OFFSETS macros.
 - change type of root_maddr from unsigned long to maddr_t.
 - drop duplicated check ( if (rc) break ) in pt_update() inside while cycle.
 - s/1U/1UL
 - put 'spin_unlock(&xen_pt_lock);' ahead of TLB flush in pt_update().
 - update the commit message.
 - update the comment above ASSERT() in map_pages_to_xen() and also update
   the check within ASSERT() to check that flags has PTE_VALID bit set.
 - update the comment above pt_update() function.
 - add the comment inside pt_check_entry().
 - update the TLB flushing region in pt_update().
 - s/alloc_only/alloc_tbl
---
Changes in V4:
 - update the commit message.
 - drop xen_ prefix for functions: xen_pt_update(), xen_pt_mapping_level(),
   xen_pt_update_entry(), xen_pt_next_level(), xen_pt_check_entry().
 - drop 'select GENERIC_PT' for CONFIG_RISCV. There is no GENERIC_PT anymore.
 - update implementation of flush_xen_tlb_range_va and s/flush_xen_tlb_range_va/flush_tlb_range_va
 - s/pte_get_mfn/mfn_from_pte. I decided not to touch other similar definitions as
   they were introduced before; this naming pattern for such macros will be applied
   to newly introduced macros.
 - drop _PAGE_* definitions and use analogues of PTE_*.
 - introduce PTE_{W,X,R}_MASK and drop PAGE_{XN,W,X}_MASK. Also drop _PAGE_{*}_BIT
 - introduce PAGE_HYPERVISOR_RX.
 - drop unused now l3_table_offset.
 - drop struct pt_t as it was used only for one function. If it will be needed in the future
   pt_t will be re-introduced.
 - code style fixes in pte_is_table(). drop level argument from it.
 - update implementation and prototype of pte_is_mapping().
 - drop level argument from pt_next_level().
 - introduce definition of SATP_PPN_MASK.
 - isolate PPN of CSR_SATP before shift by PAGE_SHIFT.
 - drop set_permission() function as it is not used more than once.
 - update prototype of pt_check_entry(): drop level argument as it is not used.
 - pt_check_entry():
   - code style fixes
   - update the sanity check when modifying an entry
   - update the sanity check when removing a mapping.
 - s/read_only/alloc_only.
 - code style fixes for pt_next_level().
 - pt_update_entry() changes:
   - drop arch_level variable inside pt_update_entry()
   - drop conversion near virt to paddr_t in DECLARE_OFFSETS(offsets, virt);
   - pull out "goto out" inside first 'for' cycle.
   - drop braces for 'if' cases which have only one line.
   - indent 'out' label with one blank.
   - update the comment above alloc_only and also definition to take into
     account  that if pte population was requested or not.
   - drop target variable and rename arch_target argument of the function to
     target.
 - pt_mapping_level() changes:
   - move the check if PTE_BLOCK should be mapped on the top of the function.
   - change int i to unsigned int and update 'for' cycle correspondingly.
 - update prototype of pt_update():
   - drop the comment  above nr_mfns and drop const to be consistent with other
     arguments.
   - always flush TLB at the end of the function as non-present entries can be put
     in the TLB.
   - add fence before TLB flush to ensure that PTEs are all updated before flushing.
 - s/XEN_TABLE_NORMAL_PAGE/XEN_TABLE_NORMAL
 - add a check in map_pages_to_xen() the mfn is not INVALID_MFN.
 - add the comment on top of pt_update() how mfn = INVALID_MFN is considered.
 - s/_PAGE_BLOCK/PTE_BLOCK.
 - add the comment with additional explanation for PTE_BLOCK.
 - drop definition of FIRST_SIZE as it isn't used.
---
Changes in V3:
 - new patch. ( Technically it is reworked version of the generic approach
   which I tried to suggest in the previous version )
---
 xen/arch/riscv/Makefile                     |   1 +
 xen/arch/riscv/include/asm/flushtlb.h       |  12 +
 xen/arch/riscv/include/asm/mm.h             |   2 +
 xen/arch/riscv/include/asm/page.h           |  57 +++
 xen/arch/riscv/include/asm/riscv_encoding.h |   1 +
 xen/arch/riscv/mm.c                         |   9 -
 xen/arch/riscv/pt.c                         | 420 ++++++++++++++++++++
 7 files changed, 493 insertions(+), 9 deletions(-)
 create mode 100644 xen/arch/riscv/pt.c

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 334fd24547..d058ea4e95 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 obj-y += entry.o
 obj-y += mm.o
+obj-y += pt.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index f4a735fd6c..031d781aa2 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -5,12 +5,24 @@
 #include <xen/bug.h>
 #include <xen/cpumask.h>
 
+#include <asm/sbi.h>
+
 /* Flush TLB of local processor for address va. */
 static inline void flush_tlb_one_local(vaddr_t va)
 {
     asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
 }
 
+/*
+ * Flush a range of VA's hypervisor mappings from the TLB of all
+ * processors (harts) in the system.
+ */
+static inline void flush_tlb_range_va(vaddr_t va, size_t size)
+{
+    BUG_ON(!sbi_has_rfence());
+    sbi_remote_sfence_vma(NULL, va, size);
+}
+
 /*
  * Filter the given set of CPUs, removing those that definitely flushed their
  * TLB since @page_timestamp.
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index a0bdc2bc3a..ce1557bb27 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -42,6 +42,8 @@ static inline void *maddr_to_virt(paddr_t ma)
 #define virt_to_mfn(va)     __virt_to_mfn(va)
 #define mfn_to_virt(mfn)    __mfn_to_virt(mfn)
 
+#define mfn_from_pte(pte) maddr_to_mfn(pte_to_paddr(pte))
+
 struct page_info
 {
     /* Each frame can be threaded onto a doubly-linked list. */
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 55916eaa92..f148d82261 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -21,6 +21,11 @@
 #define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
 #define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
 
+/*
+ * PTE format:
+ * | XLEN-1  10 | 9             8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
+ *       PFN      reserved for SW   D   A   G   U   X   W   R   V
+ */
 #define PTE_VALID                   BIT(0, UL)
 #define PTE_READABLE                BIT(1, UL)
 #define PTE_WRITABLE                BIT(2, UL)
@@ -34,15 +39,53 @@
 #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
 #define PTE_TABLE                   (PTE_VALID)
 
+#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
 #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
+#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)
 
 #define PAGE_HYPERVISOR             PAGE_HYPERVISOR_RW
 
+/*
+ * The PTE format does not contain the following bits within itself;
+ * they are created artificially to inform the Xen page table
+ * handling algorithm. These bits should not be explicitly written
+ * to the PTE entry.
+ */
+#define PTE_SMALL       BIT(10, UL)
+#define PTE_POPULATE    BIT(11, UL)
+
+#define PTE_R_MASK(x)   ((x) & PTE_READABLE)
+#define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
+#define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
+
+#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE))
+
 /* Calculate the offsets into the pagetables for a given VA */
 #define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
 
 #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) & VPN_MASK)
 
+#define PAGETABLE_ORDER_MASK ((_AC(1, U) << PAGETABLE_ORDER) - 1)
+#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & PAGETABLE_ORDER_MASK)
+
+#if RV_STAGE1_MODE > SATP_MODE_SV39
+#error "need to update DECLARE_OFFSETS macros"
+#else
+
+#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
+#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
+#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
+
+/* Generate an array @var containing the offset for each level from @addr */
+#define DECLARE_OFFSETS(var, addr)          \
+    const unsigned int var[] = {            \
+        l0_table_offset(addr),              \
+        l1_table_offset(addr),              \
+        l2_table_offset(addr),              \
+    }
+
+#endif
+
 /* Page Table entry */
 typedef struct {
 #ifdef CONFIG_RISCV_64
@@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
     return p.pte & PTE_VALID;
 }
 
+static inline bool pte_is_table(const pte_t p)
+{
+    return ((p.pte & (PTE_VALID |
+                      PTE_READABLE |
+                      PTE_WRITABLE |
+                      PTE_EXECUTABLE)) == PTE_VALID);
+}
+
+static inline bool pte_is_mapping(const pte_t p)
+{
+    return (p.pte & PTE_VALID) &&
+           (p.pte & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE));
+}
+
 static inline void invalidate_icache(void)
 {
     BUG_ON("unimplemented");
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index 58abe5eccc..d80cef0093 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -164,6 +164,7 @@
 #define SSTATUS_SD			SSTATUS64_SD
 #define SATP_MODE			SATP64_MODE
 #define SATP_MODE_SHIFT			SATP64_MODE_SHIFT
+#define SATP_PPN_MASK			_UL(0x00000FFFFFFFFFFF)
 
 #define HGATP_PPN			HGATP64_PPN
 #define HGATP_VMID_SHIFT		HGATP64_VMID_SHIFT
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index b8ff91cf4e..e8430def14 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -369,12 +369,3 @@ int destroy_xen_mappings(unsigned long s, unsigned long e)
     BUG_ON("unimplemented");
     return -1;
 }
-
-int map_pages_to_xen(unsigned long virt,
-                     mfn_t mfn,
-                     unsigned long nr_mfns,
-                     unsigned int flags)
-{
-    BUG_ON("unimplemented");
-    return -1;
-}
diff --git a/xen/arch/riscv/pt.c b/xen/arch/riscv/pt.c
new file mode 100644
index 0000000000..15eb02fe9e
--- /dev/null
+++ b/xen/arch/riscv/pt.c
@@ -0,0 +1,420 @@
+#include <xen/bug.h>
+#include <xen/domain_page.h>
+#include <xen/errno.h>
+#include <xen/mm.h>
+#include <xen/mm-frame.h>
+#include <xen/pmap.h>
+#include <xen/spinlock.h>
+
+#include <asm/flushtlb.h>
+#include <asm/page.h>
+
+static inline const mfn_t get_root_page(void)
+{
+    paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
+
+    return maddr_to_mfn(root_maddr);
+}
+
+/* Sanity check of the entry. */
+static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
+{
+    /*
+     * See the comment about the possible combination of (mfn, flags) in
+     * the comment above pt_update().
+     */
+
+    /* Sanity check when modifying an entry. */
+    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
+    {
+        /* We don't allow modifying an invalid entry. */
+        if ( !pte_is_valid(entry) )
+        {
+            printk("Modifying invalid entry is not allowed.\n");
+            return false;
+        }
+
+        /* We don't allow modifying a table entry */
+        if ( pte_is_table(entry) )
+        {
+            printk("Modifying a table entry is not allowed.\n");
+            return false;
+        }
+    }
+    /* Sanity check when inserting a mapping */
+    else if ( flags & PTE_VALID )
+    {
+        /* We should be here with a valid MFN. */
+        ASSERT(!mfn_eq(mfn, INVALID_MFN));
+
+        /*
+         * We don't allow replacing any valid entry.
+         *
+         * Note that the function pt_update() relies on this
+         * assumption and will skip the TLB flush (when Svvptc
+         * extension will be ratified). The function will need
+         * to be updated if the check is relaxed.
+         */
+        if ( pte_is_valid(entry) )
+        {
+            if ( pte_is_mapping(entry) )
+                printk("Changing MFN for a valid entry is not allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
+                       mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
+            else
+                printk("Trying to replace a table with a mapping.\n");
+            return false;
+        }
+    }
+    /* Sanity check when removing a mapping. */
+    else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
+    {
+        /* We should be here with an invalid MFN. */
+        ASSERT(mfn_eq(mfn, INVALID_MFN));
+
+        /* We don't allow removing a table */
+        if ( pte_is_table(entry) )
+        {
+            printk("Removing a table is not allowed.\n");
+            return false;
+        }
+    }
+    /* Sanity check when populating the page-table. No check so far. */
+    else
+    {
+        ASSERT(flags & PTE_POPULATE);
+        /* We should be here with an invalid MFN */
+        ASSERT(mfn_eq(mfn, INVALID_MFN));
+    }
+
+    return true;
+}
+
+static pte_t *map_table(mfn_t mfn)
+{
+    /*
+     * During early boot, map_domain_page() may be unusable. Use the
+     * PMAP to temporarily map a page-table.
+     */
+    if ( system_state == SYS_STATE_early_boot )
+        return pmap_map(mfn);
+
+    return map_domain_page(mfn);
+}
+
+static void unmap_table(const pte_t *table)
+{
+    /*
+     * During early boot, map_table() will not use map_domain_page()
+     * but the PMAP.
+     */
+    if ( system_state == SYS_STATE_early_boot )
+        pmap_unmap(table);
+    else
+        unmap_domain_page(table);
+}
+
+static int create_table(pte_t *entry)
+{
+    mfn_t mfn;
+    void *p;
+    pte_t pte;
+
+    if ( system_state != SYS_STATE_early_boot )
+    {
+        struct page_info *pg = alloc_domheap_page(NULL, 0);
+
+        if ( pg == NULL )
+            return -ENOMEM;
+
+        mfn = page_to_mfn(pg);
+    }
+    else
+        mfn = alloc_boot_pages(1, 1);
+
+    p = map_table(mfn);
+    clear_page(p);
+    unmap_table(p);
+
+    pte = pte_from_mfn(mfn, PTE_TABLE);
+    write_pte(entry, pte);
+
+    return 0;
+}
+
+#define XEN_TABLE_MAP_FAILED 0
+#define XEN_TABLE_SUPER_PAGE 1
+#define XEN_TABLE_NORMAL 2
+
+/*
+ * Take the currently mapped table, find the corresponding entry,
+ * and map the next table, if available.
+ *
+ * The alloc_tbl parameter, as computed by pt_update_entry(), is true when
+ * intermediate tables must not be allocated if they are not present.
+ *
+ * Return values:
+ *  XEN_TABLE_MAP_FAILED: Either alloc_tbl was set and the entry
+ *  was empty, or allocating a new page failed.
+ *  XEN_TABLE_NORMAL: next level or leaf mapped normally
+ *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
+ */
+static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned int offset)
+{
+    pte_t *entry;
+    int ret;
+    mfn_t mfn;
+
+    entry = *table + offset;
+
+    if ( !pte_is_valid(*entry) )
+    {
+        if ( alloc_tbl )
+            return XEN_TABLE_MAP_FAILED;
+
+        ret = create_table(entry);
+        if ( ret )
+            return XEN_TABLE_MAP_FAILED;
+    }
+
+    if ( pte_is_mapping(*entry) )
+        return XEN_TABLE_SUPER_PAGE;
+
+    mfn = mfn_from_pte(*entry);
+
+    unmap_table(*table);
+    *table = map_table(mfn);
+
+    return XEN_TABLE_NORMAL;
+}
+
+/* Update an entry at the level @target. */
+static int pt_update_entry(mfn_t root, unsigned long virt,
+                           mfn_t mfn, unsigned int target,
+                           unsigned int flags)
+{
+    int rc;
+    unsigned int level = HYP_PT_ROOT_LEVEL;
+    pte_t *table;
+    /*
+     * The intermediate page table shouldn't be allocated when MFN isn't
+     * valid and we are not populating page table.
+     * This means we are either modifying permissions, removing an entry,
+     * or inserting a brand new entry.
+     *
+     * See the comment above pt_update() for an additional explanation about
+     * combinations of (mfn, flags).
+    */
+    bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE);
+    pte_t pte, *entry;
+
+    /* convenience aliases */
+    DECLARE_OFFSETS(offsets, virt);
+
+    table = map_table(root);
+    for ( ; level > target; level-- )
+    {
+        rc = pt_next_level(alloc_tbl, &table, offsets[level]);
+        if ( rc == XEN_TABLE_MAP_FAILED )
+        {
+            rc = 0;
+
+            /*
+             * We are here because pt_next_level has failed to map
+             * the intermediate page table (e.g the table does not exist
+             * and the pt is read-only). It is a valid case when
+             * removing a mapping as it may not exist in the page table.
+             * In this case, just ignore it.
+             */
+            if ( flags & PTE_VALID )
+            {
+                printk("%s: Unable to map level %u\n", __func__, level);
+                rc = -ENOENT;
+            }
+
+            goto out;
+        }
+        else if ( rc != XEN_TABLE_NORMAL )
+            break;
+    }
+
+    if ( level != target )
+    {
+        printk("%s: Shattering superpage is not supported\n", __func__);
+        rc = -EOPNOTSUPP;
+        goto out;
+    }
+
+    entry = table + offsets[level];
+
+    rc = -EINVAL;
+    if ( !pt_check_entry(*entry, mfn, flags) )
+        goto out;
+
+    /* We are removing the page */
+    if ( !(flags & PTE_VALID) )
+        memset(&pte, 0x00, sizeof(pte));
+    else
+    {
+        /* We are inserting a mapping => Create new pte. */
+        if ( !mfn_eq(mfn, INVALID_MFN) )
+            pte = pte_from_mfn(mfn, PTE_VALID);
+        else /* We are updating the permission => Copy the current pte. */
+            pte = *entry;
+
+        /* update permission according to the flags */
+        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
+    }
+
+    write_pte(entry, pte);
+
+    rc = 0;
+
+ out:
+    unmap_table(table);
+
+    return rc;
+}
+
+/* Return the level where mapping should be done */
+static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned long nr,
+                            unsigned int flags)
+{
+    unsigned int level = 0;
+    unsigned long mask;
+    unsigned int i;
+
+    /* Use block mapping unless the caller requests 4K mapping */
+    if ( unlikely(flags & PTE_SMALL) )
+        return level;
+
+    /*
+     * Don't take into account the MFN when removing a mapping (i.e.
+     * INVALID_MFN) to calculate the correct target order.
+     *
+     * `vfn` and `mfn` must be both superpage aligned.
+     * They are or-ed together and then checked against the size of
+     * each level.
+     *
+     * `nr` is not included and checked separately to allow
+     * superpage mapping even if it is not properly aligned (the
+     * user may have asked to map 2MB + 4k).
+     */
+    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
+    mask |= vfn;
+
+    for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- )
+    {
+        if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) &&
+             (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) )
+        {
+            level = i;
+            break;
+        }
+    }
+
+    return level;
+}
+
+static DEFINE_SPINLOCK(xen_pt_lock);
+
+/*
+ * If `mfn` equals `INVALID_MFN`, it indicates that the following page table
+ * update operation might be related to either populating the table (
+ * PTE_POPULATE will be set additionally), destroying a mapping, or modifying
+ * an existing mapping.
+ *
+ * If `mfn` is valid and flags has PTE_VALID bit set then it means that
+ * inserting will be done.
+ */
+static int pt_update(unsigned long virt,
+                     mfn_t mfn,
+                     unsigned long nr_mfns,
+                     unsigned int flags)
+{
+    int rc = 0;
+    unsigned long vfn = virt >> PAGE_SHIFT;
+    unsigned long left = nr_mfns;
+
+    const mfn_t root = get_root_page();
+
+    /*
+     * It is a bad idea to have a mapping that is both writeable and
+     * executable.
+     * When modifying/creating a mapping (i.e. PTE_VALID is set),
+     * prevent any update if this happens.
+     */
+    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) )
+    {
+        printk("Mappings should not be both Writeable and Executable.\n");
+        return -EINVAL;
+    }
+
+    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
+    {
+        printk("The virtual address is not aligned to the page-size.\n");
+        return -EINVAL;
+    }
+
+    spin_lock(&xen_pt_lock);
+
+    while ( left )
+    {
+        unsigned int order, level;
+
+        level = pt_mapping_level(vfn, mfn, left, flags);
+        order = XEN_PT_LEVEL_ORDER(level);
+
+        ASSERT(left >= BIT(order, UL));
+
+        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level, flags);
+        if ( rc )
+            break;
+
+        vfn += 1UL << order;
+        if ( !mfn_eq(mfn, INVALID_MFN) )
+            mfn = mfn_add(mfn, 1UL << order);
+
+        left -= (1UL << order);
+    }
+
+    /* Ensure that PTEs are all updated before flushing */
+    RISCV_FENCE(rw, rw);
+
+    spin_unlock(&xen_pt_lock);
+
+    /*
+     * Always flush TLB at the end of the function as non-present entries
+     * can be put in the TLB.
+     *
+     * The remote fence operation applies to the entire address space if
+     * either:
+     *  - start and size are both 0, or
+     *  - size is equal to 2^XLEN-1.
+     *
+     * TODO: come up with something which will allow not to flush the entire
+     *       address space.
+     */
+    flush_tlb_range_va(0, 0);
+
+    return rc;
+}
+
+int map_pages_to_xen(unsigned long virt,
+                     mfn_t mfn,
+                     unsigned long nr_mfns,
+                     unsigned int flags)
+{
+    /*
+     * Ensure that flags has PTE_VALID bit as map_pages_to_xen() is supposed
+     * to create a mapping.
+     *
+     * Ensure that we have a valid MFN before proceeding.
+     *
+     * If the MFN is invalid, pt_update() might misinterpret the operation,
+     * treating it as either a population, a mapping destruction,
+     * or a mapping modification.
+     */
+    ASSERT(!mfn_eq(mfn, INVALID_MFN) && (flags & PTE_VALID));
+
+    return pt_update(virt, mfn, nr_mfns, flags);
+}
-- 
2.46.0




* [PATCH v5 7/7] xen/riscv: introduce early_fdt_map()
  2024-08-21 16:06 [PATCH v5 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (5 preceding siblings ...)
  2024-08-21 16:06 ` [PATCH v5 6/7] xen/riscv: page table handling Oleksii Kurochko
@ 2024-08-21 16:06 ` Oleksii Kurochko
  6 siblings, 0 replies; 32+ messages in thread
From: Oleksii Kurochko @ 2024-08-21 16:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce a function that maps the FDT into Xen.

Also, initialization of device_tree_flattened happens using
early_fdt_map().

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
 - drop usage of PTE_BLOCK for flag argument of map_pages_to_xen() in early_fdt_map()
   as block mapping is now default behaviour. Also PTE_BLOCK was dropped in the patch
   "xen/riscv: page table handling".
---
Changes in V4:
 - s/_PAGE_BLOCK/PTE_BLOCK
 - Add Acked-by: Jan Beulich <jbeulich@suse.com>
 - unwrap two lines in panic() in case when device_tree_flattened is NULL
   so grep-ing for any part of the message line will always produce a hit.
 - slightly update the commit message.
---
Changes in V3:
 - Code style fixes
 - s/SZ_2M/MB(2)
 - fix condition to check if early_fdt_map() in setup.c return NULL or not.
---
Changes in V2:
 - rework early_fdt_map to use map_pages_to_xen()
 - move call early_fdt_map() to C code after MMU is enabled.
---
 xen/arch/riscv/include/asm/mm.h |  2 ++
 xen/arch/riscv/mm.c             | 55 +++++++++++++++++++++++++++++++++
 xen/arch/riscv/setup.c          |  7 +++++
 3 files changed, 64 insertions(+)

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index ce1557bb27..4b7b00b850 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -259,4 +259,6 @@ static inline unsigned int arch_get_dma_bitsize(void)
 
 void setup_fixmap_mappings(void);
 
+void *early_fdt_map(paddr_t fdt_paddr);
+
 #endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index e8430def14..4a628aef83 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -1,13 +1,16 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 
+#include <xen/bootfdt.h>
 #include <xen/bug.h>
 #include <xen/compiler.h>
 #include <xen/init.h>
 #include <xen/kernel.h>
+#include <xen/libfdt/libfdt.h>
 #include <xen/macros.h>
 #include <xen/mm.h>
 #include <xen/pfn.h>
 #include <xen/sections.h>
+#include <xen/sizes.h>
 
 #include <asm/early_printk.h>
 #include <asm/csr.h>
@@ -369,3 +372,55 @@ int destroy_xen_mappings(unsigned long s, unsigned long e)
     BUG_ON("unimplemented");
     return -1;
 }
+
+void * __init early_fdt_map(paddr_t fdt_paddr)
+{
+    /* We are using 2MB superpage for mapping the FDT */
+    paddr_t base_paddr = fdt_paddr & XEN_PT_LEVEL_MAP_MASK(1);
+    paddr_t offset;
+    void *fdt_virt;
+    uint32_t size;
+    int rc;
+
+    /*
+     * Check whether the physical FDT address is set and meets the minimum
+     * alignment requirement. Since we are relying on MIN_FDT_ALIGN to be at
+     * least 8 bytes so that we always access the magic and size fields
+     * of the FDT header after mapping the first chunk, double check if
+     * that is indeed the case.
+     */
+    BUILD_BUG_ON(MIN_FDT_ALIGN < 8);
+    if ( !fdt_paddr || fdt_paddr % MIN_FDT_ALIGN )
+        return NULL;
+
+    /* The FDT is mapped using 2MB superpage */
+    BUILD_BUG_ON(BOOT_FDT_VIRT_START % MB(2));
+
+    rc = map_pages_to_xen(BOOT_FDT_VIRT_START, maddr_to_mfn(base_paddr),
+                          MB(2) >> PAGE_SHIFT,
+                          PAGE_HYPERVISOR_RO);
+    if ( rc )
+        panic("Unable to map the device-tree.\n");
+
+    offset = fdt_paddr % XEN_PT_LEVEL_SIZE(1);
+    fdt_virt = (void *)BOOT_FDT_VIRT_START + offset;
+
+    if ( fdt_magic(fdt_virt) != FDT_MAGIC )
+        return NULL;
+
+    size = fdt_totalsize(fdt_virt);
+    if ( size > BOOT_FDT_VIRT_SIZE )
+        return NULL;
+
+    if ( (offset + size) > MB(2) )
+    {
+        rc = map_pages_to_xen(BOOT_FDT_VIRT_START + MB(2),
+                              maddr_to_mfn(base_paddr + MB(2)),
+                              MB(2) >> PAGE_SHIFT,
+                              PAGE_HYPERVISOR_RO);
+        if ( rc )
+            panic("Unable to map the device-tree\n");
+    }
+
+    return fdt_virt;
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index f147ba672f..c9a6909c91 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -2,6 +2,7 @@
 
 #include <xen/bug.h>
 #include <xen/compile.h>
+#include <xen/device_tree.h>
 #include <xen/init.h>
 #include <xen/mm.h>
 
@@ -56,6 +57,12 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
     setup_fixmap_mappings();
 
+    device_tree_flattened = early_fdt_map(dtb_addr);
+    if ( !device_tree_flattened )
+        panic("Invalid device tree blob at physical address %#lx. The DTB must be 8-byte aligned and must not exceed %lld bytes in size.\n\n"
+              "Please check your bootloader.\n",
+              dtb_addr, BOOT_FDT_VIRT_SIZE);
+
     printk("All set up\n");
 
     for ( ;; )
-- 
2.46.0




* Re: [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic()
  2024-08-21 16:06 ` [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic() Oleksii Kurochko
@ 2024-08-27 10:06   ` Jan Beulich
  2024-08-28  9:21     ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-27 10:06 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 21.08.2024 18:06, Oleksii Kurochko wrote:
> In Xen, memory-ordered atomic operations are not necessary,

This is an interesting statement. I'd like to suggest that you at least
limit it to the two constructs in question, rather than stating this
globally for everything.

> based on {read,write}_atomic() implementations for other architectures.
> Therefore, {read,write}{b,w,l,q}_cpu() can be used instead of
> {read,write}{b,w,l,q}(), allowing the caller to decide if additional
> fences should be applied before or after {read,write}_atomic().
> 
> Change the declaration of _write_atomic() to accept a 'volatile void *'
> type for the 'x' argument instead of 'unsigned long'.
> This prevents compilation errors such as:
> 1."discards 'volatile' qualifier from pointer target type," which occurs
>   due to the initialization of a volatile pointer,
>   e.g., `volatile uint8_t *ptr = p;` in _add_sized().

I don't follow you here. It's the other argument of write_atomic() that
has ptr passed there.

> 2."incompatible type for argument 2 of '_write_atomic'," which can occur
>   when calling write_pte(), where 'x' is of type pte_t rather than
>   unsigned long.

How's this related to the change at hand? That isn't different ahead of
this change, is it?

> --- a/xen/arch/riscv/include/asm/atomic.h
> +++ b/xen/arch/riscv/include/asm/atomic.h
> @@ -31,21 +31,17 @@
>  
>  void __bad_atomic_size(void);
>  
> -/*
> - * Legacy from Linux kernel. For some reason they wanted to have ordered
> - * read/write access. Thereby read* is used instead of read*_cpu()
> - */
>  static always_inline void read_atomic_size(const volatile void *p,
>                                             void *res,
>                                             unsigned int size)
>  {
>      switch ( size )
>      {
> -    case 1: *(uint8_t *)res = readb(p); break;
> -    case 2: *(uint16_t *)res = readw(p); break;
> -    case 4: *(uint32_t *)res = readl(p); break;
> +    case 1: *(uint8_t *)res = readb_cpu(p); break;
> +    case 2: *(uint16_t *)res = readw_cpu(p); break;
> +    case 4: *(uint32_t *)res = readl_cpu(p); break;
>  #ifndef CONFIG_RISCV_32
> -    case 8: *(uint32_t *)res = readq(p); break;
> +    case 8: *(uint32_t *)res = readq_cpu(p); break;
>  #endif
>      default: __bad_atomic_size(); break;
>      }
> @@ -58,15 +54,16 @@ static always_inline void read_atomic_size(const volatile void *p,
>  })
>  
>  static always_inline void _write_atomic(volatile void *p,
> -                                       unsigned long x, unsigned int size)
> +                                        volatile void *x,

If this really needs to become a pointer, it ought to also be pointer-
to-const. Otherwise it is yet more confusing which operand is which.

> +                                        unsigned int size)
>  {
>      switch ( size )
>      {
> -    case 1: writeb(x, p); break;
> -    case 2: writew(x, p); break;
> -    case 4: writel(x, p); break;
> +    case 1: writeb_cpu(*(uint8_t *)x, p); break;
> +    case 2: writew_cpu(*(uint16_t *)x, p); break;
> +    case 4: writel_cpu(*(uint32_t *)x, p); break;
>  #ifndef CONFIG_RISCV_32
> -    case 8: writeq(x, p); break;
> +    case 8: writeq_cpu(*(uint64_t *)x, p); break;

Of course you may not cast away const-ness then. You'd also be casting
away volatile-ness, but (as per above) I question the need for volatile
on x.
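
To illustrate the point about qualifiers, here is a minimal sketch of a pointer-to-const source operand. The store8()/store16()/store32()/store64() helpers and the name write_atomic_sketch() are hypothetical stand-ins for write{b,w,l,q}_cpu() and _write_atomic(); they are not the Xen implementations.

```c
#include <stdint.h>

/* Hypothetical stand-ins for write{b,w,l,q}_cpu(): plain volatile stores. */
static inline void store8(uint8_t v, volatile void *p)   { *(volatile uint8_t *)p = v; }
static inline void store16(uint16_t v, volatile void *p) { *(volatile uint16_t *)p = v; }
static inline void store32(uint32_t v, volatile void *p) { *(volatile uint32_t *)p = v; }
static inline void store64(uint64_t v, volatile void *p) { *(volatile uint64_t *)p = v; }

/*
 * The source operand is pointer-to-const and not volatile, and every cast
 * keeps the const qualifier rather than dropping it.
 */
static inline int write_atomic_sketch(volatile void *p, const void *x,
                                      unsigned int size)
{
    switch ( size )
    {
    case 1: store8(*(const uint8_t *)x, p); break;
    case 2: store16(*(const uint16_t *)x, p); break;
    case 4: store32(*(const uint32_t *)x, p); break;
    case 8: store64(*(const uint64_t *)x, p); break;
    default: return -1; /* the patch calls __bad_atomic_size() here */
    }

    return 0;
}
```

With this shape the destination (volatile) and the source (const) can no longer be confused at a glance.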

Jan



* Re: [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-21 16:06 ` [PATCH v5 2/7] xen/riscv: set up fixmap mappings Oleksii Kurochko
@ 2024-08-27 10:29   ` Jan Beulich
  2024-08-28  9:53     ` oleksii.kurochko
  2024-08-30 11:55     ` oleksii.kurochko
  0 siblings, 2 replies; 32+ messages in thread
From: Jan Beulich @ 2024-08-27 10:29 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 21.08.2024 18:06, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/config.h
> +++ b/xen/arch/riscv/include/asm/config.h
> @@ -41,8 +41,10 @@
>   * Start addr          | End addr         | Slot       | area description
>   * ============================================================================
>   *                   .....                 L2 511          Unused
> - *  0xffffffffc0600000  0xffffffffc0800000 L2 511          Fixmap
> - *  0xffffffffc0200000  0xffffffffc0600000 L2 511          FDT
> + *  0xffffffffc0A00000  0xffffffffc0C00000 L2 511          Fixmap

Nit: Please can you avoid using mixed case in numbers?

> @@ -74,6 +76,15 @@
>  #error "unsupported RV_STAGE1_MODE"
>  #endif
>  
> +#define GAP_SIZE                MB(2)
> +
> +#define XEN_VIRT_SIZE           MB(2)
> +
> +#define BOOT_FDT_VIRT_START     (XEN_VIRT_START + XEN_VIRT_SIZE + GAP_SIZE)
> +#define BOOT_FDT_VIRT_SIZE      MB(4)
> +
> +#define FIXMAP_BASE             (BOOT_FDT_VIRT_START + BOOT_FDT_VIRT_SIZE + GAP_SIZE)

Nit: Overly long line.

> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/fixmap.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * fixmap.h: compile-time virtual memory allocation
> + */
> +#ifndef ASM_FIXMAP_H
> +#define ASM_FIXMAP_H
> +
> +#include <xen/bug.h>
> +#include <xen/page-size.h>
> +#include <xen/pmap.h>
> +
> +#include <asm/page.h>
> +
> +#define FIXMAP_ADDR(n) (FIXMAP_BASE + (n) * PAGE_SIZE)
> +
> +/* Fixmap slots */
> +#define FIX_PMAP_BEGIN (0) /* Start of PMAP */
> +#define FIX_PMAP_END (FIX_PMAP_BEGIN + NUM_FIX_PMAP - 1) /* End of PMAP */
> +#define FIX_MISC (FIX_PMAP_END + 1)  /* Ephemeral mappings of hardware */
> +
> +#define FIX_LAST FIX_MISC
> +
> +#define FIXADDR_START FIXMAP_ADDR(0)
> +#define FIXADDR_TOP FIXMAP_ADDR(FIX_LAST + 1)
> +
> +#ifndef __ASSEMBLY__
> +
> +/*
> + * Direct access to xen_fixmap[] should only happen when {set,
> + * clear}_fixmap() is unusable (e.g. where we would end up to
> + * recursively call the helpers).
> + */
> +extern pte_t xen_fixmap[];

I'm afraid I keep being irritated by the comment: What recursive use of
helpers is being talked about here? I can't see anything recursive in this
patch. If this starts happening with a subsequent patch, then you have
two options: Move the declaration + comment there, or clarify in the
description (in enough detail) what this is about.

> @@ -81,6 +82,18 @@ static inline void flush_page_to_ram(unsigned long mfn, bool sync_icache)
>      BUG_ON("unimplemented");
>  }
>  
> +/* Write a pagetable entry. */
> +static inline void write_pte(pte_t *p, pte_t pte)
> +{
> +    write_atomic(p, pte);
> +}
> +
> +/* Read a pagetable entry. */
> +static inline pte_t read_pte(pte_t *p)
> +{
> +    return read_atomic(p);

This only works because of the strange type trickery you're playing in
read_atomic(). Look at x86 code - there's a strict expectation that the
type can be converted to/from unsigned long. And page table accessors
are written with that taken into consideration. Same goes for write_pte()
of course, with the respective comment on the earlier patch in mind.

Otoh I see that Arm does something very similar. If you have a strong
need / desire to follow that, then please at least split the two
entirely separate aspects that patch 1 presently changes both in one go.

Jan



* Re: [PATCH v5 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-21 16:06 ` [PATCH v5 4/7] xen/riscv: introduce functionality to work with CPU info Oleksii Kurochko
@ 2024-08-27 13:44   ` Jan Beulich
  2024-08-28 10:56     ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-27 13:44 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 21.08.2024 18:06, Oleksii Kurochko wrote:
> Introduce struct pcpu_info to store pCPU-related information.
> Initially, it includes only processor_id and hart id, but it
> will be extended to include guest CPU information and
> temporary variables for saving/restoring vCPU registers.
> 
> Add set_processor_id() and get_processor_id() functions to set
> and retrieve the processor_id stored in pcpu_info.
> 
> Introduce cpuid_to_hartid_map() to convert Xen logical CPUs to
> hart IDs (physical CPU IDs).

There's no function of that name anymore.

> ---
> Changes in V5:
>  - add hart_id to pcpu_info;
>  - add comments to pcpu_info members.
>  - define INVALID_HARTID as ULONG_MAX as mhart_id register has MXLEN which is
>    equal to 32 for RV-32 and 64 for RV-64.
>  - add hart_id to pcpu_info structure.
>  - drop cpuid_to_hartid_map[] and use pcpu_info[] for the same purpose.
>  - introduce new function setup_tp(cpuid).
>  - add the FIXME commit on top of pcpu_info[].

Once again "comment" here? And that despite ...

>  - setup TP register before start_xen() being called.
>  - update the commit message.
>  - change "commit message" to "comment" in "Changes in V4" in "update the comment
>    above the code of TP..."

... this?

> --- a/xen/arch/riscv/include/asm/processor.h
> +++ b/xen/arch/riscv/include/asm/processor.h
> @@ -12,8 +12,33 @@
>  
>  #ifndef __ASSEMBLY__
>  
> -/* TODO: need to be implemeted */
> -#define smp_processor_id() 0
> +#include <xen/bug.h>
> +
> +register struct pcpu_info *tp asm ("tp");

Nit: Strictly speaking there need to be blanks inside the parentheses.
But maybe an exception for a register variable name declaration is okay.

> +struct pcpu_info {
> +    unsigned int processor_id; /* Xen CPU id */
> +    unsigned long hart_id; /* physical CPU id */
> +};
> +
> +/* tp points to one of these */
> +extern struct pcpu_info pcpu_info[NR_CPUS];
> +
> +#define get_processor_id()      (tp->processor_id)
> +#define set_processor_id(id)    do { \
> +    tp->processor_id = (id);         \
> +} while (0)
> +
> +static inline unsigned int smp_processor_id(void)
> +{
> +    unsigned int id;
> +
> +    id = get_processor_id();

Nit: This can easily be the initializer of the variable.

> +    BUG_ON(id > NR_CPUS);

I'm pretty sure I pointed out before that this is off by 1: NR_CPUS
itself is invalid, too.
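
A minimal sketch of the bound in question, assuming an illustrative NR_CPUS of 4 and a hypothetical helper name cpu_id_is_valid(): valid ids run from 0 to NR_CPUS - 1, so the check would need to be BUG_ON(id >= NR_CPUS).

```c
#include <stdbool.h>

#define NR_CPUS 4 /* illustrative value only */

/*
 * Valid logical CPU ids index pcpu_info[NR_CPUS], i.e. 0 .. NR_CPUS - 1.
 * NR_CPUS itself is one past the end and must be rejected.
 */
static inline bool cpu_id_is_valid(unsigned int id)
{
    return id < NR_CPUS;
}
```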

> --- a/xen/arch/riscv/include/asm/smp.h
> +++ b/xen/arch/riscv/include/asm/smp.h
> @@ -5,6 +5,10 @@
>  #include <xen/cpumask.h>
>  #include <xen/percpu.h>
>  
> +#include <asm/processor.h>
> +
> +#define INVALID_HARTID ULONG_MAX

So what if the firmware reports this value for one of the harts?

> @@ -14,6 +18,13 @@ DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
>   */
>  #define park_offline_cpus false
>  
> +void smp_set_bootcpu_id(unsigned long boot_cpu_hartid);
> +
> +/*
> + * Mapping between linux logical cpu index and hartid.
> + */
> +#define cpuid_to_hartid(cpu) pcpu_info[cpu].hart_id

If I'm not mistaken Misra demands parentheses around the expression
even in cases like this one (where at the use sites you can't really
do anything [that makes at least some sense] to break what the macro
is supposed to do).
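
For reference, a sketch of the parenthesized form, with an illustrative NR_CPUS value and a local copy of the structure. A parenthesized lvalue remains an lvalue in C, so assignments through the macro keep working.

```c
#define NR_CPUS 4 /* illustrative value only */

struct pcpu_info {
    unsigned int processor_id; /* Xen CPU id */
    unsigned long hart_id;     /* physical CPU id */
};

static struct pcpu_info pcpu_info[NR_CPUS];

/* Whole expansion wrapped in parentheses; still usable on the left of '='. */
#define cpuid_to_hartid(cpu) (pcpu_info[cpu].hart_id)
```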

> --- a/xen/arch/riscv/riscv64/head.S
> +++ b/xen/arch/riscv/riscv64/head.S
> @@ -55,6 +55,10 @@ FUNC(start)
>           */
>          jal     reset_stack
>  
> +        /* Xen's boot cpu id is equal to 0 so setup TP register for it */
> +        mv      a0, x0
> +        jal     setup_tp

I'm not going to insist, but for the casual reader "li a0, 0" may be
more obvious as to what it does, and if I'm not mistaken that actually
expands to the same underlying insn as the "mv" you use.

> --- a/xen/arch/riscv/setup.c
> +++ b/xen/arch/riscv/setup.c
> @@ -8,6 +8,7 @@
>  #include <public/version.h>
>  
>  #include <asm/early_printk.h>
> +#include <asm/smp.h>
>  #include <asm/traps.h>
>  
>  void arch_get_xen_caps(xen_capabilities_info_t *info)
> @@ -40,6 +41,10 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>  {
>      remove_identity_mapping();
>  
> +    set_processor_id(0);

This isn't really needed, is it? The pcpu_info[] initializer already
installs the necessary 0. Another thing would be if the initializer
set the field to, say, NR_CPUS.

> --- /dev/null
> +++ b/xen/arch/riscv/smp.c
> @@ -0,0 +1,21 @@
> +#include <xen/smp.h>
> +
> +/*
> + * FIXME: make pcpu_info[] dynamically allocated when necessary
> + *        functionality will be ready
> + */
> +/* tp points to one of these per cpu */
> +struct pcpu_info pcpu_info[NR_CPUS] = { { 0, INVALID_HARTID } };

As to the initializer - what about CPUs other than CPU0? Would they
better all have hart_id set to invalid?

Also, as a pretty strong suggestion to avoid excessive churn going
forward: Please consider using dedicated initializers here. IOW
perhaps

struct pcpu_info pcpu_info[NR_CPUS] = { [0 ... NR_CPUS - 1] = {
    .hart_id = INVALID_HARTID,
}};

Yet as said earlier - in addition you likely want to make sure no
two CPUs have (part of) their struct instance in the same cache line.
That won't matter right now, as you have no fields you alter at
runtime, but I expect such fields will appear.
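
A rough sketch of both suggestions combined, assuming a 64-byte cache line (CACHE_LINE_BYTES is a made-up name for illustration) and an illustrative NR_CPUS. The range designator is a GNU C extension.

```c
#include <limits.h>

#define NR_CPUS          4         /* illustrative value only */
#define CACHE_LINE_BYTES 64        /* assumption; the real line size may differ */
#define INVALID_HARTID   ULONG_MAX

/*
 * Aligning the type to the cache line size also rounds sizeof() up to a
 * multiple of it, so no two array elements share a line.
 */
struct pcpu_info {
    unsigned int processor_id; /* Xen CPU id */
    unsigned long hart_id;     /* physical CPU id */
} __attribute__((__aligned__(CACHE_LINE_BYTES)));

/* Every element starts out with an invalid hart id; processor_id stays 0. */
struct pcpu_info pcpu_info[NR_CPUS] = {
    [0 ... NR_CPUS - 1] = { .hart_id = INVALID_HARTID },
};
```

New fields can then be added to the structure without touching the initializer.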

> +void setup_tp(unsigned int cpuid)
> +{
> +    /*
> +     * tp register contains an address of physical cpu information.
> +     * So write physical CPU info of cpuid to tp register.
> +     * It will be used later by get_processor_id() ( look at
> +     * <asm/processor.h> ):
> +     *   #define get_processor_id()    (tp->processor_id)
> +     */
> +    asm volatile ( "mv tp, %0"
> +                   :: "r" ((unsigned long)&pcpu_info[cpuid]) : "memory" );
> +}

So you've opted to still do this in C. Which means there's still a
residual risk of the compiler assuming it can already use tp. What's
the problem with doing this properly in assembly?

As to the memory clobber - in an isolated, non-inline function its
significance is reduced mostly to the case of LTO (which I'm not
sure you even target). Nevertheless probably worth keeping, even if
mainly for documentation purposes. Provided of course this C function
is to remain.

> --- /dev/null
> +++ b/xen/arch/riscv/smpboot.c
> @@ -0,0 +1,8 @@
> +#include <xen/init.h>
> +#include <xen/sections.h>
> +#include <xen/smp.h>
> +
> +void __init smp_set_bootcpu_id(unsigned long boot_cpu_hartid)
> +{
> +    cpuid_to_hartid(0) = boot_cpu_hartid;
> +}

Does this really need its own function?

Jan



* Re: [PATCH v5 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-21 16:06 ` [PATCH v5 5/7] xen/riscv: introduce and initialize SBI RFENCE extension Oleksii Kurochko
@ 2024-08-27 14:19   ` Jan Beulich
  2024-08-28 13:11     ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-27 14:19 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 21.08.2024 18:06, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/sbi.h
> +++ b/xen/arch/riscv/include/asm/sbi.h
> @@ -12,8 +12,41 @@
>  #ifndef __ASM_RISCV_SBI_H__
>  #define __ASM_RISCV_SBI_H__
>  
> +#include <xen/cpumask.h>
> +
>  #define SBI_EXT_0_1_CONSOLE_PUTCHAR		0x1
>  
> +#define SBI_EXT_BASE                    0x10
> +#define SBI_EXT_RFENCE                  0x52464E43
> +
> +/* SBI function IDs for BASE extension */
> +#define SBI_EXT_BASE_GET_SPEC_VERSION   0x0
> +#define SBI_EXT_BASE_GET_IMP_ID         0x1
> +#define SBI_EXT_BASE_GET_IMP_VERSION    0x2
> +#define SBI_EXT_BASE_PROBE_EXT          0x3
> +
> +/* SBI function IDs for RFENCE extension */
> +#define SBI_EXT_RFENCE_REMOTE_FENCE_I           0x0
> +#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA        0x1
> +#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID   0x2
> +#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA       0x3
> +#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID  0x4
> +#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA       0x5
> +#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID  0x6
> +
> +#define SBI_SPEC_VERSION_MAJOR_MASK     0x7F000000
> +#define SBI_SPEC_VERSION_MINOR_MASK     0xffffff

Nit: Perhaps neater / more clear as

#define SBI_SPEC_VERSION_MAJOR_MASK     0x7f000000
#define SBI_SPEC_VERSION_MINOR_MASK     0x00ffffff

> @@ -31,4 +64,34 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
>   */
>  void sbi_console_putchar(int ch);
>  
> +/*
> + * Check underlying SBI implementation has RFENCE
> + *
> + * @return true for supported AND false for not-supported
> + */
> +bool sbi_has_rfence(void);
> +
> +/*
> + * Instructs the remote harts to execute one or more SFENCE.VMA
> + * instructions, covering the range of virtual addresses between
> + * [start_addr, start_addr + size).
> + *
> + * Returns 0 if IPI was sent to all the targeted harts successfully
> + * or negative value if start_addr or size is not valid.
> + *
> + * @hart_mask a cpu mask containing all the target harts.
> + * @param start virtual address start
> + * @param size virtual address range size
> + */
> +int sbi_remote_sfence_vma(const cpumask_t *cpu_mask,
> +                          unsigned long start_addr,
> +                          unsigned long size);

I may have asked before: Why not vaddr_t and size_t respectively?

> @@ -38,7 +51,265 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
>      return ret;
>  }
>  
> +static int sbi_err_map_xen_errno(int err)
> +{
> +    switch ( err )
> +    {
> +    case SBI_SUCCESS:
> +        return 0;
> +    case SBI_ERR_DENIED:
> +        return -EACCES;
> +    case SBI_ERR_INVALID_PARAM:
> +        return -EINVAL;
> +    case SBI_ERR_INVALID_ADDRESS:
> +        return -EFAULT;
> +    case SBI_ERR_NOT_SUPPORTED:
> +        return -EOPNOTSUPP;
> +    case SBI_ERR_FAILURE:
> +        fallthrough;
> +    default:

What's the significance of the "fallthrough" here?

> +static unsigned long sbi_major_version(void)
> +{
> +    return MASK_EXTR(sbi_spec_version, SBI_SPEC_VERSION_MAJOR_MASK);
> +}
> +
> +static unsigned long sbi_minor_version(void)
> +{
> +    return MASK_EXTR(sbi_spec_version, SBI_SPEC_VERSION_MINOR_MASK);
> +}

Both functions return less than 32-bit wide values. Why unsigned long
return types?

> +static long sbi_ext_base_func(long fid)
> +{
> +    struct sbiret ret;
> +
> +    ret = sbi_ecall(SBI_EXT_BASE, fid, 0, 0, 0, 0, 0, 0);
> +
> +    /*
> +     * I wasn't able to find a case in the SBI spec where sbiret.value
> +     * could be negative.
> +     *
> +     * Unfortunately, the spec does not specify the possible values of
> +     * sbiret.value, but based on the description of the SBI function,
> +     * ret.value >= 0 when sbiret.error = 0. The SBI spec specifies only
> +     * the possible values for sbiret.error (<= 0, where 0 is SBI_SUCCESS).
> +     *
> +     * Just to be sure that SBI base extension functions one day won't
> +     * start to return a negative value for sbiret.value when
> +     * sbiret.error < 0 BUG_ON() is added.
> +     */
> +    BUG_ON(ret.value < 0);

I'd be careful here and move this ...

> +    if ( !ret.error )
> +        return ret.value;

... into the if() body here, just to avoid the BUG_ON() pointlessly
triggering ...

> +    else
> +        return ret.error;

... when an error is returned anyway. After all, if an error is returned,
ret.value presumably is (deemed) undefined.
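
A minimal standalone sketch of the arrangement I have in mind. The
struct and the BUG_ON() stand-in are hypothetical simplifications so the
fragment compiles on its own; only the control flow matters.

```c
#include <stdlib.h>

/* Simplified stand-in for struct sbiret. */
struct sbiret_sketch {
    long error;
    long value;
};

/* Stand-in for BUG_ON(): abort the program instead of the hypervisor. */
#define BUG_ON_SKETCH(cond) do { if ( cond ) abort(); } while ( 0 )

static long base_func_result_sketch(struct sbiret_sketch ret)
{
    if ( !ret.error )
    {
        /* Only a successful call has a value worth sanity checking. */
        BUG_ON_SKETCH(ret.value < 0);
        return ret.value;
    }

    /* On error ret.value is treated as undefined and never inspected. */
    return ret.error;
}
```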

> +static int sbi_rfence_v02_real(unsigned long fid,
> +                               unsigned long hmask, unsigned long hbase,
> +                               unsigned long start, unsigned long size,

Again vaddr_t / size_t perhaps? And then again elsewhere as well.

> +                               unsigned long arg4)
> +{
> +    struct sbiret ret = {0};
> +    int result = 0;
> +
> +    switch ( fid )
> +    {
> +    case SBI_EXT_RFENCE_REMOTE_FENCE_I:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                        0, 0, 0, 0);
> +        break;
> +
> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
> +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                        start, size, 0, 0);
> +        break;
> +
> +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID:
> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                        start, size, arg4, 0);
> +        break;
> +
> +    default:
> +        printk("%s: unknown function ID [%lu]\n",

I wonder how useful the logging in decimal of (perhaps large) unknown
values is.

> +               __func__, fid);
> +        result = -EINVAL;
> +        break;
> +    };
> +
> +    if ( ret.error )
> +    {
> +        result = sbi_err_map_xen_errno(ret.error);
> +        printk("%s: hbase=%lu hmask=%#lx failed (error %d)\n",
> +               __func__, hbase, hmask, result);

Considering that sbi_err_map_xen_errno() may lose information, I'd
recommend logging ret.error here.

> +static int cf_check sbi_rfence_v02(unsigned long fid,
> +                                   const cpumask_t *cpu_mask,
> +                                   unsigned long start, unsigned long size,
> +                                   unsigned long arg4, unsigned long arg5)
> +{
> +    unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
> +    int result;
> +
> +    /*
> +     * hart_mask_base can be set to -1 to indicate that hart_mask can be
> +     * ignored and all available harts must be considered.
> +     */
> +    if ( !cpu_mask )
> +        return sbi_rfence_v02_real(fid, 0UL, -1UL, start, size, arg4);
> +
> +    for_each_cpu ( cpuid, cpu_mask )
> +    {
> +        /*
> +        * Hart IDs might not necessarily be numbered contiguously in
> +        * a multiprocessor system, but at least one hart must have a
> +        * hart ID of zero.

Does this latter fact matter here in any way?

> +        * This means that it is possible for the hart ID mapping to look like:
> +        *  0, 1, 3, 65, 66, 69
> +        * In such cases, more than one call to sbi_rfence_v02_real() will be
> +        * needed, as a single hmask can only cover sizeof(unsigned long) CPUs:
> +        *  1. sbi_rfence_v02_real(hmask=0b1011, hbase=0)
> +        *  2. sbi_rfence_v02_real(hmask=0b1011, hbase=65)
> +        *
> +        * The algorithm below tries to batch as many harts as possible before
> +        * making an SBI call. However, batching may not always be possible.
> +        * For example, consider the hart ID mapping:
> +        *   0, 64, 1, 65, 2, 66

Just to mention it: Batching is also possible here: First (0,1,2), then
(64,65,66). It just requires a different approach. Whether switching is
worthwhile depends on how numbering is done on real world systems.
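
To illustrate what such a different approach could look like, here is a
rough standalone sketch. It is not a proposal for the patch: the array
based interface, the names and the MAX_HARTS_SKETCH cap are all
hypothetical. It repeatedly takes the lowest hart ID not yet covered as
the base and sweeps up every remaining ID that fits in the same window.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)
#define MAX_HARTS_SKETCH 64 /* hypothetical cap, keeps the sketch simple */

struct rfence_call_sketch {
    unsigned long hmask, hbase;
};

/*
 * Group hart IDs into (hmask, hbase) pairs. Returns the number of pairs
 * written to out[] (at most n), or 0 if n exceeds the cap.
 */
static size_t batch_harts_sketch(const unsigned long *hartids, size_t n,
                                 struct rfence_call_sketch *out)
{
    bool done[MAX_HARTS_SKETCH] = { false };
    size_t ncalls = 0, left;

    if ( n > MAX_HARTS_SKETCH )
        return 0;

    for ( left = n; left; ncalls++ )
    {
        unsigned long hbase = ULONG_MAX, hmask = 0;
        size_t i;

        /* The lowest hart ID not yet covered becomes the window base. */
        for ( i = 0; i < n; i++ )
            if ( !done[i] && hartids[i] < hbase )
                hbase = hartids[i];

        /* Sweep up all remaining harts within BITS_PER_LONG of the base. */
        for ( i = 0; i < n; i++ )
            if ( !done[i] && hartids[i] - hbase < BITS_PER_LONG_SKETCH )
            {
                hmask |= 1UL << (hartids[i] - hbase);
                done[i] = true;
                left--;
            }

        out[ncalls].hmask = hmask;
        out[ncalls].hbase = hbase;
    }

    return ncalls;
}
```

The repeated scans make this quadratic in the number of harts, which is
one reason why the choice really depends on real world numbering.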

> +        */
> +        hartid = cpuid_to_hartid(cpuid);
> +        if ( hmask )
> +        {
> +            if ( hartid + BITS_PER_LONG <= htop ||
> +                 hbase + BITS_PER_LONG <= hartid )
> +            {
> +                result = sbi_rfence_v02_real(fid, hmask, hbase,
> +                                             start, size, arg4);
> +                if ( result )
> +                    return result;
> +                hmask = 0;
> +            }
> +            else if ( hartid < hbase )
> +            {
> +                /* shift the mask to fit lower hartid */
> +                hmask <<= hbase - hartid;
> +                hbase = hartid;
> +            }
> +        }
> +
> +        if ( !hmask )
> +        {
> +            hbase = hartid;
> +            htop = hartid;
> +        }
> +        else if ( hartid > htop )
> +            htop = hartid;
> +
> +        hmask |= BIT(hartid - hbase, UL);
> +    }
> +
> +    if ( hmask )
> +    {
> +        result = sbi_rfence_v02_real(fid, hmask, hbase,
> +                                     start, size, arg4);
> +        if ( result )
> +            return result;

It's a little odd to have this here, rather than ...

> +    }
> +
> +    return 0;
> +}

... making these two a uniform return path. If you wanted you
could even replace the return in the loop with a simple break,
merely requiring the clearing of hmask to come first.
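
Roughly like the following standalone sketch, where the cpumask is
replaced by a plain array and sbi_rfence_v02_real() by a hypothetical
stub, purely so the fragment can be compiled and tried on its own:

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stub: counts calls, fails when hbase == fail_hbase. */
static unsigned long fail_hbase = ULONG_MAX;
static unsigned int calls_made;

static int rfence_real_stub(unsigned long hmask, unsigned long hbase)
{
    (void)hmask;
    calls_made++;
    return hbase == fail_hbase ? -EINVAL : 0;
}

static int rfence_sketch(const unsigned long *hartids, size_t n)
{
    unsigned long hmask = 0, hbase = 0, htop = 0;
    int result = 0;
    size_t i;

    for ( i = 0; i < n; i++ )
    {
        unsigned long hartid = hartids[i];

        if ( hmask )
        {
            if ( hartid + BITS_PER_LONG_SKETCH <= htop ||
                 hbase + BITS_PER_LONG_SKETCH <= hartid )
            {
                unsigned long mask = hmask;

                /* Clear first, so the flush after the loop is skipped. */
                hmask = 0;
                result = rfence_real_stub(mask, hbase);
                if ( result )
                    break;
            }
            else if ( hartid < hbase )
            {
                /* Shift the mask to fit the lower hart ID. */
                hmask <<= hbase - hartid;
                hbase = hartid;
            }
        }

        if ( !hmask )
            hbase = htop = hartid;
        else if ( hartid > htop )
            htop = hartid;

        hmask |= 1UL << (hartid - hbase);
    }

    /* Single exit: flush whatever is still batched, then return. */
    if ( hmask )
        result = rfence_real_stub(hmask, hbase);

    return result;
}
```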

> +static int (* __ro_after_init sbi_rfence)(unsigned long fid,
> +                                          const cpumask_t *cpu_mask,
> +                                          unsigned long start,
> +                                          unsigned long size,
> +                                          unsigned long arg4,
> +                                          unsigned long arg5);
> +
> +int sbi_remote_sfence_vma(const cpumask_t *cpu_mask,
> +                          unsigned long start_addr,

To match other functions, perhaps just "start"?

> +int sbi_probe_extension(long extid)
> +{
> +    struct sbiret ret;
> +
> +    ret = sbi_ecall(SBI_EXT_BASE, SBI_EXT_BASE_PROBE_EXT, extid,
> +                    0, 0, 0, 0, 0);
> +    if ( !ret.error && ret.value )
> +        return ret.value;
> +
> +    return -EOPNOTSUPP;

Any reason not to use sbi_err_map_xen_errno() here?

Jan



* Re: [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-21 16:06 ` [PATCH v5 6/7] xen/riscv: page table handling Oleksii Kurochko
@ 2024-08-27 15:00   ` Jan Beulich
  2024-08-28 16:11     ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-27 15:00 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 21.08.2024 18:06, Oleksii Kurochko wrote:
> Implement map_pages_to_xen() which requires several
> functions to manage page tables and entries:
> - pt_update()
> - pt_mapping_level()
> - pt_update_entry()
> - pt_next_level()
> - pt_check_entry()
> 
> To support these operations, add functions for creating,
> mapping, and unmapping Xen tables:
> - create_table()
> - map_table()
> - unmap_table()
> 
> Introduce internal macros starting with PTE_* for convenience.
> These macros closely resemble PTE bits, with the exception of
> PTE_SMALL, which indicates that 4KB is needed.

What macros are you talking about here? Is this partially stale, as
only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being
added?

> --- a/xen/arch/riscv/include/asm/flushtlb.h
> +++ b/xen/arch/riscv/include/asm/flushtlb.h
> @@ -5,12 +5,24 @@
>  #include <xen/bug.h>
>  #include <xen/cpumask.h>
>  
> +#include <asm/sbi.h>
> +
>  /* Flush TLB of local processor for address va. */
>  static inline void flush_tlb_one_local(vaddr_t va)
>  {
>      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
>  }
>  
> +/*
> + * Flush a range of VA's hypervisor mappings from the TLB of all
> + * processors in the inner-shareable domain.
> + */

Isn't inner-shareable an Arm term? Don't you simply mean "all" here?

> @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
>      return p.pte & PTE_VALID;
>  }
>  
> +inline bool pte_is_table(const pte_t p)
> +{
> +    return ((p.pte & (PTE_VALID |
> +                      PTE_READABLE |
> +                      PTE_WRITABLE |
> +                      PTE_EXECUTABLE)) == PTE_VALID);
> +}

To what extent is the READABLE check valid here? You (imo correctly) ...

> +static inline bool pte_is_mapping(const pte_t p)
> +{
> +    return (p.pte & PTE_VALID) &&
> +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
> +}

... don't consider this bit here.

> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> @@ -164,6 +164,7 @@
>  #define SSTATUS_SD			SSTATUS64_SD
>  #define SATP_MODE			SATP64_MODE
>  #define SATP_MODE_SHIFT			SATP64_MODE_SHIFT
> +#define SATP_PPN_MASK			_UL(0x00000FFFFFFFFFFF)
>  
>  #define HGATP_PPN			HGATP64_PPN
>  #define HGATP_VMID_SHIFT		HGATP64_VMID_SHIFT

This looks odd, padding-wise, but that's because hard tabs are being
used here. Is that intentional?

> --- /dev/null
> +++ b/xen/arch/riscv/pt.c
> @@ -0,0 +1,420 @@
> +#include <xen/bug.h>
> +#include <xen/domain_page.h>
> +#include <xen/errno.h>
> +#include <xen/mm.h>
> +#include <xen/mm-frame.h>
> +#include <xen/pmap.h>
> +#include <xen/spinlock.h>
> +
> +#include <asm/flushtlb.h>
> +#include <asm/page.h>
> +
> +static inline const mfn_t get_root_page(void)
> +{
> +    paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> +
> +    return maddr_to_mfn(root_maddr);
> +}
> +
> +/* Sanity check of the entry. */
> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
> +{
> +    /*
> +     * See the comment about the possible combination of (mfn, flags) in
> +     * the comment above pt_update().
> +     */
> +
> +    /* Sanity check when modifying an entry. */
> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> +    {
> +        /* We don't allow modifying an invalid entry. */
> +        if ( !pte_is_valid(entry) )
> +        {
> +            printk("Modifying invalid entry is not allowed.\n");

Perhaps all of these printk()s should be dprintk()? And not have a full
stop?

> +            return false;
> +        }
> +
> +        /* We don't allow modifying a table entry */
> +        if ( pte_is_table(entry) )
> +        {
> +            printk("Modifying a table entry is not allowed.\n");
> +            return false;
> +        }
> +    }
> +    /* Sanity check when inserting a mapping */
> +    else if ( flags & PTE_VALID )
> +    {
> +        /* We should be here with a valid MFN. */
> +        ASSERT(!mfn_eq(mfn, INVALID_MFN));

This is odd to have here, considering the if() further up.

> +        /*
> +         * We don't allow replacing any valid entry.
> +         *
> +         * Note that the function pt_update() relies on this
> +         * assumption and will skip the TLB flush (when Svvptc
> +         * extension will be ratified). The function will need
> +         * to be updated if the check is relaxed.
> +         */
> +        if ( pte_is_valid(entry) )
> +        {
> +            if ( pte_is_mapping(entry) )
> +                printk("Changing MFN for a valid entry is not allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
> +                       mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
> +            else
> +                printk("Trying to replace a table with a mapping.\n");
> +            return false;
> +        }
> +    }
> +    /* Sanity check when removing a mapping. */
> +    else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )

The PTE_VALID part of the check is pointless considering the earlier
if(). I guess you may want to have it for doc purposes ...

Since further up you're using "else if ( flags & PTE_VALID )" imo
here you want to use "else if ( !(flags & ...) )".

> +    {
> +        /* We should be here with an invalid MFN. */
> +        ASSERT(mfn_eq(mfn, INVALID_MFN));
> +
> +        /* We don't allow removing a table */
> +        if ( pte_is_table(entry) )
> +        {
> +            printk("Removing a table is not allowed.\n");
> +            return false;
> +        }

Is this restriction temporary?

> +    }
> +    /* Sanity check when populating the page-table. No check so far. */
> +    else
> +    {
> +        ASSERT(flags & PTE_POPULATE);

This again is redundant with earlier if() conditions.
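
To show what I mean by the redundancy, here is a small standalone sketch
of the four cases as a plain if/else chain, with each condition tested
only once. The enum, the function name and the PTE_POPULATE value are
hypothetical; PTE_VALID being bit 0 follows the RISC-V PTE format.

```c
#include <stdbool.h>

#define PTE_VALID_SKETCH    (1U << 0)
#define PTE_POPULATE_SKETCH (1U << 11) /* hypothetical software flag */

enum pt_op_sketch {
    PT_OP_MODIFY,   /* PTE_VALID set, mfn == INVALID_MFN */
    PT_OP_INSERT,   /* PTE_VALID set, mfn != INVALID_MFN */
    PT_OP_REMOVE,   /* neither PTE_VALID nor PTE_POPULATE set */
    PT_OP_POPULATE, /* PTE_VALID clear, PTE_POPULATE set */
};

static enum pt_op_sketch classify_sketch(bool mfn_is_invalid,
                                         unsigned int flags)
{
    if ( flags & PTE_VALID_SKETCH )
        return mfn_is_invalid ? PT_OP_MODIFY : PT_OP_INSERT;

    /* PTE_VALID is known to be clear from here on. */
    if ( !(flags & PTE_POPULATE_SKETCH) )
        return PT_OP_REMOVE;

    /* Only PTE_POPULATE is left; no further check is needed. */
    return PT_OP_POPULATE;
}
```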

> +#define XEN_TABLE_MAP_FAILED 0
> +#define XEN_TABLE_SUPER_PAGE 1
> +#define XEN_TABLE_NORMAL 2
> +
> +/*
> + * Take the currently mapped table, find the corresponding entry,
> + * and map the next table, if available.
> + *
> + * The alloc_tbl parameters indicates whether intermediate tables should
> + * be allocated when not present.
> + *
> + * Return values:
> + *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry
> + *  was empty, or allocating a new page failed.
> + *  XEN_TABLE_NORMAL: next level or leaf mapped normally
> + *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
> + */
> +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned int offset)

Having the boolean first is unusual, but well - it's your choice.

> +{
> +    pte_t *entry;
> +    int ret;
> +    mfn_t mfn;
> +
> +    entry = *table + offset;
> +
> +    if ( !pte_is_valid(*entry) )
> +    {
> +        if ( alloc_tbl )
> +            return XEN_TABLE_MAP_FAILED;

Is this condition meant to be inverted?

> +        ret = create_table(entry);
> +        if ( ret )
> +            return XEN_TABLE_MAP_FAILED;

You don't really use "ret". Why not omit the local variable, all the
more so since its scope is too wide?

> +/* Update an entry at the level @target. */
> +static int pt_update_entry(mfn_t root, unsigned long virt,
> +                           mfn_t mfn, unsigned int target,
> +                           unsigned int flags)
> +{
> +    int rc;
> +    unsigned int level = HYP_PT_ROOT_LEVEL;
> +    pte_t *table;
> +    /*
> +     * The intermediate page table shouldn't be allocated when MFN isn't
> +     * valid and we are not populating page table.
> +     * This means we either modify permissions or remove an entry, or
> +     * inserting brand new entry.
> +     *
> +     * See the comment above pt_update() for an additional explanation about
> +     * combinations of (mfn, flags).
> +    */
> +    bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE);

Is this meant to be inverted, too (to actually match variable name and
comment)?
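
Assuming the answer is "yes" in both places, a standalone sketch of how
name, comment and logic would line up. All identifiers and the flag
value are hypothetical simplifications, not the patch's definitions.

```c
#include <stdbool.h>

#define PTE_POPULATE_SKETCH (1U << 11) /* hypothetical software flag */

enum { TABLE_MAP_FAILED_SKETCH, TABLE_CREATED_SKETCH, TABLE_PRESENT_SKETCH };

/* Allocate intermediate tables only when inserting or populating. */
static bool want_alloc_tbl_sketch(bool mfn_is_invalid, unsigned int flags)
{
    return !mfn_is_invalid || (flags & PTE_POPULATE_SKETCH);
}

/* Decision taken for the entry leading to the next level table. */
static int next_level_action_sketch(bool alloc_tbl, bool entry_valid)
{
    if ( entry_valid )
        return TABLE_PRESENT_SKETCH;

    /* Nothing there: fail unless the caller asked for allocation. */
    return alloc_tbl ? TABLE_CREATED_SKETCH : TABLE_MAP_FAILED_SKETCH;
}
```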

> +    pte_t pte, *entry;
> +
> +    /* convenience aliases */
> +    DECLARE_OFFSETS(offsets, virt);
> +
> +    table = map_table(root);
> +    for ( ; level > target; level-- )
> +    {
> +        rc = pt_next_level(alloc_tbl, &table, offsets[level]);
> +        if ( rc == XEN_TABLE_MAP_FAILED )
> +        {
> +            rc = 0;
> +
> +            /*
> +             * We are here because pt_next_level has failed to map
> +             * the intermediate page table (e.g the table does not exist
> +             * and the pt is read-only). It is a valid case when

I'm sorry, but there's still a "read-only" left here.

> +             * removing a mapping as it may not exist in the page table.
> +             * In this case, just ignore it.
> +             */
> +            if ( flags & PTE_VALID )
> +            {
> +                printk("%s: Unable to map level %u\n", __func__, level);
> +                rc = -ENOENT;
> +            }
> +
> +            goto out;
> +        }
> +        else if ( rc != XEN_TABLE_NORMAL )

No need for "else" when the earlier if() ends in "goto".

> +            break;
> +    }
> +
> +    if ( level != target )
> +    {
> +        printk("%s: Shattering superpage is not supported\n", __func__);
> +        rc = -EOPNOTSUPP;
> +        goto out;
> +    }
> +
> +    entry = table + offsets[level];
> +
> +    rc = -EINVAL;
> +    if ( !pt_check_entry(*entry, mfn, flags) )
> +        goto out;
> +
> +    /* We are removing the page */
> +    if ( !(flags & PTE_VALID) )
> +        memset(&pte, 0x00, sizeof(pte));
> +    else
> +    {
> +        /* We are inserting a mapping => Create new pte. */
> +        if ( !mfn_eq(mfn, INVALID_MFN) )
> +            pte = pte_from_mfn(mfn, PTE_VALID);
> +        else /* We are updating the permission => Copy the current pte. */
> +            pte = *entry;
> +
> +        /* update permission according to the flags */
> +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;

When updating an entry, don't you also need to clear (some of) the flags?
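
As a standalone illustration of the concern: with a plain OR, a
read-write entry can never be turned into a read-only one. The sketch
below uses a bare unsigned long in place of pte_t and hypothetical macro
names; the R/W/X bit positions are those of the RISC-V PTE format. The
accessed/dirty handling of the patch is deliberately left out.

```c
/* RISC-V PTE permission bits: R = bit 1, W = bit 2, X = bit 3. */
#define PTE_R_SKETCH   (1UL << 1)
#define PTE_W_SKETCH   (1UL << 2)
#define PTE_X_SKETCH   (1UL << 3)
#define PTE_RWX_SKETCH (PTE_R_SKETCH | PTE_W_SKETCH | PTE_X_SKETCH)

static unsigned long pte_set_perms_sketch(unsigned long pte,
                                          unsigned long flags)
{
    /* Drop the old permissions first, so bits can also be removed. */
    pte &= ~PTE_RWX_SKETCH;
    pte |= flags & PTE_RWX_SKETCH;

    return pte;
}
```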

> +/* Return the level where mapping should be done */
> +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned long nr,
> +                            unsigned int flags)
> +{
> +    unsigned int level = 0;
> +    unsigned long mask;
> +    unsigned int i;
> +
> +    /* Use blocking mapping unless the caller requests 4K mapping */
> +    if ( unlikely(flags & PTE_SMALL) )
> +        return level;
> +
> +    /*
> +     * Don't take into account the MFN when removing mapping (i.e
> +     * MFN_INVALID) to calculate the correct target order.
> +     *
> +     * `vfn` and `mfn` must be both superpage aligned.
> +     * They are or-ed together and then checked against the size of
> +     * each level.
> +     *
> +     * `left` is not included and checked separately to allow
> +     * superpage mapping even if it is not properly aligned (the
> +     * user may have asked to map 2MB + 4k).

What is this about? There's nothing named "left" here.
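
For reference, my reading of what the code does, with "nr" playing the
role the comment ascribes to "left". This is a standalone sketch under
assumptions: Sv39 (root level 2), 512-entry tables, plain integers in
place of mfn_t, and 0 passed as the MFN when there is none to consider.

```c
#define PT_LEVEL_ORDER_SKETCH(lvl) ((lvl) * 9U) /* 512 entries per table */
#define PT_ROOT_LEVEL_SKETCH 2U                 /* assumes Sv39 */

static unsigned int mapping_level_sketch(unsigned long vfn,
                                         unsigned long mfn,
                                         unsigned long nr)
{
    /* OR-ing lets one test check the alignment of both frame numbers. */
    unsigned long mask = vfn | mfn;
    unsigned int i;

    for ( i = PT_ROOT_LEVEL_SKETCH; i != 0; i-- )
    {
        unsigned long pages = 1UL << PT_LEVEL_ORDER_SKETCH(i);

        /* Aligned to this level and enough pages left to fill it? */
        if ( !(mask & (pages - 1)) && nr >= pages )
            return i;
    }

    return 0;
}
```

With vfn = 0x200, mfn = 0x400 and nr = 513 (the "2MB + 4k" case) this
yields level 1 for the first chunk, leaving the trailing page to a later
call.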

> +     */
> +    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
> +    mask |= vfn;
> +
> +    for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- )
> +    {
> +        if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) &&
> +             (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) )
> +        {
> +            level = i;
> +            break;
> +        }
> +    }
> +
> +    return level;
> +}
> +
> +static DEFINE_SPINLOCK(xen_pt_lock);

Another largely meaningless xen_ prefix?

> +/*
> + * If `mfn` equals `INVALID_MFN`, it indicates that the following page table
> + * update operation might be related to either populating the table (
> + * PTE_POPULATE will be set additionally), destroying a mapping, or modifying
> + * an existing mapping.

And the latter two are distinguished by? PTE_VALID?

> + * If `mfn` is valid and flags has PTE_VALID bit set then it means that
> + * inserting will be done.
> + */

What about mfn != INVALID_MFN and PTE_VALID clear? Also note that "`mfn` is
valid" isn't the same as "mfn != INVALID_MFN". You want to be precise here,
to avoid confusion later on. (I say that knowing that we're still fighting
especially shadow paging code on x86 not having those properly separated.)

> +static int pt_update(unsigned long virt,
> +                     mfn_t mfn,
> +                     unsigned long nr_mfns,
> +                     unsigned int flags)
> +{
> +    int rc = 0;
> +    unsigned long vfn = virt >> PAGE_SHIFT;

Please don't open-code e.g. PFN_DOWN().

> +    unsigned long left = nr_mfns;
> +
> +    const mfn_t root = get_root_page();
> +
> +    /*
> +     * It is bad idea to have mapping both writeable and
> +     * executable.
> +     * When modifying/creating mapping (i.e PTE_VALID is set),
> +     * prevent any update if this happen.
> +     */
> +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) )

Seeing them in use, I wonder about the naming of those PTE_?_MASK()
macros. Along with the lhs, why not simply (flags & PTE_...)?

Jan



* Re: [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic()
  2024-08-27 10:06   ` Jan Beulich
@ 2024-08-28  9:21     ` oleksii.kurochko
  2024-08-28  9:42       ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-28  9:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-27 at 12:06 +0200, Jan Beulich wrote:
> On 21.08.2024 18:06, Oleksii Kurochko wrote:
> > In Xen, memory-ordered atomic operations are not necessary,
> 
> This is an interesting statement.
I looked at the definition of build_atomic_{write,read}() for other
architectures and didn't find any additional memory-ordered primitives
such as fences.

> I'd like to suggest that you at least
> limit it to the two constructs in question, rather than stating this
> globally for everything.
I am not sure I understand what "the two constructs" refers to. Could you
please clarify?

> 
> > based on {read,write}_atomic() implementations for other
> > architectures.
> > Therefore, {read,write}{b,w,l,q}_cpu() can be used instead of
> > {read,write}{b,w,l,q}(), allowing the caller to decide if
> > additional
> > fences should be applied before or after {read,write}_atomic().
> > 
> > Change the declaration of _write_atomic() to accept a 'volatile
> > void *'
> > type for the 'x' argument instead of 'unsigned long'.
> > This prevents compilation errors such as:
> > 1."discards 'volatile' qualifier from pointer target type," which
> > occurs
> >   due to the initialization of a volatile pointer,
> >   e.g., `volatile uint8_t *ptr = p;` in _add_sized().
> 
> I don't follow you here.
This issue started occurring after the change mentioned in point 2
below.

I initially provided an incorrect explanation for the compilation error
mentioned above. Let me correct that now and update the commit message
in the next patch version. The reason for this error is that after the
_write_atomic() prototype was updated from _write_atomic(..., unsigned
long, ...) to _write_atomic(..., void *x, ...), the write_atomic()
macro contains x_, which is of type 'volatile uintX_t' because ptr has
the type 'volatile uintX_t *'.

Therefore, _write_atomic() should have its second argument declared as
volatile const void *. Alternatively, we can consider updating
write_atomic() to:
   #define write_atomic(p, x)                              \
   ({                                                      \
       typeof(*(p)) x_ = (x);                              \
       _write_atomic(p, (const void *)&x_, sizeof(*(p)));  \
   })
Would this be a better approach?

> It's the other argument of write_atomic() that
> has ptr passed there.
This is probably what I described above. I am not sure I understood this
sentence correctly, though.

> 
> > 2."incompatible type for argument 2 of '_write_atomic'," which can
> > occur
> >   when calling write_pte(), where 'x' is of type pte_t rather than
> >   unsigned long.
> 
> How's this related to the change at hand? That isn't different ahead
> of
> this change, is it?
This is not directly related to the current change, which is why I
decided to add a sentence about write_pte().

Since write_pte(pte_t *p, pte_t pte) uses write_atomic(), and the
argument types are pte_t * and pte respectively, we encounter a
compilation issue in write_atomic() because:

_write_atomic() expects the second argument to be of type unsigned
long, leading to a compilation error like "incompatible type for
argument 2 of '_write_atomic'."
I considered defining write_pte() as write_atomic(p, pte.pte), but this
would fail at 'typeof(*(p)) x_ = (x);' and result in a compilation
error 'invalid initializer' or something like that.

It might be better to update write_pte() to:
   /* Write a pagetable entry. */
   static inline void write_pte(pte_t *p, pte_t pte)
   {
       write_atomic((unsigned long *)p, pte.pte);
   }
Then, we wouldn't need to modify the definition of write_atomic() or
change the type of the second argument of _write_atomic().
Would it be better?

> 
> > --- a/xen/arch/riscv/include/asm/atomic.h
> > +++ b/xen/arch/riscv/include/asm/atomic.h
> > @@ -31,21 +31,17 @@
> >  
> >  void __bad_atomic_size(void);
> >  
> > -/*
> > - * Legacy from Linux kernel. For some reason they wanted to have
> > ordered
> > - * read/write access. Thereby read* is used instead of read*_cpu()
> > - */
> >  static always_inline void read_atomic_size(const volatile void *p,
> >                                             void *res,
> >                                             unsigned int size)
> >  {
> >      switch ( size )
> >      {
> > -    case 1: *(uint8_t *)res = readb(p); break;
> > -    case 2: *(uint16_t *)res = readw(p); break;
> > -    case 4: *(uint32_t *)res = readl(p); break;
> > +    case 1: *(uint8_t *)res = readb_cpu(p); break;
> > +    case 2: *(uint16_t *)res = readw_cpu(p); break;
> > +    case 4: *(uint32_t *)res = readl_cpu(p); break;
> >  #ifndef CONFIG_RISCV_32
> > -    case 8: *(uint32_t *)res = readq(p); break;
> > +    case 8: *(uint32_t *)res = readq_cpu(p); break;
> >  #endif
> >      default: __bad_atomic_size(); break;
> >      }
> > @@ -58,15 +54,16 @@ static always_inline void
> > read_atomic_size(const volatile void *p,
> >  })
> >  
> >  static always_inline void _write_atomic(volatile void *p,
> > -                                       unsigned long x, unsigned
> > int size)
> > +                                        volatile void *x,
> 
> If this really needs to become a pointer, it ought to also be
> pointer-
> to-const. Otherwise it is yet more confusing which operand is which.
Sure. I will add 'const' if the prototype of _write_atomic() won't use
'unsigned long' for x argument.

> 
> > +                                        unsigned int size)
> >  {
> >      switch ( size )
> >      {
> > -    case 1: writeb(x, p); break;
> > -    case 2: writew(x, p); break;
> > -    case 4: writel(x, p); break;
> > +    case 1: writeb_cpu(*(uint8_t *)x, p); break;
> > +    case 2: writew_cpu(*(uint16_t *)x, p); break;
> > +    case 4: writel_cpu(*(uint32_t *)x, p); break;
> >  #ifndef CONFIG_RISCV_32
> > -    case 8: writeq(x, p); break;
> > +    case 8: writeq_cpu(*(uint64_t *)x, p); break;
> 
> Of course you may not cast away const-ness then. You'd also be casting
> away volatile-ness, but (as per above) I question the need for
> volatile
> on x.
I added an explanation about this earlier in the message. Let's discuss
whether volatile is needed there or not.

If I should not cast away the const and volatile qualifiers, then I
need to update the prototypes of writeX_cpu()?

~ Oleksii



* Re: [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic()
  2024-08-28  9:21     ` oleksii.kurochko
@ 2024-08-28  9:42       ` Jan Beulich
  2024-08-29  8:52         ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-28  9:42 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 28.08.2024 11:21, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-27 at 12:06 +0200, Jan Beulich wrote:
>> On 21.08.2024 18:06, Oleksii Kurochko wrote:
>>> In Xen, memory-ordered atomic operations are not necessary,
>>
>> This is an interesting statement.
> I looked at the definition of build_atomic_{write,read}() for other
> architectures and didn't find any additional memory-ordered primitives
> such as fences.
> 
>> I'd like to suggest that you at least
>> limit it to the two constructs in question, rather than stating this
>> globally for everything.
> I am not sure I understand what "the two constructs" refers to. Could
> you please clarify?

{read,write}_atomic() (the statement in your description is, after all,
not obviously limited to just those two, yet I understand you mean to
say what you say only for them)

>>> based on {read,write}_atomic() implementations for other
>>> architectures.
>>> Therefore, {read,write}{b,w,l,q}_cpu() can be used instead of
>>> {read,write}{b,w,l,q}(), allowing the caller to decide if
>>> additional
>>> fences should be applied before or after {read,write}_atomic().
>>>
>>> Change the declaration of _write_atomic() to accept a 'volatile
>>> void *'
>>> type for the 'x' argument instead of 'unsigned long'.
>>> This prevents compilation errors such as:
>>> 1."discards 'volatile' qualifier from pointer target type," which
>>> occurs
>>>   due to the initialization of a volatile pointer,
>>>   e.g., `volatile uint8_t *ptr = p;` in _add_sized().
>>
>> I don't follow you here.
> This issue started occurring after the change mentioned in point 2
> below.
> 
> I initially provided an incorrect explanation for the compilation error
> mentioned above. Let me correct that now and update the commit message
> in the next patch version. The reason for this error is that after the
> _write_atomic() prototype was updated from _write_atomic(..., unsigned
> long, ...) to _write_atomic(..., void *x, ...), the write_atomic()
> macro contains x_, which is of type 'volatile uintX_t' because ptr has
> the type 'volatile uintX_t *'.

While there's no "ptr" in write_atomic(), I think I see what you mean. Yet
at the same time Arm - having a similar construct - gets away without
volatile. Perhaps this wants modelling after read_atomic() then, using a
union?
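
A rough standalone sketch of what I mean, to be taken as illustration
only. The names are hypothetical, plain volatile stores stand in for
write{b,w,l,q}_cpu(), and abort() stands in for __bad_atomic_size().
The point is that the value travels through the char array member, so
no qualifier needs casting away even when *(p) is volatile.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static inline void write_atomic_size_sketch(volatile void *p,
                                            const void *x,
                                            unsigned int size)
{
    switch ( size )
    {
    case 1: { uint8_t v;  memcpy(&v, x, 1); *(volatile uint8_t *)p = v;  break; }
    case 2: { uint16_t v; memcpy(&v, x, 2); *(volatile uint16_t *)p = v; break; }
    case 4: { uint32_t v; memcpy(&v, x, 4); *(volatile uint32_t *)p = v; break; }
    case 8: { uint64_t v; memcpy(&v, x, 8); *(volatile uint64_t *)p = v; break; }
    default: abort();
    }
}

/* Relies on the GNU __typeof__ extension, like the code under review. */
#define write_atomic_sketch(p, x)                                   \
    do {                                                            \
        union {                                                     \
            __typeof__(*(p)) val;                                   \
            char c[sizeof(*(p))];                                   \
        } x_;                                                       \
        x_.val = (x);                                               \
        write_atomic_size_sketch(p, x_.c, sizeof(*(p)));            \
    } while ( 0 )
```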

> Therefore, _write_atomic() should have its second argument declared as
> volatile const void *. Alternatively, we can consider updating
> write_atomic() to:
>    #define write_atomic(p, x)                              \
>    ({                                                      \
>        typeof(*(p)) x_ = (x);                              \
>        _write_atomic(p, (const void *)&x_, sizeof(*(p)));  \
>    })
> Would this be a better approach?

As with const, you should also avoid casting away volatile whenever possible.

>>> 2."incompatible type for argument 2 of '_write_atomic'," which can
>>> occur
>>>   when calling write_pte(), where 'x' is of type pte_t rather than
>>>   unsigned long.
>>
>> How's this related to the change at hand? That isn't different ahead
>> of
>> this change, is it?
> This is not directly related to the current change, which is why I
> decided to add a sentence about write_pte().
> 
> Since write_pte(pte_t *p, pte_t pte) uses write_atomic(), and the
> argument types are pte_t * and pte respectively, we encounter a
> compilation issue in write_atomic() because:
> 
> _write_atomic() expects the second argument to be of type unsigned
> long, leading to a compilation error like "incompatible type for
> argument 2 of '_write_atomic'."
> I considered defining write_pte() as write_atomic(p, pte.pte), but this
> would fail at 'typeof(*(p)) x_ = (x);' and result in a compilation
> error 'invalid initializer' or something like that.
> 
> It might be better to update write_pte() to:
>    /* Write a pagetable entry. */
>    static inline void write_pte(pte_t *p, pte_t pte)
>    {
>        write_atomic((unsigned long *)p, pte.pte);
>    }
> Then, we wouldn't need to modify the definition of write_atomic() or
> change the type of the second argument of _write_atomic().
> Would it be better?

As said numerous times before: Whenever you can get away without a cast,
you should avoid the cast. Here:

static inline void write_pte(pte_t *p, pte_t pte)
{
    write_atomic(&p->pte, pte.pte);
}

That's one of the possible options, yes. Yet, like Arm has it, you may
actually want the capability to read/write non-scalar types. If so,
adjustments to write_atomic() are necessary, yet as indicated before:
Please keep such entirely independent changes separate.
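
A minimal, self-contained sketch of the cast-free variant shown above. The
write_atomic()/read_atomic() macros here are stand-ins built on the GCC/Clang
__atomic builtins and are not Xen's implementation; the pte_t layout with a
single scalar member named pte is taken from the discussion.

```c
/* Assumed layout: pte_t wraps a single scalar, as in the discussion. */
typedef struct {
    unsigned long pte;
} pte_t;

/* Stand-ins (assumption): scalar-only helpers built on compiler builtins. */
#define write_atomic(p, x) __atomic_store_n(p, x, __ATOMIC_RELAXED)
#define read_atomic(p)     __atomic_load_n(p, __ATOMIC_RELAXED)

/* Write a pagetable entry: take the member's address, so no cast is needed. */
static inline void write_pte(pte_t *p, pte_t pte)
{
    write_atomic(&p->pte, pte.pte);
}

/* Read a pagetable entry: load the scalar and rebuild the wrapper type. */
static inline pte_t read_pte(pte_t *p)
{
    pte_t ret = { .pte = read_atomic(&p->pte) };

    return ret;
}
```

Because only the scalar member ever reaches the atomic helpers, neither
const nor volatile qualifiers need to be cast away.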

Jan



* Re: [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-27 10:29   ` Jan Beulich
@ 2024-08-28  9:53     ` oleksii.kurochko
  2024-08-28 10:44       ` Jan Beulich
  2024-08-30 11:55     ` oleksii.kurochko
  1 sibling, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-28  9:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-27 at 12:29 +0200, Jan Beulich wrote:
> > 
> > +
> > +/*
> > + * Direct access to xen_fixmap[] should only happen when {set,
> > + * clear}_fixmap() is unusable (e.g. where we would end up to
> > + * recursively call the helpers).
> > + */
> > +extern pte_t xen_fixmap[];
> 
> I'm afraid I keep being irritated by the comment: What recursive use
> of
> helpers is being talked about here? I can't see anything recursive in
> this
> patch. If this starts happening with a subsequent patch, then you
> have
> two options: Move the declaration + comment there, or clarify in the
> description (in enough detail) what this is about.
This comment is added because of:
```
void *__init pmap_map(mfn_t mfn)
  ...
       /*
        * We cannot use set_fixmap() here. We use PMAP when the domain map
        * page infrastructure is not yet initialized, so map_pages_to_xen() called
        * by set_fixmap() needs to map pages on demand, which then calls pmap()
        * again, resulting in a loop. Modify the PTEs directly instead. The same
        * is true for pmap_unmap().
        */
       arch_pmap_map(slot, mfn);
   ...
```
And it happens because set_fixmap() could be defined using generic PT
helpers, which would lead to recursive behaviour when there is no
direct map:
   static pte_t *map_table(mfn_t mfn)
   {
       /*
        * During early boot, map_domain_page() may be unusable. Use the
        * PMAP to map temporarily a page-table.
        */
       if ( system_state == SYS_STATE_early_boot )
           return pmap_map(mfn);
       ...
   }

> 
> > @@ -81,6 +82,18 @@ static inline void flush_page_to_ram(unsigned
> > long mfn, bool sync_icache)
> >      BUG_ON("unimplemented");
> >  }
> >  
> > +/* Write a pagetable entry. */
> > +static inline void write_pte(pte_t *p, pte_t pte)
> > +{
> > +    write_atomic(p, pte);
> > +}
> > +
> > +/* Read a pagetable entry. */
> > +static inline pte_t read_pte(pte_t *p)
> > +{
> > +    return read_atomic(p);
> 
> This only works because of the strange type trickery you're playing
> in
> read_atomic(). Look at x86 code - there's a strict expectation that
> the
> type can be converted to/from unsigned long. And page table accessors
> are written with that taken into consideration. Same goes for
> write_pte()
> of course, with the respective comment on the earlier patch in mind.
I will check the x86 code. My answer on the read/write_atomic() patch
probably suggests that too. Based on the answers to that patch, I think
we can go with the x86 approach.

Thanks.

~ Oleksii

> 
> Otoh I see that Arm does something very similar. If you have a strong
> need / desire to follow that, then please at least split the two
> entirely separate aspects that patch 1 presently changes both in one
> go.
> 
> Jan




* Re: [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-28  9:53     ` oleksii.kurochko
@ 2024-08-28 10:44       ` Jan Beulich
  2024-08-30 11:01         ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-28 10:44 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 28.08.2024 11:53, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-27 at 12:29 +0200, Jan Beulich wrote:
>>>
>>> +
>>> +/*
>>> + * Direct access to xen_fixmap[] should only happen when {set,
>>> + * clear}_fixmap() is unusable (e.g. where we would end up to
>>> + * recursively call the helpers).
>>> + */
>>> +extern pte_t xen_fixmap[];
>>
>> I'm afraid I keep being irritated by the comment: What recursive use
>> of
>> helpers is being talked about here? I can't see anything recursive in
>> this
>> patch. If this starts happening with a subsequent patch, then you
>> have
>> two options: Move the declaration + comment there, or clarify in the
>> description (in enough detail) what this is about.
> This comment is added because of:
> ```
> void *__init pmap_map(mfn_t mfn)
>   ...
>        /*
>         * We cannot use set_fixmap() here. We use PMAP when the domain map
>         * page infrastructure is not yet initialized, so map_pages_to_xen() called
>         * by set_fixmap() needs to map pages on demand, which then calls pmap()
>         * again, resulting in a loop. Modify the PTEs directly instead. The same
>         * is true for pmap_unmap().
>         */
>        arch_pmap_map(slot, mfn);
>    ...
> ```
> And it happens because set_fixmap() could be defined using generic PT
> helpers

As you say - could be. If I'm not mistaken no set_fixmap() implementation
exists even by the end of the series. Fundamentally I'd expect set_fixmap()
to (possibly) use xen_fixmap[] directly. That in turn ...

> so what will lead to recursive behaviour when there is no
> direct map:

... would mean no recursion afaict. Hence why clarification is needed as
to what's going on here _and_ what's planned.

Jan

>    static pte_t *map_table(mfn_t mfn)
>    {
>        /*
>         * During early boot, map_domain_page() may be unusable. Use the
>         * PMAP to map temporarily a page-table.
>         */
>        if ( system_state == SYS_STATE_early_boot )
>            return pmap_map(mfn);
>        ...
>    }




* Re: [PATCH v5 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-27 13:44   ` Jan Beulich
@ 2024-08-28 10:56     ` oleksii.kurochko
  2024-08-28 11:55       ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-28 10:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-27 at 15:44 +0200, Jan Beulich wrote:
> On 21.08.2024 18:06, Oleksii Kurochko wrote:

> 
> > --- a/xen/arch/riscv/include/asm/smp.h
> > +++ b/xen/arch/riscv/include/asm/smp.h
> > @@ -5,6 +5,10 @@
> >  #include <xen/cpumask.h>
> >  #include <xen/percpu.h>
> >  
> > +#include <asm/processor.h>
> > +
> > +#define INVALID_HARTID ULONG_MAX
> 
> So what if the firmware report this value for one of the harts?
It could be an issue, but in my opinion, there is a small chance that
the firmware will use such a high number. I can add a BUG_ON() in
start_xen() to check that bootcpu_id is not equal to INVALID_HARTID to
ensure that the firmware does not report this value. Otherwise, we
would need to add a 'bool valid;' to struct pcpu_info and use it
instead of INVALID_HARTID.

> > --- a/xen/arch/riscv/setup.c
> > +++ b/xen/arch/riscv/setup.c
> > @@ -8,6 +8,7 @@
> >  #include <public/version.h>
> >  
> >  #include <asm/early_printk.h>
> > +#include <asm/smp.h>
> >  #include <asm/traps.h>
> >  
> >  void arch_get_xen_caps(xen_capabilities_info_t *info)
> > @@ -40,6 +41,10 @@ void __init noreturn start_xen(unsigned long
> > bootcpu_id,
> >  {
> >      remove_identity_mapping();
> >  
> > +    set_processor_id(0);
> 
> This isn't really needed, is it? The pcpu_info[] initializer already
> installs the necessary 0. Another thing would be if the initializer
> set the field to, say, NR_CPUS.
> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/smp.c
> > @@ -0,0 +1,21 @@
> > +#include <xen/smp.h>
> > +
> > +/*
> > + * FIXME: make pcpu_info[] dynamically allocated when necessary
> > + *        functionality will be ready
> > + */
> > +/* tp points to one of these per cpu */
> > +struct pcpu_info pcpu_info[NR_CPUS] = { { 0, INVALID_HARTID } };
> 
> As to the initializer - what about CPUs other than CPU0? Would they
> better all have hart_id set to invalid?
I thought about that, but I decided that if we have INVALID_HARTID as
hart_id and the hart_id is checked in the appropriate places, then it
doesn't really matter what the processor_id member of struct pcpu_info
is. For clarity, it might be better to set it to an invalid value, but
it isn't clear which value we should choose as invalid. I assume that
NR_CPUS is a good candidate for that?

> 
> Also, as a pretty strong suggestion to avoid excessive churn going
> forward: Please consider using dedicated initializers here. IOW
> perhaps
> 
> struct pcpu_info pcpu_info[NR_CPUS] = { [0 ... NR_CPUS - 1] = {
>     .hart_id = INVALID_HARTID,
> }};
> 
> Yet as said earlier - in addition you likely want to make sure no
> two CPUs have (part of) their struct instance in the same cache line.
> That won't matter right now, as you have no fields you alter at
> runtime, but I expect such fields will appear.
Is my understanding correct that adding __cacheline_aligned will be
sufficient:
   struct pcpu_info {
   ...
   } __cacheline_aligned;


> 
> > +void setup_tp(unsigned int cpuid)
> > +{
> > +    /*
> > +     * tp register contains an address of physical cpu
> > information.
> > +     * So write physical CPU info of cpuid to tp register.
> > +     * It will be used later by get_processor_id() ( look at
> > +     * <asm/processor.h> ):
> > +     *   #define get_processor_id()    (tp->processor_id)
> > +     */
> > +    asm volatile ( "mv tp, %0"
> > +                   :: "r" ((unsigned long)&pcpu_info[cpuid]) :
> > "memory" );
> > +}
> 
> So you've opted to still do this in C. Which means there's still a
> residual risk of the compiler assuming it can already to tp. What's
> the problem with doing this properly in assembly?
There is no problem, and to be on the safe side I will rewrite it in
assembly.

> 
> As to the memory clobber - in an isolated, non-inline function its
> significance is reduced mostly to the case of LTO (which I'm not
> sure you even target). Nevertheless probably worth keeping, even if
> mainly for documentation purposes. Provided of course this C function
> is to remain.
> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/smpboot.c
> > @@ -0,0 +1,8 @@
> > +#include <xen/init.h>
> > +#include <xen/sections.h>
> > +#include <xen/smp.h>
> > +
> > +void __init smp_set_bootcpu_id(unsigned long boot_cpu_hartid)
> > +{
> > +    cpuid_to_hartid(0) = boot_cpu_hartid;
> > +}
> 
> Does this really need its own function?
No, there is no such need. I will drop it.

Thanks.

~ Oleksii



* Re: [PATCH v5 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-28 10:56     ` oleksii.kurochko
@ 2024-08-28 11:55       ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2024-08-28 11:55 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 28.08.2024 12:56, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-27 at 15:44 +0200, Jan Beulich wrote:
>> On 21.08.2024 18:06, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/smp.h
>>> +++ b/xen/arch/riscv/include/asm/smp.h
>>> @@ -5,6 +5,10 @@
>>>  #include <xen/cpumask.h>
>>>  #include <xen/percpu.h>
>>>  
>>> +#include <asm/processor.h>
>>> +
>>> +#define INVALID_HARTID ULONG_MAX
>>
>> So what if the firmware report this value for one of the harts?
> It could be an issue, but in my opinion, there is a small chance that
> the firmware will use such a high number. I can add a BUG_ON() in
> start_xen() to check that bootcpu_id is not equal to INVALID_HARTID to
> ensure that the firmware does not report this value. Otherwise, we
> would need to add a 'bool valid;' to struct pcpu_info and use it
> instead of INVALID_HARTID.

Which route to go largely depends on expectations to actual hardware
we're intending Xen to be usable on.

>>> --- a/xen/arch/riscv/setup.c
>>> +++ b/xen/arch/riscv/setup.c
>>> @@ -8,6 +8,7 @@
>>>  #include <public/version.h>
>>>  
>>>  #include <asm/early_printk.h>
>>> +#include <asm/smp.h>
>>>  #include <asm/traps.h>
>>>  
>>>  void arch_get_xen_caps(xen_capabilities_info_t *info)
>>> @@ -40,6 +41,10 @@ void __init noreturn start_xen(unsigned long
>>> bootcpu_id,
>>>  {
>>>      remove_identity_mapping();
>>>  
>>> +    set_processor_id(0);
>>
>> This isn't really needed, is it? The pcpu_info[] initializer already
>> installs the necessary 0. Another thing would be if the initializer
>> set the field to, say, NR_CPUS.

As suggested here, ...

>>> --- /dev/null
>>> +++ b/xen/arch/riscv/smp.c
>>> @@ -0,0 +1,21 @@
>>> +#include <xen/smp.h>
>>> +
>>> +/*
>>> + * FIXME: make pcpu_info[] dynamically allocated when necessary
>>> + *        functionality will be ready
>>> + */
>>> +/* tp points to one of these per cpu */
>>> +struct pcpu_info pcpu_info[NR_CPUS] = { { 0, INVALID_HARTID } };
>>
>> As to the initializer - what about CPUs other than CPU0? Would they
>> better all have hart_id set to invalid?
> I thought about that, but I decided that if we have INVALID_HARTID as
> hart_id and the hart_id is checked in the appropriate places, then it
> doesn't really matter what the processor_id member of struct pcpu_info
> is. For clarity, it might be better to set it to an invalid value, but
> it isn't clear which value we should choose as invalid. I assume that
> NR_CPUS is a good candidate for that?

... yes. With that you'd also avoid the need for a "valid" flag: An
entry's hart ID would be valid (no matter which value) if its
processor_id field is valid (less than NR_CPUS).
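
A minimal sketch of that scheme, assuming the two-field layout implied by the
{ 0, INVALID_HARTID } initializer in the patch. The field types, the NR_CPUS
value and the two helper functions are hypothetical and for illustration only.

```c
#include <limits.h>
#include <stdbool.h>

#define NR_CPUS        4          /* placeholder value for this sketch */
#define INVALID_HARTID ULONG_MAX

struct pcpu_info {
    unsigned int processor_id;    /* assumed type */
    unsigned long hart_id;        /* assumed type */
};

/* Range designators ([a ... b]) are a GNU C extension (GCC and Clang). */
struct pcpu_info pcpu_info[NR_CPUS] = { [0 ... NR_CPUS - 1] = {
    .processor_id = NR_CPUS,      /* NR_CPUS serves as "no CPU assigned" */
    .hart_id = INVALID_HARTID,
}};

/* Hypothetical helper: an entry is in use once its processor_id is valid. */
static bool pcpu_entry_valid(unsigned int cpu)
{
    return cpu < NR_CPUS && pcpu_info[cpu].processor_id < NR_CPUS;
}

/* Hypothetical helper: record the hart ID for a logical CPU. */
static void pcpu_entry_set(unsigned int cpu, unsigned long hart_id)
{
    pcpu_info[cpu].processor_id = cpu;
    pcpu_info[cpu].hart_id = hart_id;
}
```

With validity keyed on processor_id, a hart ID equal to ULONG_MAX reported by
firmware no longer clashes with the "unused" state of an entry.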

>> Also, as a pretty strong suggestion to avoid excessive churn going
>> forward: Please consider using dedicated initializers here. IOW
>> perhaps
>>
>> struct pcpu_info pcpu_info[NR_CPUS] = { [0 ... NR_CPUS - 1] = {
>>     .hart_id = INVALID_HARTID,
>> }};
>>
>> Yet as said earlier - in addition you likely want to make sure no
>> two CPUs have (part of) their struct instance in the same cache line.
>> That won't matter right now, as you have no fields you alter at
>> runtime, but I expect such fields will appear.
> Is my understanding correct that adding __cacheline_aligned will be
> sufficient:
>    struct pcpu_info {
>    ...
>    } __cacheline_aligned;

Yes, that's what we do elsewhere.
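
For illustration, a sketch of what the type attribute buys. The macro below is
a stand-in for Xen's __cacheline_aligned, and the 64-byte line size, the field
types and the NR_CPUS value are assumptions of this example.

```c
#include <stdint.h>

#define SMP_CACHE_BYTES 64   /* assumed cache line size */

/* Stand-in for __cacheline_aligned (assumption: GCC/Clang attribute). */
#define cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))

struct pcpu_info {
    unsigned int processor_id;
    unsigned long hart_id;
} cacheline_aligned;

#define NR_CPUS 4            /* placeholder value for this sketch */

struct pcpu_info pcpu_info[NR_CPUS];

/*
 * Aligning the type also pads sizeof() up to a multiple of the alignment,
 * so consecutive array elements never share a cache line.
 */
_Static_assert(sizeof(struct pcpu_info) % SMP_CACHE_BYTES == 0,
               "pcpu_info entries must not share a cache line");
```

The attribute has to sit on the struct type rather than on the array
variable; on the variable it would align only the start of the array.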

Jan



* Re: [PATCH v5 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-27 14:19   ` Jan Beulich
@ 2024-08-28 13:11     ` oleksii.kurochko
  2024-08-28 15:03       ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-28 13:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-27 at 16:19 +0200, Jan Beulich wrote:
> On 21.08.2024 18:06, Oleksii Kurochko wrote:
> > --- a/xen/arch/riscv/include/asm/sbi.h
> > +++ b/xen/arch/riscv/include/asm/sbi.h
> > @@ -31,4 +64,34 @@ struct sbiret sbi_ecall(unsigned long ext,
> > unsigned long fid,
> >   */
> >  void sbi_console_putchar(int ch);
> >  
> > +/*
> > + * Check underlying SBI implementation has RFENCE
> > + *
> > + * @return true for supported AND false for not-supported
> > + */
> > +bool sbi_has_rfence(void);
> > +
> > +/*
> > + * Instructs the remote harts to execute one or more SFENCE.VMA
> > + * instructions, covering the range of virtual addresses between
> > + * [start_addr, start_addr + size).
> > + *
> > + * Returns 0 if IPI was sent to all the targeted harts
> > successfully
> > + * or negative value if start_addr or size is not valid.
> > + *
> > + * @hart_mask a cpu mask containing all the target harts.
> > + * @param start virtual address start
> > + * @param size virtual address range size
> > + */
> > +int sbi_remote_sfence_vma(const cpumask_t *cpu_mask,
> > +                          unsigned long start_addr,
> > +                          unsigned long size);
> 
> I may have asked before: Why not vaddr_t and size_t respectively?
Just to follow how these arguments are declared in the RISC-V SBI spec.
But considering that the prototype of this function has already been
changed, I think we can also change the types of start_addr and size.

> 
> > @@ -38,7 +51,265 @@ struct sbiret sbi_ecall(unsigned long ext,
> > unsigned long fid,
> >      return ret;
> >  }
> >  
> > +static int sbi_err_map_xen_errno(int err)
> > +{
> > +    switch ( err )
> > +    {
> > +    case SBI_SUCCESS:
> > +        return 0;
> > +    case SBI_ERR_DENIED:
> > +        return -EACCES;
> > +    case SBI_ERR_INVALID_PARAM:
> > +        return -EINVAL;
> > +    case SBI_ERR_INVALID_ADDRESS:
> > +        return -EFAULT;
> > +    case SBI_ERR_NOT_SUPPORTED:
> > +        return -EOPNOTSUPP;
> > +    case SBI_ERR_FAILURE:
> > +        fallthrough;
> > +    default:
> 
> What's the significance of the "fallthrough" here?
To indicate that the fallthrough from the case SBI_ERR_FAILURE label to
the default label is intentional and should not be diagnosed by a compiler
that warns on fallthrough. Or is it needed only when fallthrough happens
between a switch's case labels ( not the default label ), like in the
following code ( should there be a fallthrough here? ):
    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:

Additionally, I am considering whether the case SBI_ERR_FAILURE should
be removed or if we should find the appropriate Xen error code for this
case. I am uncertain which Xen error code from xen/errno.h would be
appropriate.

> 
> > +static unsigned long sbi_major_version(void)
> > +{
> > +    return MASK_EXTR(sbi_spec_version,
> > SBI_SPEC_VERSION_MAJOR_MASK);
> > +}
> > +
> > +static unsigned long sbi_minor_version(void)
> > +{
> > +    return MASK_EXTR(sbi_spec_version,
> > SBI_SPEC_VERSION_MINOR_MASK);
> > +}
> 
> Both functions return less than 32-bit wide values. Why unsigned long
> return types?
We had this discussion in the previous patch series. Please look here:
https://lore.kernel.org/xen-devel/253638c4-2256-4bdd-9f12-7f99e373355e@suse.com/

If it would be better, I can add a comment to these functions explaining
why they return 'unsigned long'.

> 
> > +                               unsigned long arg4)
> > +{
> > +    struct sbiret ret = {0};
> > +    int result = 0;
> > +
> > +    switch ( fid )
> > +    {
> > +    case SBI_EXT_RFENCE_REMOTE_FENCE_I:
> > +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> > +                        0, 0, 0, 0);
> > +        break;
> > +
> > +    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
> > +    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
> > +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
> > +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> > +                        start, size, 0, 0);
> > +        break;
> > +
> > +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
> > +    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID:
> > +    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID:
> > +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> > +                        start, size, arg4, 0);
> > +        break;
> > +
> > +    default:
> > +        printk("%s: unknown function ID [%lu]\n",
> 
> I wonder how useful the logging in decimal of (perhaps large) unknown
> values is.
Agreed, it would be much better in hex.

> 
> > +               __func__, fid);
> > +        result = -EINVAL;
> > +        break;
> > +    };
> > +
> > +    if ( ret.error )
> > +    {
> > +        result = sbi_err_map_xen_errno(ret.error);
> > +        printk("%s: hbase=%lu hmask=%#lx failed (error %d)\n",
> > +               __func__, hbase, hmask, result);
> 
> Considering that sbi_err_map_xen_errno() may lose information, I'd
> recommend logging ret.error here.
By 'lose information' you mean case SBI_ERR_FAILURE?

> 
> > +static int cf_check sbi_rfence_v02(unsigned long fid,
> > +                                   const cpumask_t *cpu_mask,
> > +                                   unsigned long start, unsigned
> > long size,
> > +                                   unsigned long arg4, unsigned
> > long arg5)
> > +{
> > +    unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
> > +    int result;
> > +
> > +    /*
> > +     * hart_mask_base can be set to -1 to indicate that hart_mask
> > can be
> > +     * ignored and all available harts must be considered.
> > +     */
> > +    if ( !cpu_mask )
> > +        return sbi_rfence_v02_real(fid, 0UL, -1UL, start, size,
> > arg4);
> > +
> > +    for_each_cpu ( cpuid, cpu_mask )
> > +    {
> > +        /*
> > +        * Hart IDs might not necessarily be numbered contiguously
> > in
> > +        * a multiprocessor system, but at least one hart must have
> > a
> > +        * hart ID of zero.
> 
> Does this latter fact matter here in any way?
It doesn't; I just copied the full sentence from the RISC-V spec. If it
would be better to drop the latter fact, I will be happy to do that in
the next patch version.

> 
> > +        * This means that it is possible for the hart ID mapping
> > to look like:
> > +        *  0, 1, 3, 65, 66, 69
> > +        * In such cases, more than one call to
> > sbi_rfence_v02_real() will be
> > +        * needed, as a single hmask can only cover sizeof(unsigned
> > long) CPUs:
> > +        *  1. sbi_rfence_v02_real(hmask=0b1011, hbase=0)
> > +        *  2. sbi_rfence_v02_real(hmask=0b1011, hbase=65)
> > +        *
> > +        * The algorithm below tries to batch as many harts as
> > possible before
> > +        * making an SBI call. However, batching may not always be
> > possible.
> > +        * For example, consider the hart ID mapping:
> > +        *   0, 64, 1, 65, 2, 66
> 
> Just to mention it: Batching is also possible here: First (0,1,2),
> then
> (64,65,66). It just requires a different approach. Whether switching
> is
> worthwhile depends on how numbering is done on real world systems.
For sure, it's possible to do that. I was just trying to describe the
currently implemented algorithm. If you think it's beneficial to add
that information to the comment, I can include it as well.

> 
> > +static int (* __ro_after_init sbi_rfence)(unsigned long fid,
> > +                                          const cpumask_t
> > *cpu_mask,
> > +                                          unsigned long start,
> > +                                          unsigned long size,
> > +                                          unsigned long arg4,
> > +                                          unsigned long arg5);
> > +
> > +int sbi_remote_sfence_vma(const cpumask_t *cpu_mask,
> > +                          unsigned long start_addr,
> 
> To match other functions, perhaps just "start"?
It would be better; the RISC-V spec uses 'start' everywhere too, at
least for the RFENCE extension.

> 
> > +int sbi_probe_extension(long extid)
> > +{
> > +    struct sbiret ret;
> > +
> > +    ret = sbi_ecall(SBI_EXT_BASE, SBI_EXT_BASE_PROBE_EXT, extid,
> > +                    0, 0, 0, 0, 0);
> > +    if ( !ret.error && ret.value )
> > +        return ret.value;
> > +
> > +    return -EOPNOTSUPP;
> 
> Any reason not to use sbi_err_map_xen_errno() here?
We can, just missed that.

Thanks.

~ Oleksii


* Re: [PATCH v5 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-28 13:11     ` oleksii.kurochko
@ 2024-08-28 15:03       ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2024-08-28 15:03 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 28.08.2024 15:11, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-27 at 16:19 +0200, Jan Beulich wrote:
>> On 21.08.2024 18:06, Oleksii Kurochko wrote:
>>> @@ -38,7 +51,265 @@ struct sbiret sbi_ecall(unsigned long ext,
>>> unsigned long fid,
>>>      return ret;
>>>  }
>>>  
>>> +static int sbi_err_map_xen_errno(int err)
>>> +{
>>> +    switch ( err )
>>> +    {
>>> +    case SBI_SUCCESS:
>>> +        return 0;
>>> +    case SBI_ERR_DENIED:
>>> +        return -EACCES;
>>> +    case SBI_ERR_INVALID_PARAM:
>>> +        return -EINVAL;
>>> +    case SBI_ERR_INVALID_ADDRESS:
>>> +        return -EFAULT;
>>> +    case SBI_ERR_NOT_SUPPORTED:
>>> +        return -EOPNOTSUPP;
>>> +    case SBI_ERR_FAILURE:
>>> +        fallthrough;
>>> +    default:
>>
>> What's the significance of the "fallthrough" here?
> To indicate that the fallthrough from the case SBI_ERR_FAILURE and
> default labels is intentional and should not be diagnosed by a compiler
> that warns on fallthrough. Or it is needed only when fallthough happen
> between switch's cases label ( not default label ) like in the
> following code ( should it be fallthrough here? ):
>     case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
>     case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
>     case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:

No, it's also not needed there. It's only needed when there are statements
in between.
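
A small sketch of the distinction. The fallthrough macro below is a stand-in
for Xen's pseudo-keyword (an assumption of this example), and classify() is a
made-up function.

```c
/* Stand-in for the fallthrough pseudo-keyword (assumption). */
#if defined(__GNUC__) && __GNUC__ >= 7
# define fallthrough __attribute__((__fallthrough__))
#else
# define fallthrough do {} while ( 0 )
#endif

static int classify(int v)
{
    int score = 0;

    switch ( v )
    {
    case 1:
    case 2:            /* adjacent labels: nothing to annotate */
        score = 10;
        break;

    case 3:
        score += 1;    /* a statement sits before the next label ... */
        fallthrough;   /* ... so falling through needs the annotation */
    case 4:
        score += 2;
        break;

    default:
        score = -1;
        break;
    }

    return score;
}
```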

> Additionally, I am considering whether the case SBI_ERR_FAILURE should
> be removed or if we should find the appropriate Xen error code for this
> case. I am uncertain which Xen error code from xen/errno.h would be
> appropriate.

There's nothing really suitable, I fear.

>>> +static unsigned long sbi_major_version(void)
>>> +{
>>> +    return MASK_EXTR(sbi_spec_version,
>>> SBI_SPEC_VERSION_MAJOR_MASK);
>>> +}
>>> +
>>> +static unsigned long sbi_minor_version(void)
>>> +{
>>> +    return MASK_EXTR(sbi_spec_version,
>>> SBI_SPEC_VERSION_MINOR_MASK);
>>> +}
>>
>> Both functions return less than 32-bit wide values. Why unsigned long
>> return types?
> We had this discussion in the previous patch series. Please look here:
> https://lore.kernel.org/xen-devel/253638c4-2256-4bdd-9f12-7f99e373355e@suse.com/

That was for the variables used here, not the functions. The functions
clip the values in the variables enough to no longer warrant wider-
than-int.
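
To illustrate: assuming the mask values follow the SBI spec version layout
(major in bits 30:24, minor in bits 23:0) and MASK_EXTR() has its usual Xen
definition, the extracted values fit in 7 and 24 bits respectively. Unlike in
the patch, the version is passed as a parameter here to keep the sketch
self-contained.

```c
/* Assumed mask values, following the SBI spec version layout. */
#define SBI_SPEC_VERSION_MAJOR_MASK 0x7f000000UL
#define SBI_SPEC_VERSION_MINOR_MASK 0x00ffffffUL

/* Extract a field: mask it, then divide by the mask's lowest set bit. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

/* The result is at most 0x7f, so unsigned int is wide enough. */
static unsigned int sbi_major_version(unsigned long spec_version)
{
    return MASK_EXTR(spec_version, SBI_SPEC_VERSION_MAJOR_MASK);
}

/* The result is at most 0xffffff, so unsigned int is wide enough. */
static unsigned int sbi_minor_version(unsigned long spec_version)
{
    return MASK_EXTR(spec_version, SBI_SPEC_VERSION_MINOR_MASK);
}
```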

>>> +    if ( ret.error )
>>> +    {
>>> +        result = sbi_err_map_xen_errno(ret.error);
>>> +        printk("%s: hbase=%lu hmask=%#lx failed (error %d)\n",
>>> +               __func__, hbase, hmask, result);
>>
>> Considering that sbi_err_map_xen_errno() may lose information, I'd
>> recommend logging ret.error here.
> By 'lose information' you mean case SBI_ERR_FAILURE?

Or anything else hitting the default label there.

>>> +static int cf_check sbi_rfence_v02(unsigned long fid,
>>> +                                   const cpumask_t *cpu_mask,
>>> +                                   unsigned long start, unsigned
>>> long size,
>>> +                                   unsigned long arg4, unsigned
>>> long arg5)
>>> +{
>>> +    unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
>>> +    int result;
>>> +
>>> +    /*
>>> +     * hart_mask_base can be set to -1 to indicate that hart_mask
>>> can be
>>> +     * ignored and all available harts must be considered.
>>> +     */
>>> +    if ( !cpu_mask )
>>> +        return sbi_rfence_v02_real(fid, 0UL, -1UL, start, size,
>>> arg4);
>>> +
>>> +    for_each_cpu ( cpuid, cpu_mask )
>>> +    {
>>> +        /*
>>> +        * Hart IDs might not necessarily be numbered contiguously
>>> in
>>> +        * a multiprocessor system, but at least one hart must have
>>> a
>>> +        * hart ID of zero.
>>
>> Does this latter fact matter here in any way?
> It doesn't, just copy from the RISC-V spec the full sentence. If it
> would be better to drop the latter fact I will be happy to do that in
> the next patch version.

You may certainly leave extra information, but then you want to somehow
express which part is relevant and which part is "extra". One way of
achieving such would imo be to actually state that you're quoting from
some spec.

>>> +        * This means that it is possible for the hart ID mapping
>>> to look like:
>>> +        *  0, 1, 3, 65, 66, 69
>>> +        * In such cases, more than one call to
>>> sbi_rfence_v02_real() will be
>>> +        * needed, as a single hmask can only cover sizeof(unsigned
>>> long) CPUs:
>>> +        *  1. sbi_rfence_v02_real(hmask=0b1011, hbase=0)
>>> +        *  2. sbi_rfence_v02_real(hmask=0b1011, hbase=65)
>>> +        *
>>> +        * The algorithm below tries to batch as many harts as
>>> possible before
>>> +        * making an SBI call. However, batching may not always be
>>> possible.
>>> +        * For example, consider the hart ID mapping:
>>> +        *   0, 64, 1, 65, 2, 66
>>
>> Just to mention it: Batching is also possible here: First (0,1,2),
>> then
>> (64,65,66). It just requires a different approach. Whether switching
>> is
>> worthwhile depends on how numbering is done on real world systems.
> For sure, it's possible to do that. I was just trying to describe the
> currently implemented algorithm. If you think it's beneficial to add
> that information to the comment, I can include it as well.

What I'd like to ask for is that you make a difference between "cannot"
and "we don't".
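
As a rough sketch of such a different approach (not the algorithm in the
patch): sort the hart IDs first, then open a new batch whenever an ID no
longer fits into the current 64-bit window. All names below are hypothetical,
and a 64-bit hart mask is assumed.

```c
#include <stdint.h>
#include <stdlib.h>

#define HART_MASK_BITS 64        /* assumed width of one hart mask */

struct hart_batch {
    uint64_t hbase;              /* lowest hart ID covered by hmask */
    uint64_t hmask;              /* bit n set: hart hbase + n is targeted */
};

static int cmp_hartid(const void *a, const void *b)
{
    const uint64_t *x = a, *y = b;

    return (*x > *y) - (*x < *y);
}

/*
 * Sorts ids[] in place and fills out[] (which must have room for n
 * entries). Returns the number of batches, i.e. of SBI calls needed.
 */
static size_t batch_harts(uint64_t *ids, size_t n, struct hart_batch *out)
{
    size_t nr = 0;

    qsort(ids, n, sizeof(*ids), cmp_hartid);

    for ( size_t i = 0; i < n; i++ )
    {
        /* Open a new batch if this ID does not fit the current window. */
        if ( nr == 0 || ids[i] - out[nr - 1].hbase >= HART_MASK_BITS )
        {
            out[nr].hbase = ids[i];
            out[nr].hmask = 0;
            nr++;
        }

        out[nr - 1].hmask |= UINT64_C(1) << (ids[i] - out[nr - 1].hbase);
    }

    return nr;
}
```

For the mapping 0, 64, 1, 65, 2, 66 this yields two batches, (0,1,2) and
(64,65,66). Whether the extra sorting step pays off depends on how harts
are numbered on real systems.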

Jan



* Re: [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-27 15:00   ` Jan Beulich
@ 2024-08-28 16:11     ` oleksii.kurochko
  2024-08-29  7:01       ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-28 16:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote:
> On 21.08.2024 18:06, Oleksii Kurochko wrote:
> > Implement map_pages_to_xen() which requires several
> > functions to manage page tables and entries:
> > - pt_update()
> > - pt_mapping_level()
> > - pt_update_entry()
> > - pt_next_level()
> > - pt_check_entry()
> > 
> > To support these operations, add functions for creating,
> > mapping, and unmapping Xen tables:
> > - create_table()
> > - map_table()
> > - unmap_table()
> > 
> > Introduce internal macros starting with PTE_* for convenience.
> > These macros closely resemble PTE bits, with the exception of
> > PTE_SMALL, which indicates that 4KB is needed.
> 
> What macros are you talking about here? Is this partially stale, as
> only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being
> added?
I am speaking about macros connected to masks:
     #define PTE_R_MASK(x)   ((x) & PTE_READABLE)
     #define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
     #define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
   
     #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
   PTE_EXECUTABLE))

> 
> > --- a/xen/arch/riscv/include/asm/flushtlb.h
> > +++ b/xen/arch/riscv/include/asm/flushtlb.h
> > @@ -5,12 +5,24 @@
> >  #include <xen/bug.h>
> >  #include <xen/cpumask.h>
> >  
> > +#include <asm/sbi.h>
> > +
> >  /* Flush TLB of local processor for address va. */
> >  static inline void flush_tlb_one_local(vaddr_t va)
> >  {
> >      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
> >  }
> >  
> > +/*
> > + * Flush a range of VA's hypervisor mappings from the TLB of all
> > + * processors in the inner-shareable domain.
> > + */
> 
> Isn't inner-sharable an Arm term? Don't you simply mean "all" here?
Yes, this is an Arm term. It should say "all" instead. Thanks.

> 
> > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
> >      return p.pte & PTE_VALID;
> >  }
> >  
> > +inline bool pte_is_table(const pte_t p)
> > +{
> > +    return ((p.pte & (PTE_VALID |
> > +                      PTE_READABLE |
> > +                      PTE_WRITABLE |
> > +                      PTE_EXECUTABLE)) == PTE_VALID);
> > +}
> 
> In how far is the READABLE check valid here? You (imo correctly) ...
> 
> > +static inline bool pte_is_mapping(const pte_t p)
> > +{
> > +    return (p.pte & PTE_VALID) &&
> > +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
> > +}
> 
> ... don't consider this bit here.
pte_is_mapping() seems correct to me, as according to the RISC-V
privileged spec:
   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
      step 5. Otherwise, this PTE is a pointer to the next level of
      the page table.
   5. A leaf PTE has been found. ...

Regarding pte_is_table(), the READABLE check is valid, as we only have
to check that pte.r = pte.x = 0. The WRITABLE check should be dropped.
Or just define pte_is_table() as:
   inline bool pte_is_table(const pte_t p)
   {
   	return !pte_is_mapping(p);
   }


> 
> > --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> > +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> > @@ -164,6 +164,7 @@
> >  #define SSTATUS_SD			SSTATUS64_SD
> >  #define SATP_MODE			SATP64_MODE
> >  #define SATP_MODE_SHIFT			SATP64_MODE_SHIFT
> > +#define SATP_PPN_MASK			_UL(0x00000FFFFFFFFFFF)
> >  
> >  #define HGATP_PPN			HGATP64_PPN
> >  #define HGATP_VMID_SHIFT		HGATP64_VMID_SHIFT
> 
> This looks odd, padding-wise, but that's because hard tabs are being
> used here. Is that intentional?
I use tabs here because riscv_encoding.h was copied from the Linux
kernel, which uses hard tabs, and the definitions above use 3 tabs, so
I used 3 hard tabs too.

> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/pt.c
> > @@ -0,0 +1,420 @@
> > +#include <xen/bug.h>
> > +#include <xen/domain_page.h>
> > +#include <xen/errno.h>
> > +#include <xen/mm.h>
> > +#include <xen/mm-frame.h>
> > +#include <xen/pmap.h>
> > +#include <xen/spinlock.h>
> > +
> > +#include <asm/flushtlb.h>
> > +#include <asm/page.h>
> > +
> > +static inline const mfn_t get_root_page(void)
> > +{
> > +    paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) <<
> > PAGE_SHIFT;
> > +
> > +    return maddr_to_mfn(root_maddr);
> > +}
> > +
> > +/* Sanity check of the entry. */
> > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
> > flags)
> > +{
> > +    /*
> > +     * See the comment about the possible combination of (mfn,
> > flags) in
> > +     * the comment above pt_update().
> > +     */
> > +
> > +    /* Sanity check when modifying an entry. */
> > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> 
> Perhaps all of these printk()s should be dprintk()?
It could be dprintk() but at the same time I don't see any issue if it
will be printed once.

>  And not have a full
> stop?
By "full stop," do you mean something like panic() or BUG_ON()? The
error is propagated up to the caller, which then calls panic().
An example of this is:
       if ( (offset + size) > MB(2) )
       {
           rc = map_pages_to_xen(BOOT_FDT_VIRT_START + MB(2),
                                 maddr_to_mfn(base_paddr + MB(2)),
                                 MB(2) >> PAGE_SHIFT,
                                 PAGE_HYPERVISOR_RO);
           if ( rc )
               panic("Unable to map the device-tree\n");
       }
If it would be better for some reason to call panic() or BUG_ON() as
soon as pt_check_entry() returns false, I can do it that way as well.

> 
> > +            return false;
> > +        }
> > +
> > +        /* We don't allow modifying a table entry */
> > +        if ( pte_is_table(entry) )
> > +        {
> > +            printk("Modifying a table entry is not allowed.\n");
> > +            return false;
> > +        }
> > +    }
> > +    /* Sanity check when inserting a mapping */
> > +    else if ( flags & PTE_VALID )
> > +    {
> > +        /* We should be here with a valid MFN. */
> > +        ASSERT(!mfn_eq(mfn, INVALID_MFN));
> 
> This is odd to have here, considering the if() further up.
Agreed, the ASSERT() could be dropped.

> 
> > +        /*
> > +         * We don't allow replacing any valid entry.
> > +         *
> > +         * Note that the function pt_update() relies on this
> > +         * assumption and will skip the TLB flush (when Svvptc
> > +         * extension will be ratified). The function will need
> > +         * to be updated if the check is relaxed.
> > +         */
> > +        if ( pte_is_valid(entry) )
> > +        {
> > +            if ( pte_is_mapping(entry) )
> > +                printk("Changing MFN for a valid entry is not
> > allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
> > +                       mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
> > +            else
> > +                printk("Trying to replace a table with a
> > mapping.\n");
> > +            return false;
> > +        }
> > +    }
> > +    /* Sanity check when removing a mapping. */
> > +    else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
> 
> The PTE_VALID part of the check is pointless considering the earlier
> if(). I guess you may want to have it for doc purposes ...
Yes, it just helps to read the code and understand the "confusing"
if()s above.

> 
> Since further up you're using "else if ( flags & PTE_VALID )" imo
> here you want to use "else if ( !(flags & ...) )".
> 
> > +    {
> > +        /* We should be here with an invalid MFN. */
> > +        ASSERT(mfn_eq(mfn, INVALID_MFN));
> > +
> > +        /* We don't allow removing a table */
> > +        if ( pte_is_table(entry) )
> > +        {
> > +            printk("Removing a table is not allowed.\n");
> > +            return false;
> > +        }
> 
> Is this restriction temporary?
Yes.

> 
> > +    }
> > +    /* Sanity check when populating the page-table. No check so
> > far. */
> > +    else
> > +    {
> > +        ASSERT(flags & PTE_POPULATE);
> 
> This again is redundant with earlier if() conditions.
> 
> > +#define XEN_TABLE_MAP_FAILED 0
> > +#define XEN_TABLE_SUPER_PAGE 1
> > +#define XEN_TABLE_NORMAL 2
> > +
> > +/*
> > + * Take the currently mapped table, find the corresponding entry,
> > + * and map the next table, if available.
> > + *
> > + * The alloc_tbl parameters indicates whether intermediate tables
> > should
> > + * be allocated when not present.
> > + *
> > + * Return values:
> > + *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry
> > + *  was empty, or allocating a new page failed.
> > + *  XEN_TABLE_NORMAL: next level or leaf mapped normally
> > + *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
> > + */
> > +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned
> > int offset)
> 
> Having the boolean first is unusual, but well - it's your choice.
> 
> > +{
> > +    pte_t *entry;
> > +    int ret;
> > +    mfn_t mfn;
> > +
> > +    entry = *table + offset;
> > +
> > +    if ( !pte_is_valid(*entry) )
> > +    {
> > +        if ( alloc_tbl )
> > +            return XEN_TABLE_MAP_FAILED;
> 
> Is this condition meant to be inverted?
If alloc_tbl = true we shouldn't allocate a table, as:
     * The intermediate page table shouldn't be allocated when MFN isn't
     * valid and we are not populating page table.
...
    */
    bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) &&
                     !(flags & PTE_POPULATE);

So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this
table shouldn't be allocated and thereby pt_next_level() should return
XEN_TABLE_MAP_FAILED.

Or, to invert if ( alloc_tbl ), the definition of alloc_tbl would need
to be inverted:
 bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE);
> 
> > +        ret = create_table(entry);
> > +        if ( ret )
> > +            return XEN_TABLE_MAP_FAILED;
> 
> You don't really use "ret". Why not omit the local variable, even
> more so that it has too wide scope?
I'll omit that, it is really useless.

> 
> > +/* Update an entry at the level @target. */
> > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > +                           mfn_t mfn, unsigned int target,
> > +                           unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page table shouldn't be allocated when MFN
> > isn't
> > +     * valid and we are not populating page table.
> > +     * This means we either modify permissions or remove an entry,
> > or
> > +     * inserting brand new entry.
> > +     *
> > +     * See the comment above pt_update() for an additional
> > explanation about
> > +     * combinations of (mfn, flags).
> > +    */
> > +    bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags &
> > PTE_POPULATE);
> 
> Is this meant to be inverted, too (to actually match variable name
> and
> comment)?
Oh, you mentioned that too. I wrote something similar above. I think it
would be better to invert it if we want to keep the alloc_tbl variable name.

> 
> > +            break;
> > +    }
> > +
> > +    if ( level != target )
> > +    {
> > +        printk("%s: Shattering superpage is not supported\n",
> > __func__);
> > +        rc = -EOPNOTSUPP;
> > +        goto out;
> > +    }
> > +
> > +    entry = table + offsets[level];
> > +
> > +    rc = -EINVAL;
> > +    if ( !pt_check_entry(*entry, mfn, flags) )
> > +        goto out;
> > +
> > +    /* We are removing the page */
> > +    if ( !(flags & PTE_VALID) )
> > +        memset(&pte, 0x00, sizeof(pte));
> > +    else
> > +    {
> > +        /* We are inserting a mapping => Create new pte. */
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            pte = pte_from_mfn(mfn, PTE_VALID);
> > +        else /* We are updating the permission => Copy the current
> > pte. */
> > +            pte = *entry;
> > +
> > +        /* update permission according to the flags */
> > +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
> 
> When updating an entry, don't you also need to clear (some of) the
> flags?
I am not sure why some flags should be cleared. Here we take only the
flags necessary for the PTE, such as R, W, X; other bits in flags are
ignored.

> 
> > +/* Return the level where mapping should be done */
> > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned
> > long nr,
> > +                            unsigned int flags)
> > +{
> > +    unsigned int level = 0;
> > +    unsigned long mask;
> > +    unsigned int i;
> > +
> > +    /* Use blocking mapping unless the caller requests 4K mapping
> > */
> > +    if ( unlikely(flags & PTE_SMALL) )
> > +        return level;
> > +
> > +    /*
> > +     * Don't take into account the MFN when removing mapping (i.e
> > +     * MFN_INVALID) to calculate the correct target order.
> > +     *
> > +     * `vfn` and `mfn` must be both superpage aligned.
> > +     * They are or-ed together and then checked against the size
> > of
> > +     * each level.
> > +     *
> > +     * `left` is not included and checked separately to allow
> > +     * superpage mapping even if it is not properly aligned (the
> > +     * user may have asked to map 2MB + 4k).
> 
> What is this about? There's nothing named "left" here.
It refers to the "remaining" pages or "leftover" space after trying to
align a mapping to a superpage boundary.
> 
> > +     */
> > +    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
> > +    mask |= vfn;
> > +
> > +    for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- )
> > +    {
> > +        if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) &&
> > +             (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) )
> > +        {
> > +            level = i;
> > +            break;
> > +        }
> > +    }
> > +
> > +    return level;
> > +}
> > +
> > +static DEFINE_SPINLOCK(xen_pt_lock);
> 
> Another largely meaningless xen_ prefix?
Thanks. I'll drop it.

> 
> > +/*
> > + * If `mfn` equals `INVALID_MFN`, it indicates that the following
> > page table
> > + * update operation might be related to either populating the
> > table (
> > + * PTE_POPULATE will be set additionaly), destroying a mapping, or
> > modifying
> > + * an existing mapping.
> 
> And the latter two are distinguished by? PTE_VALID?
inserting -> (PTE_VALID=1 + (mfn=something valid))
destroying-> ( PTE_VALID=0 )

> 
> > + * If `mfn` is valid and flags has PTE_VALID bit set then it means
> > that
> > + * inserting will be done.
> > + */
> 
> What about mfn != INVALID_MFN and PTE_VALID clear?
PTE_VALID=0 will always be treated as destroying, no matter what the
mfn value is, since in this case the removal is done in a way that
doesn't use mfn:
        memset(&pte, 0x00, sizeof(pte));


>  Also note that "`mfn` is
> valid" isn't the same as "mfn != INVALID_MFN". You want to be precise
> here,
> to avoid confusion later on. (I say that knowing that we're still
> fighting
> especially shadow paging code on x86 not having those properly
> separated.)
Is it necessary to be precise, and is "mfn is valid" different from
"mfn != INVALID_MFN", only in the case of shadow paging?

> > 
> 
> > +    unsigned long left = nr_mfns;
> > +
> > +    const mfn_t root = get_root_page();
> > +
> > +    /*
> > +     * It is bad idea to have mapping both writeable and
> > +     * executable.
> > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > +     * prevent any update if this happen.
> > +     */
> > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
> > PTE_X_MASK(flags) )
> 
> Seeing them in use, I wonder about the naming of those PTE_?_MASK()
> macros. Along with the lhs, why not simply (flags & PTE_...)?
Hmm, good point. They can indeed be dropped, which also simplifies the
mentioned if(...).

Thanks.

~ Oleksii



* Re: [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-28 16:11     ` oleksii.kurochko
@ 2024-08-29  7:01       ` Jan Beulich
  2024-08-29 12:04         ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-29  7:01 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 28.08.2024 18:11, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote:
>> On 21.08.2024 18:06, Oleksii Kurochko wrote:
>>> Implement map_pages_to_xen() which requires several
>>> functions to manage page tables and entries:
>>> - pt_update()
>>> - pt_mapping_level()
>>> - pt_update_entry()
>>> - pt_next_level()
>>> - pt_check_entry()
>>>
>>> To support these operations, add functions for creating,
>>> mapping, and unmapping Xen tables:
>>> - create_table()
>>> - map_table()
>>> - unmap_table()
>>>
>>> Introduce internal macros starting with PTE_* for convenience.
>>> These macros closely resemble PTE bits, with the exception of
>>> PTE_SMALL, which indicates that 4KB is needed.
>>
>> What macros are you talking about here? Is this partially stale, as
>> only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being
>> added?
> I am speaking about macros connected to masks:
>      #define PTE_R_MASK(x)   ((x) & PTE_READABLE)
>      #define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
>      #define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
>    
>      #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
>    PTE_EXECUTABLE))

Some of which I did question further down in my reply. But what's
worse - by saying "closely resemble PTE bits, with the exception of
PTE_SMALL" you pretty clearly _do not_ refer to the macros above, but
to PTE_VALID etc.

>>> @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
>>>      return p.pte & PTE_VALID;
>>>  }
>>>  
>>> +inline bool pte_is_table(const pte_t p)
>>> +{
>>> +    return ((p.pte & (PTE_VALID |
>>> +                      PTE_READABLE |
>>> +                      PTE_WRITABLE |
>>> +                      PTE_EXECUTABLE)) == PTE_VALID);
>>> +}
>>
>> In how far is the READABLE check valid here? You (imo correctly) ...

Oh, I wrongly picked on READABLE when it should have been the WRITABLE
bit.

>>> +static inline bool pte_is_mapping(const pte_t p)
>>> +{
>>> +    return (p.pte & PTE_VALID) &&
>>> +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
>>> +}
>>
>> ... don't consider this bit here.
> pte_is_mapping() seems to me is correct as according to the RISC-V
> privileged spec:
>    4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 
>    5. Otherwise, this PTE is a pointer to the next level of the page   
>    table.
>    5. A leaf PTE has been found. ...

Right. And then why do you check all three of r, x, and w, when the doc
mentions only r and x? There may be reasons, but such reasons then need
clearly stating in a code comment, for people to understand why the code
is not following the spec.

> and regarding pte_is_table() READABLE check is valid as we have to
> check only that pte.r = pte.x = 0. WRITABLE check should be dropped. Or
> just use define pte_is_table() as:
>    inline bool pte_is_table(const pte_t p)
>    {
>    	return !pte_is_mapping(p);
>    }

You had it like this earlier on, didn't you? That's wrong, because for a
PTE to describe another page table level PTE_VALID needs to be set.
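
To illustrate the distinction, here is a minimal standalone sketch of the
two predicates following steps 4 and 5 of the translation algorithm quoted
above. The pte_t wrapper and the PTE_* values are redefined locally as
assumptions so that the snippet compiles on its own; they use the RISC-V
V/R/W/X bit positions and are not taken from the patch's headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins (assumptions), using the RISC-V PTE bit positions. */
#define PTE_VALID      (1U << 0)
#define PTE_READABLE   (1U << 1)
#define PTE_WRITABLE   (1U << 2)
#define PTE_EXECUTABLE (1U << 3)

typedef struct { uint64_t pte; } pte_t;

/* Leaf PTE: V = 1 and at least one of R or X set (step 4 leads to step 5). */
static inline bool pte_is_mapping(pte_t p)
{
    return (p.pte & PTE_VALID) &&
           (p.pte & (PTE_READABLE | PTE_EXECUTABLE));
}

/*
 * Pointer to the next level: V = 1 with R = X = 0. W is deliberately
 * not inspected here, matching the wording of step 4.
 */
static inline bool pte_is_table(pte_t p)
{
    return (p.pte & (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)) ==
           PTE_VALID;
}
```

With these definitions an all-zero PTE is neither a mapping nor a table,
which is exactly the case that a plain !pte_is_mapping() would get wrong.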

>>> --- /dev/null
>>> +++ b/xen/arch/riscv/pt.c
>>> @@ -0,0 +1,420 @@
>>> +#include <xen/bug.h>
>>> +#include <xen/domain_page.h>
>>> +#include <xen/errno.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/mm-frame.h>
>>> +#include <xen/pmap.h>
>>> +#include <xen/spinlock.h>
>>> +
>>> +#include <asm/flushtlb.h>
>>> +#include <asm/page.h>
>>> +
>>> +static inline const mfn_t get_root_page(void)
>>> +{
>>> +    paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) <<
>>> PAGE_SHIFT;
>>> +
>>> +    return maddr_to_mfn(root_maddr);
>>> +}
>>> +
>>> +/* Sanity check of the entry. */
>>> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
>>> flags)
>>> +{
>>> +    /*
>>> +     * See the comment about the possible combination of (mfn,
>>> flags) in
>>> +     * the comment above pt_update().
>>> +     */
>>> +
>>> +    /* Sanity check when modifying an entry. */
>>> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>>> +    {
>>> +        /* We don't allow modifying an invalid entry. */
>>> +        if ( !pte_is_valid(entry) )
>>> +        {
>>> +            printk("Modifying invalid entry is not allowed.\n");
>>
>> Perhaps all of these printk()s should be dprintk()?
> It could be dprintk() but at the same time I don't see any issue if it
> will be printed once.

What guarantees that it wouldn't be logged over and over? It's simply
bad practice to accompany all error returns with log messages, even
in release builds. Even if right now you're only in the bring-up phase,
you still want to have security in mind. If any such log message ended
up reachable from a guest-invoked path, an XSA would be needed.

>>  And not have a full
>> stop?
> By "full stop," do you mean something like panic() or BUG_ON()?

No. "Full stop" is the period at the end of a sentence (which shouldn't
normally be there at the end of log messages).

>>> +        /*
>>> +         * We don't allow replacing any valid entry.
>>> +         *
>>> +         * Note that the function pt_update() relies on this
>>> +         * assumption and will skip the TLB flush (when Svvptc
>>> +         * extension will be ratified). The function will need
>>> +         * to be updated if the check is relaxed.
>>> +         */
>>> +        if ( pte_is_valid(entry) )
>>> +        {
>>> +            if ( pte_is_mapping(entry) )
>>> +                printk("Changing MFN for a valid entry is not
>>> allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
>>> +                       mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
>>> +            else
>>> +                printk("Trying to replace a table with a
>>> mapping.\n");
>>> +            return false;
>>> +        }
>>> +    }
>>> +    /* Sanity check when removing a mapping. */
>>> +    else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
>>
>> The PTE_VALID part of the check is pointless considering the earlier
>> if(). I guess you may want to have it for doc purposes ...
> Yes, it just helps to read the code and understand "confusing" if's()
> above.

Well, since you mention "confusing": I for one consider such redundant
checks confusing. It raises the question whether this check is wrong or
the earlier one is. Therefore, if you want to keep the redundancy, it
may help if you extend the comment to mention it's actually redundant
(e.g. by saying "for completeness" or some such).

>>> +#define XEN_TABLE_MAP_FAILED 0
>>> +#define XEN_TABLE_SUPER_PAGE 1
>>> +#define XEN_TABLE_NORMAL 2
>>> +
>>> +/*
>>> + * Take the currently mapped table, find the corresponding entry,
>>> + * and map the next table, if available.
>>> + *
>>> + * The alloc_tbl parameters indicates whether intermediate tables
>>> should
>>> + * be allocated when not present.
>>> + *
>>> + * Return values:
>>> + *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry
>>> + *  was empty, or allocating a new page failed.
>>> + *  XEN_TABLE_NORMAL: next level or leaf mapped normally
>>> + *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
>>> + */
>>> +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned
>>> int offset)
>>
>> Having the boolean first is unusual, but well - it's your choice.
>>
>>> +{
>>> +    pte_t *entry;
>>> +    int ret;
>>> +    mfn_t mfn;
>>> +
>>> +    entry = *table + offset;
>>> +
>>> +    if ( !pte_is_valid(*entry) )
>>> +    {
>>> +        if ( alloc_tbl )
>>> +            return XEN_TABLE_MAP_FAILED;
>>
>> Is this condition meant to be inverted?
> if alloc_tbl = true we shouldn't allocatetable as:
>      * The intermediate page table shouldn't be allocated when MFN
> isn't
>      * valid and we are not populating page table.
> ...
>     */

Well, no. The variable name really shouldn't be the opposite of what is
meant. "alloc_tbl" can only possibly mean "allocate a table if none is
there". I can't think of a sensible interpretation in the inverted sense.
I'm curious how you mean to interpret that variable name.

>     bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags &
> PTE_POPULATE);
> 
> So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this
> table shouldn't be allocated and thereby pt_next_level() should return
> XEN_TABLE_MAP_FAILED.
> 
> Or to invert if ( alloc_tbl )it will be needed to invert defintion of
> alloc_tbl:
>  bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE);

Yes, as I did comment further down.
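
A minimal sketch of the non-inverted polarity, so that the name matches
the meaning "allocate a table if none is there". The mfn_t wrapper,
INVALID_MFN, mfn_eq() and the PTE_POPULATE value are local stand-ins
(assumptions) to keep the snippet self-contained, and want_alloc_tbl() is
a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch compiles on its own. */
typedef struct { uint64_t mfn; } mfn_t;
#define INVALID_MFN  ((mfn_t){ UINT64_MAX })
#define PTE_VALID    (1U << 0)
#define PTE_POPULATE (1U << 11)   /* software-only flag; value assumed */

static inline bool mfn_eq(mfn_t a, mfn_t b)
{
    return a.mfn == b.mfn;
}

/*
 * True when a missing intermediate table may be allocated: either a
 * new mapping is being inserted (mfn != INVALID_MFN) or the caller
 * only asked for the tables to be populated.
 */
static inline bool want_alloc_tbl(mfn_t mfn, unsigned int flags)
{
    return !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE);
}
```

With this polarity the walker would bail out with XEN_TABLE_MAP_FAILED
on "!alloc_tbl" when it finds an empty entry, i.e. for the modify and
remove cases.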

>>> +    if ( level != target )
>>> +    {
>>> +        printk("%s: Shattering superpage is not supported\n",
>>> __func__);
>>> +        rc = -EOPNOTSUPP;
>>> +        goto out;
>>> +    }
>>> +
>>> +    entry = table + offsets[level];
>>> +
>>> +    rc = -EINVAL;
>>> +    if ( !pt_check_entry(*entry, mfn, flags) )
>>> +        goto out;
>>> +
>>> +    /* We are removing the page */
>>> +    if ( !(flags & PTE_VALID) )
>>> +        memset(&pte, 0x00, sizeof(pte));
>>> +    else
>>> +    {
>>> +        /* We are inserting a mapping => Create new pte. */
>>> +        if ( !mfn_eq(mfn, INVALID_MFN) )
>>> +            pte = pte_from_mfn(mfn, PTE_VALID);
>>> +        else /* We are updating the permission => Copy the current
>>> pte. */
>>> +            pte = *entry;
>>> +
>>> +        /* update permission according to the flags */
>>> +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
>>
>> When updating an entry, don't you also need to clear (some of) the
>> flags?
> I am not sure why some flags should be cleared. Here we are taking only
> necessary for pte flags such as R, W, X or other bits in flags are
> ignored.

Consider what happens to a PTE with R and X set when a request comes in
to change to R/W. You'll end up with R, X, and W all set if you don't
first clear the bits that are meant to be changeable in a "modify"
operation.
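
A minimal sketch of the clear-then-set pattern, assuming the RISC-V bit
positions for V/R/W/X/A/D. PTE_RWX and pte_set_perms() are hypothetical
names used only for illustration, and pte_t is redefined locally.

```c
#include <stdint.h>

/* Local stand-ins (assumptions), using the RISC-V PTE bit positions. */
#define PTE_VALID      (1U << 0)
#define PTE_READABLE   (1U << 1)
#define PTE_WRITABLE   (1U << 2)
#define PTE_EXECUTABLE (1U << 3)
#define PTE_ACCESSED   (1U << 6)
#define PTE_DIRTY      (1U << 7)
#define PTE_RWX        (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)

typedef struct { uint64_t pte; } pte_t;

/* Replace the permission bits of pte with those requested in flags. */
static inline pte_t pte_set_perms(pte_t pte, unsigned int flags)
{
    /* Drop the old permissions first, so stale bits cannot survive. */
    pte.pte &= ~(uint64_t)PTE_RWX;
    pte.pte |= (flags & PTE_RWX) | PTE_ACCESSED | PTE_DIRTY;

    return pte;
}
```

Starting from an R+X entry and requesting R+W, the result carries R and
W only; without the masking step X would remain set as well.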

>>> +/* Return the level where mapping should be done */
>>> +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned
>>> long nr,
>>> +                            unsigned int flags)
>>> +{
>>> +    unsigned int level = 0;
>>> +    unsigned long mask;
>>> +    unsigned int i;
>>> +
>>> +    /* Use blocking mapping unless the caller requests 4K mapping
>>> */
>>> +    if ( unlikely(flags & PTE_SMALL) )
>>> +        return level;
>>> +
>>> +    /*
>>> +     * Don't take into account the MFN when removing mapping (i.e
>>> +     * MFN_INVALID) to calculate the correct target order.
>>> +     *
>>> +     * `vfn` and `mfn` must be both superpage aligned.
>>> +     * They are or-ed together and then checked against the size
>>> of
>>> +     * each level.
>>> +     *
>>> +     * `left` is not included and checked separately to allow
>>> +     * superpage mapping even if it is not properly aligned (the
>>> +     * user may have asked to map 2MB + 4k).
>>
>> What is this about? There's nothing named "left" here.
> It refer to "remaining" pages or "leftover" space after trying to align
> a mapping to a superpage boundary.

What is the quoted "left" here? Such a variable appears to exist in
the caller, but using the name here is lacking context.
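
For reference, a minimal sketch of the level selection with the count
made an explicit part of the check, which is where the caller's
remaining page count ("left") would be passed in as nr. Assumptions: an
Sv39-like layout with 9 bits per level and root level 2, and raw
integers in place of mfn_t; mapping_level() and INVALID_MFN_RAW are
hypothetical names.

```c
#include <stdint.h>

/* Assumptions: 9 bits per level, root level 2 (Sv39-like). */
#define PAGETABLE_ORDER          9
#define HYP_PT_ROOT_LEVEL        2
#define XEN_PT_LEVEL_ORDER(lvl)  ((lvl) * PAGETABLE_ORDER)
#define INVALID_MFN_RAW          UINT64_MAX   /* hypothetical sentinel */

/*
 * Pick the largest level at which both frame numbers are aligned and
 * at least one full block of that level's size is still to be mapped.
 */
static unsigned int mapping_level(uint64_t vfn, uint64_t mfn, uint64_t nr)
{
    /* The MFN is ignored when it is the sentinel (e.g. on removal). */
    uint64_t mask = (mfn != INVALID_MFN_RAW) ? mfn : 0;
    unsigned int i;

    mask |= vfn;

    for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- )
    {
        uint64_t pages = UINT64_C(1) << XEN_PT_LEVEL_ORDER(i);

        if ( !(mask & (pages - 1)) && nr >= pages )
            return i;
    }

    return 0;
}
```

For a "2MB + 4k" request (nr = 513) with aligned frame numbers this
yields level 1 for the first 512 pages; after the caller advances by
512 pages, the remaining single page maps at level 0.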

>>> +/*
>>> + * If `mfn` equals `INVALID_MFN`, it indicates that the following
>>> page table
>>> + * update operation might be related to either populating the
>>> table (
>>> + * PTE_POPULATE will be set additionaly), destroying a mapping, or
>>> modifying
>>> + * an existing mapping.
>>
>> And the latter two are distinguished by? PTE_VALID?
> inserting -> (PTE_VALID=1 + (mfn=something valid))
> destroying-> ( PTE_VALID=0 )

Which then needs saying in the comment.
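
One way to spell the combinations out, as a minimal sketch mirroring the
order of the checks in pt_check_entry(). The enum, pt_classify(), and the
local mfn_t / flag definitions are hypothetical stand-ins, not part of
the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch compiles on its own. */
typedef struct { uint64_t mfn; } mfn_t;
#define INVALID_MFN  ((mfn_t){ UINT64_MAX })
#define PTE_VALID    (1U << 0)
#define PTE_POPULATE (1U << 11)   /* software-only flag; value assumed */

static inline bool mfn_eq(mfn_t a, mfn_t b)
{
    return a.mfn == b.mfn;
}

enum pt_op {
    PT_OP_MODIFY,    /* PTE_VALID set,   mfn == INVALID_MFN */
    PT_OP_INSERT,    /* PTE_VALID set,   mfn != INVALID_MFN */
    PT_OP_REMOVE,    /* PTE_VALID clear, PTE_POPULATE clear */
    PT_OP_POPULATE,  /* PTE_VALID clear, PTE_POPULATE set   */
};

/* Derive the requested operation from the (mfn, flags) pair. */
static enum pt_op pt_classify(mfn_t mfn, unsigned int flags)
{
    if ( flags & PTE_VALID )
        return mfn_eq(mfn, INVALID_MFN) ? PT_OP_MODIFY : PT_OP_INSERT;

    return (flags & PTE_POPULATE) ? PT_OP_POPULATE : PT_OP_REMOVE;
}
```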

>>> + * If `mfn` is valid and flags has PTE_VALID bit set then it means
>>> that
>>> + * inserting will be done.
>>> + */
>>
>> What about mfn != INVALID_MFN and PTE_VALID clear?
> PTE_VALID=0 will be always considered as destroying and no matter what
> is mfn value as in this case the removing is done in the way where mfn
> isn't used:

Right, yet elsewhere you're restrictive as to MFN values valid to use.
Not requiring INVALID_MFN here looks inconsistent to me.

>         memset(&pte, 0x00, sizeof(pte));

Just to mention it: I don't think memset() is a very good way of clearing
a PTE, even if right here it's not a live one.

>>  Also note that "`mfn` is
>> valid" isn't the same as "mfn != INVALID_MFN". You want to be precise
>> here,
>> to avoid confusion later on. (I say that knowing that we're still
>> fighting
>> especially shadow paging code on x86 not having those properly
>> separated.)
> If it is needed to be precise and mfn is valid isn't the same as "mfn
> != INVALID_MFN" only for the case of shadow paging?

No, I used shadow paging only as an example of where we have similar
issues. I'd like to avoid that a new port starts out with introducing
more instances of that. You want to properly separate INVALID_MFN from
"invalid MFN", where the latter means any MFN where either nothing
exists at all, or (see mfn_valid()) where no struct page_info exists.

Jan



* Re: [PATCH v5 1/7] xen/riscv: use {read,write}{b,w,l,q}_cpu() to define {read,write}_atomic()
  2024-08-28  9:42       ` Jan Beulich
@ 2024-08-29  8:52         ` oleksii.kurochko
  0 siblings, 0 replies; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-29  8:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Wed, 2024-08-28 at 11:42 +0200, Jan Beulich wrote:
> On 28.08.2024 11:21, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-27 at 12:06 +0200, Jan Beulich wrote:
> > > On 21.08.2024 18:06, Oleksii Kurochko wrote:
> > > > In Xen, memory-ordered atomic operations are not necessary,
> > > 
> > > This is an interesting statement.
> > I looked at the definition of build_atomic_{write,read}() for other
> > architectures and didn't find any additional memory-ordered
> > primitives
> > such as fences.
> > 
> > > I'd like to suggest that you at least
> > > limit it to the two constructs in question, rather than stating
> > > this
> > > globally for everything.
> > I am not sure that I understand what is "the two constructs". Could
> > you
> > please clarify?
> 
> {read,write}_atomic() (the statement in your description is, after
> all,
> not obviously limited to just those two, yet I understand you mean to
> say what you say only for them)
Yeah, I re-read the commit message after your reply and now I can see
that it is not really clear.

> 
> > > > based on {read,write}_atomic() implementations for other
> > > > architectures.
> > > > Therefore, {read,write}{b,w,l,q}_cpu() can be used instead of
> > > > {read,write}{b,w,l,q}(), allowing the caller to decide if
> > > > additional
> > > > fences should be applied before or after {read,write}_atomic().
> > > > 
> > > > Change the declaration of _write_atomic() to accept a 'volatile
> > > > void *'
> > > > type for the 'x' argument instead of 'unsigned long'.
> > > > This prevents compilation errors such as:
> > > > 1."discards 'volatile' qualifier from pointer target type,"
> > > > which
> > > > occurs
> > > >   due to the initialization of a volatile pointer,
> > > >   e.g., `volatile uint8_t *ptr = p;` in _add_sized().
> > > 
> > > I don't follow you here.
> > This issue started occurring after the change mentioned in point 2
> > below.
> > 
> > I initially provided an incorrect explanation for the compilation
> > error
> > mentioned above. Let me correct that now and update the commit
> > message
> > in the next patch version. The reason for this error is that after
> > the
> > _write_atomic() prototype was updated from _write_atomic(...,
> > unsigned
> > long, ...) to _write_atomic(..., void *x, ...), the write_atomic()
> > macro contains x_, which is of type 'volatile uintX_t' because ptr
> > has
> > the type 'volatile uintX_t *'.
> 
> While there's no "ptr" in write_atomic(), I think I see what you
> mean. Yet
> at the same time Arm - having a similar construct - gets away without
> volatile. Perhaps this wants modelling after read_atomic() then,
> using a
> union?
The use of a union could be considered as a solution. For now, I think
I will just update write_pte() to avoid this issue and minimize
changes in this patch.

> 
> > > > 2."incompatible type for argument 2 of '_write_atomic'," which
> > > > can
> > > > occur
> > > >   when calling write_pte(), where 'x' is of type pte_t rather
> > > > than
> > > >   unsigned long.
> > > 
> > > How's this related to the change at hand? That isn't different
> > > ahead
> > > of
> > > this change, is it?
> > This is not directly related to the current change, which is why I
> > decided to add a sentence about write_pte().
> > 
> > Since write_pte(pte_t *p, pte_t pte) uses write_atomic(), and the
> > argument types are pte_t * and pte respectively, we encounter a
> > compilation issue in write_atomic() because:
> > 
> > _write_atomic() expects the second argument to be of type unsigned
> > long, leading to a compilation error like "incompatible type for
> > argument 2 of '_write_atomic'."
> > I considered defining write_pte() as write_atomic(p, pte.pte), but
> > this
> > would fail at 'typeof(*(p)) x_ = (x);' and result in a compilation
> > error 'invalid initializer' or something like that.
> > 
> > It might be better to update write_pte() to:
> >    /* Write a pagetable entry. */
> >    static inline void write_pte(pte_t *p, pte_t pte)
> >    {
> >        write_atomic((unsigned long *)p, pte.pte);
> >    }
> > Then, we wouldn't need to modify the definition of write_atomic()
> > or
> > change the type of the second argument of _write_atomic().
> > Would it be better?
> 
> As said numerous times before: Whenever you can get away without a
> cast,
> you should avoid the cast. Here:
> 
> static inline void write_pte(pte_t *p, pte_t pte)
> {
>     write_atomic(&p->pte, pte.pte);
> }
> 
> That's one of the possible options, yes. Yet, like Arm has it, you
> may
> actually want the capability to read/write non-scalar types. If so,
> adjustments to write_atomic() are necessary, yet as indicated before:
> Please keep such entirely independent changes separate.
I quickly checked that there is only one instance where write_atomic()
is used for a scalar type in the Arm code. I think it would be better
to update RISC-V's write_pte() and not modify write_atomic(), at least
for now.
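
A minimal sketch of the cast-free write_pte() discussed above. The
write_atomic() here is only a scalar-only stand-in built on compiler
builtins (an assumption, not the real Xen macro), and pte_t is reduced
to a struct wrapping one unsigned long:

```c
/* Stand-in for Xen's pte_t: a struct wrapping a single scalar. */
typedef struct { unsigned long pte; } pte_t;

/*
 * Hypothetical stand-in for write_atomic(). Like the construct quoted
 * above, it declares a temporary of the pointee type, so it accepts
 * scalar values only.
 */
#define write_atomic(p, x) do {                      \
    __typeof__(*(p)) x_ = (x);                       \
    __atomic_store_n((p), x_, __ATOMIC_RELAXED);     \
} while ( 0 )

/* Write a pagetable entry: pass the scalar member, so no cast is needed. */
static inline void write_pte(pte_t *p, pte_t pte)
{
    write_atomic(&p->pte, pte.pte);
}
```

Passing &p->pte keeps both macro arguments scalar, so neither
write_atomic() nor _write_atomic() has to change.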

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-29  7:01       ` Jan Beulich
@ 2024-08-29 12:04         ` oleksii.kurochko
  2024-08-29 12:14           ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-29 12:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Thu, 2024-08-29 at 09:01 +0200, Jan Beulich wrote:
> On 28.08.2024 18:11, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote:
> > > On 21.08.2024 18:06, Oleksii Kurochko wrote:
> > > > Implement map_pages_to_xen() which requires several
> > > > functions to manage page tables and entries:
> > > > - pt_update()
> > > > - pt_mapping_level()
> > > > - pt_update_entry()
> > > > - pt_next_level()
> > > > - pt_check_entry()
> > > > 
> > > > To support these operations, add functions for creating,
> > > > mapping, and unmapping Xen tables:
> > > > - create_table()
> > > > - map_table()
> > > > - unmap_table()
> > > > 
> > > > Introduce internal macros starting with PTE_* for convenience.
> > > > These macros closely resemble PTE bits, with the exception of
> > > > PTE_SMALL, which indicates that 4KB is needed.
> > > 
> > > What macros are you talking about here? Is this partially stale,
> > > as
> > > only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being
> > > added?
> > I am speaking about macros connected to masks:
> >      #define PTE_R_MASK(x)   ((x) & PTE_READABLE)
> >      #define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
> >      #define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
> >    
> >      #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
> >    PTE_EXECUTABLE))
> 
> Some of which I did question further down in my reply. But what's
> worse - by saying "closely resemble PTE bits, with the exception of
> PTE_SMALL" you pretty clearly _do not_ refer to the macros above, but
> to PTE_VALID etc.
Agree, it should be corrected.

> 
> > > > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
> > > >      return p.pte & PTE_VALID;
> > > >  }
> > > >  
> > > > +inline bool pte_is_table(const pte_t p)
> > > > +{
> > > > +    return ((p.pte & (PTE_VALID |
> > > > +                      PTE_READABLE |
> > > > +                      PTE_WRITABLE |
> > > > +                      PTE_EXECUTABLE)) == PTE_VALID);
> > > > +}
> > > 
> > > In how far is the READABLE check valid here? You (imo correctly)
> > > ...
> 
> Oh, I wrongly picked on READABLE when it should have been the
> WRITABLE
> bit.
> 
> > > > +static inline bool pte_is_mapping(const pte_t p)
> > > > +{
> > > > +    return (p.pte & PTE_VALID) &&
> > > > +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
> > > > +}
> > > 
> > > ... don't consider this bit here.
> > pte_is_mapping() seems to me is correct as according to the RISC-V
> > privileged spec:
> >    4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
> > step 
> >    5. Otherwise, this PTE is a pointer to the next level of the
> > page   
> >    table.
> >    5. A leaf PTE has been found. ...
> 
> Right. And then why do you check all three of r, x, and w, when the
> doc
> mentions only r and x? There may be reasons, but such reasons then
> need
> clearly stating in a code comment, for people to understand why the
> code
> is not following the spec.
So I remembered why R, W, and X are checked. There is contradictory
information about these bits
(https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1317C64-L1321C10
):
```
The permission bits, R, W, and X, indicate whether the page is
readable, writable, and executable, respectively. When all three are
zero, the PTE is a pointer to the next level of the page table;
otherwise, it is a leaf PTE.
```

However, it is also written here
(https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1539
) that only pte.r and pte.x should be checked.

I can assume that the interpretation that R=W=X=0 indicates a pointer
to the next level of the page table could come from this statement
(https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1538
):
```
If _pte_._v_ = 0, or if _pte_._r_ = 0 and _pte_._w_ = 1, or if any bits
or encodings that are reserved for future standard use are set within
_pte_, stop and raise a page-fault exception corresponding to the
original access type.
```
From this, I can assume that when pte.r = 0, pte.w should also always
be zero; otherwise, a page-fault exception will be raised (though it is
not obviously connected to whether the PTE is a pointer to the next page
table or not...).




> 
> > and regarding pte_is_table() READABLE check is valid as we have to
> > check only that pte.r = pte.x = 0. WRITABLE check should be
> > dropped. Or
> > just use define pte_is_table() as:
> >    inline bool pte_is_table(const pte_t p)
> >    {
> >    	return !pte_is_mapping(p);
> >    }
> 
> You had it like this earlier on, didn't you? That's wrong, because
> for a
> PTE to describe another page table level PTE_VALID needs to be set.
Agree, it's wrong, missed that.

> > > > +#define XEN_TABLE_MAP_FAILED 0
> > > > +#define XEN_TABLE_SUPER_PAGE 1
> > > > +#define XEN_TABLE_NORMAL 2
> > > > +
> > > > +/*
> > > > + * Take the currently mapped table, find the corresponding
> > > > entry,
> > > > + * and map the next table, if available.
> > > > + *
> > > > + * The alloc_tbl parameters indicates whether intermediate
> > > > tables
> > > > should
> > > > + * be allocated when not present.
> > > > + *
> > > > + * Return values:
> > > > + *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the
> > > > entry
> > > > + *  was empty, or allocating a new page failed.
> > > > + *  XEN_TABLE_NORMAL: next level or leaf mapped normally
> > > > + *  XEN_TABLE_SUPER_PAGE: The next entry points to a
> > > > superpage.
> > > > + */
> > > > +static int pt_next_level(bool alloc_tbl, pte_t **table,
> > > > unsigned
> > > > int offset)
> > > 
> > > Having the boolean first is unusual, but well - it's your choice.
> > > 
> > > > +{
> > > > +    pte_t *entry;
> > > > +    int ret;
> > > > +    mfn_t mfn;
> > > > +
> > > > +    entry = *table + offset;
> > > > +
> > > > +    if ( !pte_is_valid(*entry) )
> > > > +    {
> > > > +        if ( alloc_tbl )
> > > > +            return XEN_TABLE_MAP_FAILED;
> > > 
> > > Is this condition meant to be inverted?
> > if alloc_tbl = true we shouldn't allocate a table as:
> >      * The intermediate page table shouldn't be allocated when MFN
> > isn't
> >      * valid and we are not populating page table.
> > ...
> >     */
> 
> Well, no. The variable name really shouldn't be the opposite of what
> is
> meant. "alloc_tbl" can only possibly mean "allocate a table if none
> is
> there". I can't think of a sensible interpretation in the inverted
> sense.
> I'm curious how you mean to interpret that variable name.
My interpretation was that alloc_tbl = true means that the algorithm is
trying to allocate the table, which is forbidden at the moment, but I
agree that your interpretation sounds more understandable based on the
variable name.
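
A small sketch of the non-inverted meaning, i.e. "allocate a table if
none is there". The types and the PTE_POPULATE bit position are
stand-ins (assumptions), and want_alloc_tbl() is a hypothetical helper
name:

```c
#include <stdbool.h>

/* Stand-ins for Xen's types; the PTE_POPULATE bit value is an assumption. */
typedef struct { unsigned long mfn; } mfn_t;
#define INVALID_MFN  ((mfn_t){ ~0UL })
#define PTE_POPULATE (1U << 11)

static inline bool mfn_eq(mfn_t a, mfn_t b)
{
    return a.mfn == b.mfn;
}

/*
 * Hypothetical helper: intermediate tables may be allocated when a real
 * MFN is being mapped or when the page tables are being populated.
 */
static inline bool want_alloc_tbl(mfn_t mfn, unsigned int flags)
{
    return !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE);
}
```

With this definition the check in pt_next_level() would become
"if ( !alloc_tbl ) return XEN_TABLE_MAP_FAILED;".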

> 
> >     bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags &
> > PTE_POPULATE);
> > 
> > So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this
> > table shouldn't be allocated and thereby pt_next_level() should
> > return
> > XEN_TABLE_MAP_FAILED.
> > 
> > Or to invert if ( alloc_tbl ) it will be needed to invert definition
> > of
> > alloc_tbl:
> >  bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags &
> > PTE_POPULATE);
> 
> Yes, as I did comment further down.
> 
> > > > +    if ( level != target )
> > > > +    {
> > > > +        printk("%s: Shattering superpage is not supported\n",
> > > > __func__);
> > > > +        rc = -EOPNOTSUPP;
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    entry = table + offsets[level];
> > > > +
> > > > +    rc = -EINVAL;
> > > > +    if ( !pt_check_entry(*entry, mfn, flags) )
> > > > +        goto out;
> > > > +
> > > > +    /* We are removing the page */
> > > > +    if ( !(flags & PTE_VALID) )
> > > > +        memset(&pte, 0x00, sizeof(pte));
> > > > +    else
> > > > +    {
> > > > +        /* We are inserting a mapping => Create new pte. */
> > > > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > > > +            pte = pte_from_mfn(mfn, PTE_VALID);
> > > > +        else /* We are updating the permission => Copy the
> > > > current
> > > > pte. */
> > > > +            pte = *entry;
> > > > +
> > > > +        /* update permission according to the flags */
> > > > +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED |
> > > > PTE_DIRTY;
> > > 
> > > When updating an entry, don't you also need to clear (some of)
> > > the
> > > flags?
> > I am not sure why some flags should be cleared. Here we are taking
> > only
> > necessary for pte flags such as R, W, X or other bits in flags are
> > ignored.
> 
> Consider what happens to a PTE with R and X set when a request comes
> in
> to change to R/W. You'll end up with R, X, and W all set if you don't
> first clear the bits that are meant to be changeable in a "modify"
> operation.
That's definitely going to be a problem. I'll update the code then.
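
A possible shape of the fix, as a sketch only: clear the changeable
permission bits first and then apply the requested ones. The bit
positions follow the RISC-V PTE layout; pte_set_perms() is a
hypothetical helper name, not something from the patch:

```c
#define PTE_VALID       (1UL << 0)
#define PTE_READABLE    (1UL << 1)
#define PTE_WRITABLE    (1UL << 2)
#define PTE_EXECUTABLE  (1UL << 3)

#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE))

/* Hypothetical helper: replace the R/W/X bits of a PTE value. */
static inline unsigned long pte_set_perms(unsigned long pte,
                                          unsigned long flags)
{
    /* Drop the old permissions so that stale bits cannot survive. */
    pte &= ~(PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE);
    /* Apply only the permission bits taken from the request. */
    pte |= PTE_RWX_MASK(flags);

    return pte;
}
```

Changing an R/X entry to R/W this way yields R and W only, instead of
R, W, and X all set.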

> 
> > > > +/* Return the level where mapping should be done */
> > > > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn,
> > > > unsigned
> > > > long nr,
> > > > +                            unsigned int flags)
> > > > +{
> > > > +    unsigned int level = 0;
> > > > +    unsigned long mask;
> > > > +    unsigned int i;
> > > > +
> > > > +    /* Use blocking mapping unless the caller requests 4K
> > > > mapping
> > > > */
> > > > +    if ( unlikely(flags & PTE_SMALL) )
> > > > +        return level;
> > > > +
> > > > +    /*
> > > > +     * Don't take into account the MFN when removing mapping
> > > > (i.e
> > > > +     * MFN_INVALID) to calculate the correct target order.
> > > > +     *
> > > > +     * `vfn` and `mfn` must be both superpage aligned.
> > > > +     * They are or-ed together and then checked against the
> > > > size
> > > > of
> > > > +     * each level.
> > > > +     *
> > > > +     * `left` is not included and checked separately to allow
> > > > +     * superpage mapping even if it is not properly aligned
> > > > (the
> > > > +     * user may have asked to map 2MB + 4k).
> > > 
> > > What is this about? There's nothing named "left" here.
> > It refer to "remaining" pages or "leftover" space after trying to
> > align
> > a mapping to a superpage boundary.
> 
> What is the quoted "left" here? Such a variable appears to exist
> in
> the caller, but using the name here is lacking context.
Then I will update the comment to say where 'left' comes from.

> 
> 
> > > > + * If `mfn` is valid and flags has PTE_VALID bit set then it
> > > > means
> > > > that
> > > > + * inserting will be done.
> > > > + */
> > > 
> > > What about mfn != INVALID_MFN and PTE_VALID clear?
> > PTE_VALID=0 will be always considered as destroying and no matter
> > what
> > is mfn value as in this case the removing is done in the way where
> > mfn
> > isn't used:
> 
> Right, yet elsewhere you're restrictive as to MFN values valid to
> use.
> Not requiring INVALID_MFN here looks inconsistent to me.
But actually, if we leave the ASSERT in pt_check_entry(), we will be
sure that we are here with mfn = INVALID_MFN:
       /* Sanity check when removing a mapping. */
       else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
       {
           /* We should be here with an invalid MFN. */
           ASSERT(mfn_eq(mfn, INVALID_MFN));
> 
> >         memset(&pte, 0x00, sizeof(pte));
> 
> Just to mention it: I don't think memset() is a very good way of
> clearing
> a PTE, even if right here it's not a live one.
Would a direct assignment be better?

> 
> > >  Also note that "`mfn` is
> > > valid" isn't the same as "mfn != INVALID_MFN". You want to be
> > > precise
> > > here,
> > > to avoid confusion later on. (I say that knowing that we're still
> > > fighting
> > > especially shadow paging code on x86 not having those properly
> > > separated.)
> > If it is needed to be precise and mfn is valid isn't the same as
> > "mfn
> > != INVALID_MFN" only for the case of shadow paging?
> 
> No, I used shadow paging only as an example of where we have similar
> issues. I'd like to avoid that a new port starts out with introducing
> more instances of that. You want to properly separate INVALID_MFN
> from
> "invalid MFN", where the latter means any MFN where either nothing
> exists at all, or (see mfn_valid()) where no struct page_info exists.
Well, now I think I understand the difference between "INVALID_MFN" and
"invalid MFN."

Referring back to your original reply, I need to update the comment
above pt_update():
```
   ...
     * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set then it
   means that inserting will be done.
```
Would this be correct and more precise?

Based on the code for mfn_valid(), the separation is currently done
using the max_page value, which can't be initialized at the moment as
it requires reading the device tree file to obtain the RAM end.

We could use a placeholder for the RAM end (for example, a very high
value like -1UL) and then add __mfn_valid() within pt_update().
However, I'm not sure if this approach aligns with what you consider a
proper separation between INVALID_MFN and "invalid MFN."

~ Oleksii


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-29 12:04         ` oleksii.kurochko
@ 2024-08-29 12:14           ` Jan Beulich
  2024-08-29 14:42             ` oleksii.kurochko
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2024-08-29 12:14 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 29.08.2024 14:04, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-08-29 at 09:01 +0200, Jan Beulich wrote:
>> On 28.08.2024 18:11, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote:
>>>> On 21.08.2024 18:06, Oleksii Kurochko wrote:
>>>>> @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
>>>>>      return p.pte & PTE_VALID;
>>>>>  }
>>>>>  
>>>>> +inline bool pte_is_table(const pte_t p)
>>>>> +{
>>>>> +    return ((p.pte & (PTE_VALID |
>>>>> +                      PTE_READABLE |
>>>>> +                      PTE_WRITABLE |
>>>>> +                      PTE_EXECUTABLE)) == PTE_VALID);
>>>>> +}
>>>>
>>>> In how far is the READABLE check valid here? You (imo correctly)
>>>> ...
>>
>> Oh, I wrongly picked on READABLE when it should have been the
>> WRITABLE
>> bit.
>>
>>>>> +static inline bool pte_is_mapping(const pte_t p)
>>>>> +{
>>>>> +    return (p.pte & PTE_VALID) &&
>>>>> +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
>>>>> +}
>>>>
>>>> ... don't consider this bit here.
>>> pte_is_mapping() seems to me is correct as according to the RISC-V
>>> privileged spec:
>>>    4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
>>> step 
>>>    5. Otherwise, this PTE is a pointer to the next level of the
>>> page   
>>>    table.
>>>    5. A leaf PTE has been found. ...
>>
>> Right. And then why do you check all three of r, x, and w, when the
>> doc
>> mentions only r and x? There may be reasons, but such reasons then
>> need
>> clearly stating in a code comment, for people to understand why the
>> code
>> is not following the spec.
> So I remembered why R, W, and X are checked. There is contradictory
> information about these bits
> (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1317C64-L1321C10
> ):
> ```
> The permission bits, R, W, and X, indicate whether the page is
> readable, writable, and executable, respectively. When all three are
> zero, the PTE is a pointer to the next level of the page table;
> otherwise, it is a leaf PTE.
> ```
> 
> However, it is also written here
> (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1539
> ) that only pte.r and pte.x should be checked.
> 
> I can assume that the interpretation that R=W=X=0 indicates a pointer
> to the next level of the page table could come from this statement
> (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1538
> ):
> ```
> If _pte_._v_ = 0, or if _pte_._r_ = 0 and _pte_._w_ = 1, or if any bits
> or encodings that are reserved for future standard use are set within
> _pte_, stop and raise a page-fault exception corresponding to the
> original access type.
> ```
> From this, I can assume that when pte.r = 0, pte.w should also always
> be zero; otherwise, a page-fault exception will be raised. ( but it is
> no obviously connected to if the PTE is a pointer to the next page
> table or not... ).

I don't view the information provided as contradictory, especially when
further taking the "Encoding of PTE R/W/X fields" table into account: W
set but the other two clear is "Reserved for future use."
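
To illustrate how the two statements fit together, here is a sketch of
predicates following the R/W/X encoding table. Bit positions are those
of the RISC-V PTE layout; pte_is_reserved() is a hypothetical name used
only for illustration:

```c
#include <stdbool.h>

#define PTE_VALID      (1UL << 0)
#define PTE_READABLE   (1UL << 1)
#define PTE_WRITABLE   (1UL << 2)
#define PTE_EXECUTABLE (1UL << 3)
#define PTE_RWX        (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)

typedef struct { unsigned long pte; } pte_t;

/* Valid with R = W = X = 0: pointer to the next page table level. */
static inline bool pte_is_table(pte_t p)
{
    return (p.pte & (PTE_VALID | PTE_RWX)) == PTE_VALID;
}

/* W set while R is clear: an encoding reserved for future use. */
static inline bool pte_is_reserved(pte_t p)
{
    return (p.pte & (PTE_READABLE | PTE_WRITABLE)) == PTE_WRITABLE;
}

/* Valid with R or X set: a leaf PTE, as in step 4 of the walk. */
static inline bool pte_is_mapping(pte_t p)
{
    return (p.pte & PTE_VALID) &&
           (p.pte & (PTE_READABLE | PTE_EXECUTABLE));
}
```

Since W without R is reserved, "R = W = X = 0" and "R = 0 and X = 0"
describe the same set of non-reserved valid PTEs.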

>>>>> + * If `mfn` is valid and flags has PTE_VALID bit set then it
>>>>> means
>>>>> that
>>>>> + * inserting will be done.
>>>>> + */
>>>>
>>>> What about mfn != INVALID_MFN and PTE_VALID clear?
>>> PTE_VALID=0 will be always considered as destroying and no matter
>>> what
>>> is mfn value as in this case the removing is done in the way where
>>> mfn
>>> isn't used:
>>
>> Right, yet elsewhere you're restrictive as to MFN values valid to
>> use.
>> Not requiring INVALID_MFN here looks inconsistent to me.
> but actually if we will leave ASSERT in pt_check_entry() we will be
> sure that we are here with mfn = INVALID_MFN:
>        /* Sanity check when removing a mapping. */
>        else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
>        {
>            /* We should be here with an invalid MFN. */
>            ASSERT(mfn_eq(mfn, INVALID_MFN));

Having such an assertion there is fine, but doesn't save you from getting
comments correct / complete.

>>>         memset(&pte, 0x00, sizeof(pte));
>>
>> Just to mention it: I don't think memset() is a very good way of
>> clearing
>> a PTE, even if right here it's not a live one.
> Just direct assigning would be better? 

Imo yes.

>>>>  Also note that "`mfn` is
>>>> valid" isn't the same as "mfn != INVALID_MFN". You want to be
>>>> precise
>>>> here,
>>>> to avoid confusion later on. (I say that knowing that we're still
>>>> fighting
>>>> especially shadow paging code on x86 not having those properly
>>>> separated.)
>>> If it is needed to be precise and mfn is valid isn't the same as
>>> "mfn
>>> != INVALID_MFN" only for the case of shadow paging?
>>
>> No, I used shadow paging only as an example of where we have similar
>> issues. I'd like to avoid that a new port starts out with introducing
>> more instances of that. You want to properly separate INVALID_MFN
>> from
>> "invalid MFN", where the latter means any MFN where either nothing
>> exists at all, or (see mfn_valid()) where no struct page_info exists.
> Well, now I think I understand the difference between "INVALID_MFN" and
> "invalid MFN."
> 
> Referring back to your original reply, I need to update the comment
> above pt_update():
> ```
>    ...
>      * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set then it
>    means that inserting will be done.
> ```
> Would this be correct and more precise?

That depends on whether it correctly describes what the code does. If
the code continues to check against INVALID_MFN, such a description
wouldn't be correct. Also, just to re-iterate, ...

> Based on the code for mfn_valid(), the separation is currently done
> using the max_page value, which can't be initialized at the moment as
> it requires reading the device tree file to obtain the RAM end.

... mfn_valid() may return false for MMIO pages, for which it may still
be legitimate to create mappings. IMO ...

> We could use a placeholder for the RAM end (for example, a very high
> value like -1UL) and then add __mfn_valid() within pt_update().
> However, I'm not sure if this approach aligns with what you consider by
> proper separation between INVALID_MFN and "invalid MFN."

... throughout the code here you mean INVALID_MFN and never "invalid MFN".
Populating page tables is a lower layer than where you want to be
concerned with that distinction; the callers of these low level functions
will need to make the distinction where necessary.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-29 12:14           ` Jan Beulich
@ 2024-08-29 14:42             ` oleksii.kurochko
  2024-08-29 14:56               ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-29 14:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Thu, 2024-08-29 at 14:14 +0200, Jan Beulich wrote:
> > > > >  Also note that "`mfn` is
> > > > > valid" isn't the same as "mfn != INVALID_MFN". You want to be
> > > > > precise
> > > > > here,
> > > > > to avoid confusion later on. (I say that knowing that we're
> > > > > still
> > > > > fighting
> > > > > especially shadow paging code on x86 not having those
> > > > > properly
> > > > > separated.)
> > > > If it is needed to be precise and mfn is valid isn't the same
> > > > as
> > > > "mfn
> > > > != INVALID_MFN" only for the case of shadow paging?
> > > 
> > > No, I used shadow paging only as an example of where we have
> > > similar
> > > issues. I'd like to avoid that a new port starts out with
> > > introducing
> > > more instances of that. You want to properly separate INVALID_MFN
> > > from
> > > "invalid MFN", where the latter means any MFN where either
> > > nothing
> > > exists at all, or (see mfn_valid()) where no struct page_info
> > > exists.
> > Well, now I think I understand the difference between "INVALID_MFN"
> > and
> > "invalid MFN."
> > 
> > Referring back to your original reply, I need to update the comment
> > above pt_update():
> > ```
> >     ...
> >       * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set
> > then it
> >     means that inserting will be done.
> > ```
> > Would this be correct and more precise?
> 
> That depends on whether it correctly describes what the code does. If
> the code continues to check against INVALID_MFN, such a description
> wouldn't be correct. Also, just to re-iterate, ...
> 
> > Based on the code for mfn_valid(), the separation is currently done
> > using the max_page value, which can't be initialized at the moment
> > as
> > it requires reading the device tree file to obtain the RAM end.
> 
> ... mfn_valid() may return false for MMIO pages, for which it may
> still
> be legitimate to create mappings. IMO ...
> 
> > We could use a placeholder for the RAM end (for example, a very
> > high
> > value like -1UL) and then add __mfn_valid() within pt_update().
> > However, I'm not sure if this approach aligns with what you
> > consider by
> > proper separation between INVALID_MFN and "invalid MFN."
> 
> ... throughout the code here you mean INVALID_MFN and never "invalid
> MFN".
IIUC INVALID_MFN should mean that the mfn exists (corresponds to some
usable memory range of the memory map) but hasn't been mapped yet. Then
what I have in the comment seems correct to me:
```
   if `mfn` isn't equal to INVALID_MFN ( so it is valid/exist in terms
   that there is real memory range in memory map to which this mfn
   correspond ) and flags PTE_VALID bit set ...
```


> Populating page tables is lower a layer than where you want to be
> concerned with that distinction; the callers of these low level
> functions
> will need to make the distinction where necessary.
Then the question now is just about the proper wording for the
pt_update() argument values?

~ Oleksii



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 6/7] xen/riscv: page table handling
  2024-08-29 14:42             ` oleksii.kurochko
@ 2024-08-29 14:56               ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2024-08-29 14:56 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 29.08.2024 16:42, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-08-29 at 14:14 +0200, Jan Beulich wrote:
>>>>>>  Also note that "`mfn` is
>>>>>> valid" isn't the same as "mfn != INVALID_MFN". You want to be
>>>>>> precise
>>>>>> here,
>>>>>> to avoid confusion later on. (I say that knowing that we're
>>>>>> still
>>>>>> fighting
>>>>>> especially shadow paging code on x86 not having those
>>>>>> properly
>>>>>> separated.)
>>>>> If it is needed to be precise and mfn is valid isn't the same
>>>>> as
>>>>> "mfn
>>>>> != INVALID_MFN" only for the case of shadow paging?
>>>>
>>>> No, I used shadow paging only as an example of where we have
>>>> similar
>>>> issues. I'd like to avoid that a new port starts out with
>>>> introducing
>>>> more instances of that. You want to properly separate INVALID_MFN
>>>> from
>>>> "invalid MFN", where the latter means any MFN where either
>>>> nothing
>>>> exists at all, or (see mfn_valid()) where no struct page_info
>>>> exists.
>>> Well, now I think I understand the difference between "INVALID_MFN"
>>> and
>>> "invalid MFN."
>>>
>>> Referring back to your original reply, I need to update the comment
>>> above pt_update():
>>> ```
>>>     ...
>>>       * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set
>>> then it
>>>     means that inserting will be done.
>>> ```
>>> Would this be correct and more precise?
>>
>> That depends on whether it correctly describes what the code does. If
>> the code continues to check against INVALID_MFN, such a description
>> wouldn't be correct. Also, just to re-iterate, ...
>>
>>> Based on the code for mfn_valid(), the separation is currently done
>>> using the max_page value, which can't be initialized at the moment
>>> as
>>> it requires reading the device tree file to obtain the RAM end.
>>
>> ... mfn_valid() may return false for MMIO pages, for which it may
>> still
>> be legitimate to create mappings. IMO ...
>>
>>> We could use a placeholder for the RAM end (for example, a very
>>> high
>>> value like -1UL) and then add __mfn_valid() within pt_update().
>>> However, I'm not sure if this approach aligns with what you
>>> consider by
>>> proper separation between INVALID_MFN and "invalid MFN."
>>
>> ... throughout the code here you mean INVALID_MFN and never "invalid
>> MFN".
> IIC INVALID_MFN should mean that mfn exist ( correspond to some usable
> memory range of memory map ) but hasn't been mapped yet. Then for me
> what I have in the comment seems correct to me:
> ```
>    if `mfn` isn't equal to INVALID_MFN ( so it is valid/exist in terms
>    that there is real memory range in memory map to which this mfn
>    correspond ) and flags PTE_VALID bit set ...
> ```

Not really, no, as said ...

>> Populating page tables is lower a layer than where you want to be
>> concerned with that distinction; the callers of these low level
>> functions
>> will need to make the distinction where necessary.

... here. At this level I think you want to consider only INVALID_MFN,
and for anything else you're simply not concerned what the MFN provided
points at[1]. Specifically, as said before, it may point at an MMIO
range which may not be in the memory map (a PCI device BAR for example).
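
As a sketch of what "consider only INVALID_MFN" could look like, the
requested operation can be derived purely from the comparison against
INVALID_MFN and the flags, without asking what the MFN points at. The
enum, pt_classify(), and the bit values are hypothetical stand-ins:

```c
#include <stdbool.h>

/* Stand-ins for Xen's types and constants; bit values are assumptions. */
typedef struct { unsigned long mfn; } mfn_t;
#define INVALID_MFN  ((mfn_t){ ~0UL })
#define PTE_VALID    (1U << 0)
#define PTE_POPULATE (1U << 11)

static inline bool mfn_eq(mfn_t a, mfn_t b)
{
    return a.mfn == b.mfn;
}

/* Hypothetical names, for illustration only. */
enum pt_op { PT_OP_INSERT, PT_OP_MODIFY, PT_OP_REMOVE, PT_OP_POPULATE };

static enum pt_op pt_classify(mfn_t mfn, unsigned int flags)
{
    /* PTE_VALID set: insert for a real MFN, else update permissions. */
    if ( flags & PTE_VALID )
        return mfn_eq(mfn, INVALID_MFN) ? PT_OP_MODIFY : PT_OP_INSERT;

    /* PTE_VALID clear: populate tables if requested, else remove. */
    return (flags & PTE_POPULATE) ? PT_OP_POPULATE : PT_OP_REMOVE;
}
```

Whether the MFN refers to RAM, MMIO, or nothing at all stays the
caller's concern.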

Jan

[1] One thing that could be checked at this layer is the number of
significant MFN bits, in case there were hardware setups in which you
know that not the full width is permitted that the PTE has room for. No
idea whether such exists in the RISC-V world.
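
If such a check were wanted, it might look like the sketch below. Both
mfn_fits() and its mfn_bits parameter are hypothetical; the permitted
width would have to come from the platform (for reference, the Sv39 PTE
layout has room for a 44-bit PPN):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if 'mfn' has no bits set at or above
 * 'mfn_bits', the number of significant MFN bits permitted.
 */
static inline bool mfn_fits(uint64_t mfn, unsigned int mfn_bits)
{
    /* Shifting a 64-bit value by 64 or more is undefined in C. */
    if ( mfn_bits >= 64 )
        return true;

    return (mfn >> mfn_bits) == 0;
}
```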


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-28 10:44       ` Jan Beulich
@ 2024-08-30 11:01         ` oleksii.kurochko
  2024-08-30 12:37           ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-30 11:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Wed, 2024-08-28 at 12:44 +0200, Jan Beulich wrote:
> On 28.08.2024 11:53, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-27 at 12:29 +0200, Jan Beulich wrote:
> > > > 
> > > > +
> > > > +/*
> > > > + * Direct access to xen_fixmap[] should only happen when {set,
> > > > + * clear}_fixmap() is unusable (e.g. where we would end up to
> > > > + * recursively call the helpers).
> > > > + */
> > > > +extern pte_t xen_fixmap[];
> > > 
> > > I'm afraid I keep being irritated by the comment: What recursive
> > > use
> > > of
> > > helpers is being talked about here? I can't see anything
> > > recursive in
> > > this
> > > patch. If this starts happening with a subsequent patch, then you
> > > have
> > > two options: Move the declaration + comment there, or clarify in
> > > the
> > > description (in enough detail) what this is about.
We can't move the declaration of xen_fixmap[] to the patch where
set_fixmap() will be introduced (which uses PMAP for the domain map page
infrastructure), as xen_fixmap[] is used in the current patch.

And we can't properly describe this in terms of a function which will
only be introduced at some point in the future (which wouldn't be good
either). I came up with the following description in the comment above
the xen_fixmap[] declaration:
   /*
    * Direct access to xen_fixmap[] should only happen when {set,
    * clear}_fixmap() is unusable (e.g. where we would end up to
    * recursively call the helpers).
    *
    * One such case is pmap_map() where set_fixmap() can not be used.
    * It happens because PMAP is used when the domain map page
    * infrastructure is not yet initialized, so map_pages_to_xen()
    * called by set_fixmap() needs to map pages on demand, which then
    * calls pmap() again, resulting in a loop. Modify the PTEs directly
    * instead, in arch_pmap_map(). The same is true for pmap_unmap().
    */

Could it be an option to just drop the comment altogether for now, as
at the moment there is no such restriction on the usage of
{set,clear}_fixmap() and xen_fixmap[]?

> > This comment is added because of:
> > ```
> > void *__init pmap_map(mfn_t mfn)
> >   ...
> >        /*
> >         * We cannot use set_fixmap() here. We use PMAP when the
> > domain map
> >         * page infrastructure is not yet initialized, so
> >    map_pages_to_xen() called
> >         * by set_fixmap() needs to map pages on demand, which then
> > calls
> >    pmap()
> >         * again, resulting in a loop. Modify the PTEs directly
> > instead.
> >    The same
> >         * is true for pmap_unmap().
> >         */
> >        arch_pmap_map(slot, mfn);
> >    ...
> > ```
> > And it happens because set_fixmap() could be defined using generic
> > PT
> > helpers
> 
> As you say - could be. If I'm not mistaken no set_fixmap() implementation
> exists even by the end of the series. Fundamentally I'd expect set_fixmap()
> to (possibly) use xen_fixmap[] directly. That in turn ...
> 
> > so what will lead to recursive behaviour when there is no
> > direct map:
> 
> ... would mean no recursion afaict. Hence why clarification is needed as
> to what's going on here _and_ what's planned.
Yes, it is true. No recursion will happen in this case, but if we look
at the implementations of set_fixmap() for other architectures, Arm or
x86 ( though I am not sure that x86 uses PMAP inside map_pages_to_xen() ),
they are using map_pages_to_xen().

~ Oleksii

> 
> >    static pte_t *map_table(mfn_t mfn)
> >    {
> >        /*
> >         * During early boot, map_domain_page() may be unusable. Use the
> >         * PMAP to map temporarily a page-table.
> >         */
> >        if ( system_state == SYS_STATE_early_boot )
> >            return pmap_map(mfn);
> >        ...
> >    }
> 




* Re: [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-27 10:29   ` Jan Beulich
  2024-08-28  9:53     ` oleksii.kurochko
@ 2024-08-30 11:55     ` oleksii.kurochko
  2024-08-30 12:34       ` Jan Beulich
  1 sibling, 1 reply; 32+ messages in thread
From: oleksii.kurochko @ 2024-08-30 11:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-27 at 12:29 +0200, Jan Beulich wrote:
> > @@ -81,6 +82,18 @@ static inline void flush_page_to_ram(unsigned long mfn, bool sync_icache)
> >       BUG_ON("unimplemented");
> >   }
> >   
> > +/* Write a pagetable entry. */
> > +static inline void write_pte(pte_t *p, pte_t pte)
> > +{
> > +    write_atomic(p, pte);
> > +}
> > +
> > +/* Read a pagetable entry. */
> > +static inline pte_t read_pte(pte_t *p)
> > +{
> > +    return read_atomic(p);
> 
> This only works because of the strange type trickery you're playing in
> read_atomic(). Look at x86 code - there's a strict expectation that the
> type can be converted to/from unsigned long. And page table accessors
> are written with that taken into consideration. Same goes for write_pte()
> of course, with the respective comment on the earlier patch in mind.
> 
> Otoh I see that Arm does something very similar. If you have a strong
> need / desire to follow that, then please at least split the two
> entirely separate aspects that patch 1 presently changes both in one
> go.
I am not 100% sure that the type trick could be dropped easily for RISC-V:
1. I still need the separate C function for proper #ifdef-ing:
   #ifndef CONFIG_RISCV_32
       case 8: *(uint64_t *)res = readq_cpu(p); break;
   #endif
   
2. Because of point 1 the change should be as follows:
   -#define read_atomic(p) ({                                   \
   -    union { typeof(*(p)) val; char c[sizeof(*(p))]; } x_;   \
   -    read_atomic_size(p, x_.c, sizeof(*(p)));                \
   -    x_.val;                                                 \
   +#define read_atomic(p) ({                                 \
   +    unsigned long x_;                                     \
   +    read_atomic_size(p, &x_, sizeof(*(p)));               \
   +    (typeof(*(p)))x_;                                     \
    })
   But after that I think there will be an error, "conversion to non-scalar
   type requested", in the last line, as *p is of type pte_t.
   
   and we can't just change read_pte() to:
   static inline pte_t read_pte(pte_t *p)
   {
       return read_atomic(&p->pte);
   }
   As in this case it will return unsigned long but the function
   expects pte_t. As an option read_pte() can be updated to:
   /* Read a pagetable entry. */
   static inline pte_t read_pte(pte_t *p)
   {
       return (pte_t) { .pte = read_atomic(&p->pte) };
   }
   
   But I am not sure that it is better than just having the union trick inside
   read_atomic() and then just using read_atomic(p) for read_pte().
   
   Am I missing something?
   
~ Oleksii



* Re: [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-30 11:55     ` oleksii.kurochko
@ 2024-08-30 12:34       ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2024-08-30 12:34 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 30.08.2024 13:55, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-27 at 12:29 +0200, Jan Beulich wrote:
>>> @@ -81,6 +82,18 @@ static inline void flush_page_to_ram(unsigned long mfn, bool sync_icache)
>>>       BUG_ON("unimplemented");
>>>   }
>>>   
>>> +/* Write a pagetable entry. */
>>> +static inline void write_pte(pte_t *p, pte_t pte)
>>> +{
>>> +    write_atomic(p, pte);
>>> +}
>>> +
>>> +/* Read a pagetable entry. */
>>> +static inline pte_t read_pte(pte_t *p)
>>> +{
>>> +    return read_atomic(p);
>>
>> This only works because of the strange type trickery you're playing in
>> read_atomic(). Look at x86 code - there's a strict expectation that the
>> type can be converted to/from unsigned long. And page table accessors
>> are written with that taken into consideration. Same goes for write_pte()
>> of course, with the respective comment on the earlier patch in mind.
>>
>> Otoh I see that Arm does something very similar. If you have a strong
>> need / desire to follow that, then please at least split the two
>> entirely separate aspects that patch 1 presently changes both in one
>> go.
> I am not 100% sure that the type trick could be dropped easily for RISC-V:
> 1. I still need the separate C function for proper #ifdef-ing:
>    #ifndef CONFIG_RISCV_32
>        case 8: *(uint64_t *)res = readq_cpu(p); break;
>    #endif
>    
> 2. Because of point 1 the change should be as follows:
>    -#define read_atomic(p) ({                                   \
>    -    union { typeof(*(p)) val; char c[sizeof(*(p))]; } x_;   \
>    -    read_atomic_size(p, x_.c, sizeof(*(p)));                \
>    -    x_.val;                                                 \
>    +#define read_atomic(p) ({                                 \
>    +    unsigned long x_;                                     \
>    +    read_atomic_size(p, &x_, sizeof(*(p)));               \
>    +    (typeof(*(p)))x_;                                     \
>     })
>    But after that I think there will be an error, "conversion to non-scalar
>    type requested", in the last line, as *p is of type pte_t.
>    
>    and we can't just change read_pte() to:
>    static inline pte_t read_pte(pte_t *p)
>    {
>        return read_atomic(&p->pte);
>    }
>    As in this case it will return unsigned long but the function
>    expects pte_t.

Of course.

> As an option read_pte() can be updated to:
>    /* Read a pagetable entry. */
>    static inline pte_t read_pte(pte_t *p)
>    {
>        return (pte_t) { .pte = read_atomic(&p->pte) };
>    }

That's what's needed.

>    But I am not sure that it is better than just having the union trick inside
>    read_atomic() and then just using read_atomic(p) for read_pte().

It's largely up to you. My main request is that things end up / remain
consistent. Which way round is secondary, and often merely a matter of
suitably justifying the choice made.

Jan



* Re: [PATCH v5 2/7] xen/riscv: set up fixmap mappings
  2024-08-30 11:01         ` oleksii.kurochko
@ 2024-08-30 12:37           ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2024-08-30 12:37 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 30.08.2024 13:01, oleksii.kurochko@gmail.com wrote:
> On Wed, 2024-08-28 at 12:44 +0200, Jan Beulich wrote:
>> On 28.08.2024 11:53, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-27 at 12:29 +0200, Jan Beulich wrote:
>>>>>
>>>>> +
>>>>> +/*
>>>>> + * Direct access to xen_fixmap[] should only happen when {set,
>>>>> + * clear}_fixmap() is unusable (e.g. where we would end up to
>>>>> + * recursively call the helpers).
>>>>> + */
>>>>> +extern pte_t xen_fixmap[];
>>>>
>>>> I'm afraid I keep being irritated by the comment: What recursive use of
>>>> helpers is being talked about here? I can't see anything recursive in this
>>>> patch. If this starts happening with a subsequent patch, then you have
>>>> two options: Move the declaration + comment there, or clarify in the
>>>> description (in enough detail) what this is about.
> We can't move declaration of xen_fixmap[] to the patch where
> set_fixmap() will be introduced ( which uses PMAP for domain map page
> infrastructure ) as xen_fixmap[] is used in the current patch.
> 
> And we can't properly describe this in terms of a function which will
> only be introduced at some point in the future ( which is not good
> either ). I came up with the following description for the comment above
> the xen_fixmap[] declaration:
>    /*
>     * Direct access to xen_fixmap[] should only happen when {set,
>     * clear}_fixmap() is unusable (e.g. where we would end up to
>     * recursively call the helpers).
>     *
>     * One such case is pmap_map() where set_fixmap() can not be used.
>     * It happens because PMAP is used when the domain map page
>     * infrastructure is not yet initialized, so map_pages_to_xen()
>     * called by set_fixmap() needs to map pages on demand, which then
>     * calls pmap() again, resulting in a loop. The PTEs are modified
>     * directly in arch_pmap_map() instead.
>     * The same is true for pmap_unmap().
>     */
> 
> Could it be an option to simply drop the comment for now, as at the
> moment there is no such restriction on the usage of
> {set,clear}_fixmap() and xen_fixmap[]?

The comment isn't the right place to explain things here, imo. It's the
patch description where unexpected aspects need shedding light on.

>>> This comment is added because of:
>>> ```
>>> void *__init pmap_map(mfn_t mfn)
>>>   ...
>>>        /*
>>>         * We cannot use set_fixmap() here. We use PMAP when the domain map
>>>         * page infrastructure is not yet initialized, so map_pages_to_xen() called
>>>         * by set_fixmap() needs to map pages on demand, which then calls pmap()
>>>         * again, resulting in a loop. Modify the PTEs directly instead. The same
>>>         * is true for pmap_unmap().
>>>         */
>>>        arch_pmap_map(slot, mfn);
>>>    ...
>>> ```
>>> And it happens because set_fixmap() could be defined using generic PT
>>> helpers
>>
>> As you say - could be. If I'm not mistaken no set_fixmap() implementation
>> exists even by the end of the series. Fundamentally I'd expect set_fixmap()
>> to (possibly) use xen_fixmap[] directly. That in turn ...
>>
>>> so what will lead to recursive behaviour when there is no
>>> direct map:
>>
>> ... would mean no recursion afaict. Hence why clarification is needed as
>> to what's going on here _and_ what's planned.
> Yes, it is true. No recursion will happen in this case, but if we look
> at the implementations of set_fixmap() for other architectures, Arm or
> x86 ( though I am not sure that x86 uses PMAP inside map_pages_to_xen() ),
> they are using map_pages_to_xen().

There's no PMAP so far on x86.

Jan


