[PATCH v4 0/7] RISCV device tree mapping

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/7] RISCV device tree mapping
@ 2024-08-09 16:19 Oleksii Kurochko
  2024-08-09 16:19 ` [PATCH v4 1/7] xen/riscv: enable CONFIG_HAS_DEVICE_TREE Oleksii Kurochko
                   ` (6 more replies)
  0 siblings, 7 replies; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Current patch series introduces device tree mapping for RISC-V
and necessary things for that such as:
- Fixmap mapping
- pmap
- Xen page table processing

Also there is another one depenency:
RISCV basic exception handling implementation [2]

[2] https://lore.kernel.org/xen-devel/cover.1722960083.git.oleksii.kurochko@gmail.com/T/#t

---
Changes in v4:
 - Drop depedency from common devicre tree patch series as it was merged to
   staging.
 - Update the cover letter message.
 - All other changes are patch specific so please look at the patch.
---
Changes in v3:
 - Introduce SBI RFENCE extension support.
 - Introduce and initialize pcpu_info[] and __cpuid_to_hartid_map[] and functionality
   to work with this arrays.
 - Make page table handling arch specific instead of trying to make it generic.
 - All other changes are patch specific so please look at the patch.
---
Changes in v2:
 - Update the cover letter message
 - introduce fixmap mapping
 - introduce pmap
 - introduce CONFIG_GENREIC_PT
 - update use early_fdt_map() after MMU is enabled.
---

Oleksii Kurochko (7):
  xen/riscv: enable CONFIG_HAS_DEVICE_TREE
  xen/riscv: set up fixmap mappings
  xen/riscv: introduce asm/pmap.h header
  xen/riscv: introduce functionality to work with CPU info
  xen/riscv: introduce and initialize SBI RFENCE extension
  xen/riscv: page table handling
  xen/riscv: introduce early_fdt_map()

 xen/arch/riscv/Kconfig                      |   2 +
 xen/arch/riscv/Makefile                     |   3 +
 xen/arch/riscv/include/asm/config.h         |   8 +
 xen/arch/riscv/include/asm/fixmap.h         |  44 +++
 xen/arch/riscv/include/asm/flushtlb.h       |  19 +
 xen/arch/riscv/include/asm/mm.h             |   6 +
 xen/arch/riscv/include/asm/page.h           |  88 +++++
 xen/arch/riscv/include/asm/pmap.h           |  36 ++
 xen/arch/riscv/include/asm/processor.h      |  28 +-
 xen/arch/riscv/include/asm/riscv_encoding.h |   1 +
 xen/arch/riscv/include/asm/sbi.h            |  64 +++
 xen/arch/riscv/include/asm/smp.h            |  10 +
 xen/arch/riscv/mm.c                         | 101 ++++-
 xen/arch/riscv/pt.c                         | 408 ++++++++++++++++++++
 xen/arch/riscv/sbi.c                        | 252 +++++++++++-
 xen/arch/riscv/setup.c                      |  26 ++
 xen/arch/riscv/smp.c                        |   4 +
 xen/arch/riscv/smpboot.c                    |  12 +
 xen/arch/riscv/xen.lds.S                    |   2 +-
 19 files changed, 1104 insertions(+), 10 deletions(-)
 create mode 100644 xen/arch/riscv/include/asm/fixmap.h
 create mode 100644 xen/arch/riscv/include/asm/pmap.h
 create mode 100644 xen/arch/riscv/pt.c
 create mode 100644 xen/arch/riscv/smp.c
 create mode 100644 xen/arch/riscv/smpboot.c

-- 
2.45.2



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v4 1/7] xen/riscv: enable CONFIG_HAS_DEVICE_TREE
  2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
@ 2024-08-09 16:19 ` Oleksii Kurochko
  2024-08-09 16:19 ` [PATCH v4 2/7] xen/riscv: set up fixmap mappings Oleksii Kurochko
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Enable build of generic functionality for working with device
tree for RISC-V.
Also, a collection of functions for parsing memory map and other
boot information from a device tree are available now.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V4:
 - Nothing changed. Only rebase
---
Changes in V3:
 - add Acked-by: Jan Beulich <jbeulich@suse.com>
 - update the commit message
---
Changes in V2:
 - move 'select HAS_DEVICE_TREE' to CONFIG_RISCV.
---
Changes in V1:
 - new patch
---
 xen/arch/riscv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/arch/riscv/Kconfig b/xen/arch/riscv/Kconfig
index f531e96657..259eea8d3b 100644
--- a/xen/arch/riscv/Kconfig
+++ b/xen/arch/riscv/Kconfig
@@ -2,6 +2,7 @@ config RISCV
 	def_bool y
 	select FUNCTION_ALIGNMENT_16B
 	select GENERIC_BUG_FRAME
+	select HAS_DEVICE_TREE
 
 config RISCV_64
 	def_bool y
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 2/7] xen/riscv: set up fixmap mappings
  2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
  2024-08-09 16:19 ` [PATCH v4 1/7] xen/riscv: enable CONFIG_HAS_DEVICE_TREE Oleksii Kurochko
@ 2024-08-09 16:19 ` Oleksii Kurochko
  2024-08-13  8:22   ` Jan Beulich
  2024-08-09 16:19 ` [PATCH v4 3/7] xen/riscv: introduce asm/pmap.h header Oleksii Kurochko
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Set up fixmap mappings and the L0 page table for fixmap support.

Define new macros in riscv/config.h for calculating
the FIXMAP_BASE address, including BOOT_FDT_VIRT_{START, SIZE},
XEN_VIRT_SIZE, and XEN_VIRT_END.

Update the check for Xen size in riscv/lds.S to use
XEN_VIRT_SIZE instead of a hardcoded constant.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - move definitions of XEN_VIRT_SIZE, BOOT_FDT_VIRT_{START,SIZE}, FIXMAP_{BASE,ADDR}
   below XEN_VIRT_START to have definitions appear in order.
 - define FIX_LAST as (FIX_MISC + 1) to have a guard slot at the end.
 - s/enumerated/numbered in the comment
 - update the cycle which looks for L1 page table in setup_fixmap_mapping_function() and
   the comment above him.
 - drop fences inside write_pte() and put 'fence r,r' in setup_fixmap() before sfence_vma().
 - update the commit message
 - drop printk message inside setup_fixmap().
---
Changes in V3:
 - s/XEN_SIZE/XEN_VIRT_SIZE
 - drop usage of XEN_VIRT_END.
 - sort newly introduced defines in config.h by address
 - code style fixes
 - drop runtime check of that pte is valid as it was checked in L1 page table finding cycle by BUG_ON().
 - update implementation of write_pte() with FENCE rw, rw.
 - add BUILD_BUG_ON() to check that amount of entries aren't bigger then entries in page table.
 - drop set_fixmap, clear_fixmap declarations as they aren't used and defined now
 - update the commit message.
 - s/__ASM_FIXMAP_H/ASM_FIXMAP_H
 - add SPDX-License-Identifier: GPL-2.0 
---
 xen/arch/riscv/include/asm/config.h |  8 ++++++
 xen/arch/riscv/include/asm/fixmap.h | 44 +++++++++++++++++++++++++++++
 xen/arch/riscv/include/asm/mm.h     |  2 ++
 xen/arch/riscv/include/asm/page.h   |  6 ++++
 xen/arch/riscv/mm.c                 | 43 ++++++++++++++++++++++++++++
 xen/arch/riscv/setup.c              |  2 ++
 xen/arch/riscv/xen.lds.S            |  2 +-
 7 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 xen/arch/riscv/include/asm/fixmap.h

diff --git a/xen/arch/riscv/include/asm/config.h b/xen/arch/riscv/include/asm/config.h
index 50583aafdc..4b4cc529a9 100644
--- a/xen/arch/riscv/include/asm/config.h
+++ b/xen/arch/riscv/include/asm/config.h
@@ -74,6 +74,14 @@
 #error "unsupported RV_STAGE1_MODE"
 #endif
 
+#define XEN_VIRT_SIZE           MB(2)
+
+#define BOOT_FDT_VIRT_START     (XEN_VIRT_START + XEN_VIRT_SIZE)
+#define BOOT_FDT_VIRT_SIZE      MB(4)
+
+#define FIXMAP_BASE             (BOOT_FDT_VIRT_START + BOOT_FDT_VIRT_SIZE)
+#define FIXMAP_ADDR(n)          (FIXMAP_BASE + (n) * PAGE_SIZE)
+
 #define DIRECTMAP_SLOT_END      509
 #define DIRECTMAP_SLOT_START    200
 #define DIRECTMAP_VIRT_START    SLOTN(DIRECTMAP_SLOT_START)
diff --git a/xen/arch/riscv/include/asm/fixmap.h b/xen/arch/riscv/include/asm/fixmap.h
new file mode 100644
index 0000000000..2ecd05dd9f
--- /dev/null
+++ b/xen/arch/riscv/include/asm/fixmap.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * fixmap.h: compile-time virtual memory allocation
+ */
+#ifndef ASM_FIXMAP_H
+#define ASM_FIXMAP_H
+
+#include <xen/bug.h>
+#include <xen/page-size.h>
+#include <xen/pmap.h>
+
+#include <asm/page.h>
+
+/* Fixmap slots */
+#define FIX_PMAP_BEGIN (0) /* Start of PMAP */
+#define FIX_PMAP_END (FIX_PMAP_BEGIN + NUM_FIX_PMAP - 1) /* End of PMAP */
+#define FIX_MISC (FIX_PMAP_END + 1)  /* Ephemeral mappings of hardware */
+
+#define FIX_LAST (FIX_MISC + 1) /* +1 means a guard slot */
+
+#define FIXADDR_START FIXMAP_ADDR(0)
+#define FIXADDR_TOP FIXMAP_ADDR(FIX_LAST)
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Direct access to xen_fixmap[] should only happen when {set,
+ * clear}_fixmap() is unusable (e.g. where we would end up to
+ * recursively call the helpers).
+ */
+extern pte_t xen_fixmap[];
+
+#define fix_to_virt(slot) ((void *)FIXMAP_ADDR(slot))
+
+static inline unsigned int virt_to_fix(vaddr_t vaddr)
+{
+    BUG_ON(vaddr >= FIXADDR_TOP || vaddr < FIXADDR_START);
+
+    return ((vaddr - FIXADDR_START) >> PAGE_SHIFT);
+}
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* ASM_FIXMAP_H */
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 25af9e1aaa..a0bdc2bc3a 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -255,4 +255,6 @@ static inline unsigned int arch_get_dma_bitsize(void)
     return 32; /* TODO */
 }
 
+void setup_fixmap_mappings(void);
+
 #endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index c831e16417..5db3edb100 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -81,6 +81,12 @@ static inline void flush_page_to_ram(unsigned long mfn, bool sync_icache)
     BUG_ON("unimplemented");
 }
 
+/* Write a pagetable entry. */
+static inline void write_pte(pte_t *p, pte_t pte)
+{
+    *p = pte;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_RISCV_PAGE_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 7d09e781bf..b8ff91cf4e 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -12,6 +12,7 @@
 #include <asm/early_printk.h>
 #include <asm/csr.h>
 #include <asm/current.h>
+#include <asm/fixmap.h>
 #include <asm/page.h>
 #include <asm/processor.h>
 
@@ -49,6 +50,9 @@ stage1_pgtbl_root[PAGETABLE_ENTRIES];
 pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
 stage1_pgtbl_nonroot[PGTBL_INITIAL_COUNT * PAGETABLE_ENTRIES];
 
+pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
+xen_fixmap[PAGETABLE_ENTRIES];
+
 #define HANDLE_PGTBL(curr_lvl_num)                                          \
     index = pt_index(curr_lvl_num, page_addr);                              \
     if ( pte_is_valid(pgtbl[index]) )                                       \
@@ -191,6 +195,45 @@ static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
     return is_mode_supported;
 }
 
+void __init setup_fixmap_mappings(void)
+{
+    pte_t *pte, tmp;
+    unsigned int i;
+
+    BUILD_BUG_ON(FIX_LAST >= PAGETABLE_ENTRIES);
+
+    pte = &stage1_pgtbl_root[pt_index(HYP_PT_ROOT_LEVEL, FIXMAP_ADDR(0))];
+
+    /*
+     * In RISC-V page table levels are numbered from Lx to L0 where
+     * x is the highest page table level for currect  MMU mode ( for example,
+     * for Sv39 has 3 page tables so the x = 2 (L2 -> L1 -> L0) ).
+     *
+     * In this cycle we want to find L1 page table because as L0 page table
+     * xen_fixmap[] will be used.
+     */
+    for ( i = HYP_PT_ROOT_LEVEL; i-- > 1; )
+    {
+        BUG_ON(!pte_is_valid(*pte));
+
+        pte = (pte_t *)LOAD_TO_LINK(pte_to_paddr(*pte));
+        pte = &pte[pt_index(i, FIXMAP_ADDR(0))];
+    }
+
+    BUG_ON(pte_is_valid(*pte));
+
+    tmp = paddr_to_pte(LINK_TO_LOAD((unsigned long)&xen_fixmap), PTE_TABLE);
+    write_pte(pte, tmp);
+
+    RISCV_FENCE(rw, rw);
+    sfence_vma();
+
+    /*
+     * We only need the zeroeth table allocated, but not the PTEs set, because
+     * set_fixmap() will set them on the fly.
+     */
+}
+
 /*
  * setup_initial_pagetables:
  *
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 4defad68f4..13f0e8c77d 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -46,6 +46,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
     test_macros_from_bug_h();
 #endif
 
+    setup_fixmap_mappings();
+
     printk("All set up\n");
 
     for ( ;; )
diff --git a/xen/arch/riscv/xen.lds.S b/xen/arch/riscv/xen.lds.S
index 070b19d915..7a683f6065 100644
--- a/xen/arch/riscv/xen.lds.S
+++ b/xen/arch/riscv/xen.lds.S
@@ -181,6 +181,6 @@ ASSERT(!SIZEOF(.got.plt),  ".got.plt non-empty")
  * Changing the size of Xen binary can require an update of
  * PGTBL_INITIAL_COUNT.
  */
-ASSERT(_end - _start <= MB(2), "Xen too large for early-boot assumptions")
+ASSERT(_end - _start <= XEN_VIRT_SIZE, "Xen too large for early-boot assumptions")
 
 ASSERT(_ident_end - _ident_start <= IDENT_AREA_SIZE, "identity region is too big");
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 3/7] xen/riscv: introduce asm/pmap.h header
  2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
  2024-08-09 16:19 ` [PATCH v4 1/7] xen/riscv: enable CONFIG_HAS_DEVICE_TREE Oleksii Kurochko
  2024-08-09 16:19 ` [PATCH v4 2/7] xen/riscv: set up fixmap mappings Oleksii Kurochko
@ 2024-08-09 16:19 ` Oleksii Kurochko
  2024-08-13  8:41   ` Jan Beulich
  2024-08-09 16:19 ` [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info Oleksii Kurochko
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce arch_pmap_{un}map functions and select HAS_PMAP for CONFIG_RISCV.

Add pte_from_mfn() for use in arch_pmap_map().

Introduce flush_xen_tlb_one_local() and use it in arch_pmap_{un}map().

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - mark arch_pmap_{un}map() as __init: documentation purpose and
   a necessary (but not sufficient) condition here, to validly
   use local TLB flushes only.
 - add flush_xen_tlb_one_local() to arch_pmap_map() as absense of
   "negative" TLB entrues will be guaranted only in the case
   when Svvptc extension is present.
 - s/mfn_from_pte/pte_from_mfn
 - drop flush_xen_tlb_range_va_local() as it isn't used in this patch
 - drop mfn_to_xen_entry() as pte_from_mfn() does the same thing
 - add flags argument to pte_from_mfn().
 - update the commit message.
 - s/flush_xen_tlb_range_va_local/flush_tlb_range_va_local
---
Changes in V3:
 - rename argument of function mfn_to_xen_entry(..., attr -> flags ).
 - update the code of mfn_to_xen_entry() to use flags argument.
 - add blank in mfn_from_pte() in return line.
 - introduce flush_xen_tlb_range_va_local() and use it inside arch_pmap_{un}map().
 - s/__ASM_PMAP_H__/ASM_PMAP_H
 - add SPDX-License-Identifier: GPL-2.0 
---
 xen/arch/riscv/Kconfig                |  1 +
 xen/arch/riscv/include/asm/flushtlb.h |  6 +++++
 xen/arch/riscv/include/asm/page.h     |  6 +++++
 xen/arch/riscv/include/asm/pmap.h     | 36 +++++++++++++++++++++++++++
 4 files changed, 49 insertions(+)
 create mode 100644 xen/arch/riscv/include/asm/pmap.h

diff --git a/xen/arch/riscv/Kconfig b/xen/arch/riscv/Kconfig
index 259eea8d3b..0112aa8778 100644
--- a/xen/arch/riscv/Kconfig
+++ b/xen/arch/riscv/Kconfig
@@ -3,6 +3,7 @@ config RISCV
 	select FUNCTION_ALIGNMENT_16B
 	select GENERIC_BUG_FRAME
 	select HAS_DEVICE_TREE
+	select HAS_PMAP
 
 config RISCV_64
 	def_bool y
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index 7ce32bea0b..f4a735fd6c 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -5,6 +5,12 @@
 #include <xen/bug.h>
 #include <xen/cpumask.h>
 
+/* Flush TLB of local processor for address va. */
+static inline void flush_tlb_one_local(vaddr_t va)
+{
+    asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
+}
+
 /*
  * Filter the given set of CPUs, removing those that definitely flushed their
  * TLB since @page_timestamp.
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 5db3edb100..d96db0e322 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -87,6 +87,12 @@ static inline void write_pte(pte_t *p, pte_t pte)
     *p = pte;
 }
 
+static inline pte_t pte_from_mfn(mfn_t mfn, unsigned int flags)
+{
+    unsigned long pte = (mfn_x(mfn) << PTE_PPN_SHIFT) | flags;
+    return (pte_t){ .pte = pte };
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_RISCV_PAGE_H */
diff --git a/xen/arch/riscv/include/asm/pmap.h b/xen/arch/riscv/include/asm/pmap.h
new file mode 100644
index 0000000000..60065c996f
--- /dev/null
+++ b/xen/arch/riscv/include/asm/pmap.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_PMAP_H
+#define ASM_PMAP_H
+
+#include <xen/bug.h>
+#include <xen/init.h>
+#include <xen/mm.h>
+#include <xen/page-size.h>
+
+#include <asm/fixmap.h>
+#include <asm/flushtlb.h>
+#include <asm/system.h>
+
+static inline void __init arch_pmap_map(unsigned int slot, mfn_t mfn)
+{
+    pte_t *entry = &xen_fixmap[slot];
+    pte_t pte;
+
+    ASSERT(!pte_is_valid(*entry));
+
+    pte = pte_from_mfn(mfn, PAGE_HYPERVISOR_RW);
+    write_pte(entry, pte);
+
+    flush_tlb_one_local(FIXMAP_ADDR(slot));
+}
+
+static inline void __init arch_pmap_unmap(unsigned int slot)
+{
+    pte_t pte = {};
+
+    write_pte(&xen_fixmap[slot], pte);
+
+    flush_tlb_one_local(FIXMAP_ADDR(slot));
+}
+
+#endif /* ASM_PMAP_H */
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (2 preceding siblings ...)
  2024-08-09 16:19 ` [PATCH v4 3/7] xen/riscv: introduce asm/pmap.h header Oleksii Kurochko
@ 2024-08-09 16:19 ` Oleksii Kurochko
  2024-08-13  8:54   ` Jan Beulich
  2024-08-09 16:19 ` [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension Oleksii Kurochko
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce struct pcpu_info to store pCPU-related information.
Initially, it includes only processor_id, but it will be extended
to include guest CPU information and temporary variables for
saving/restoring vCPU registers.

Add set_processor_id() and get_processor_id() functions to set
and retrieve the processor_id stored in pcpu_info.

Introduce cpuid_to_hartid_map[NR_CPUS] to map Xen logical CPUs to
hart IDs (physical CPU IDs). Add auxiliary macros cpuid_to_hartid()
for convenient access to this mapping.

Define smp_processor_id() to provide accurate information,
replacing the previous "dummy" value of 0.

Initialize tp registers to point to pcpu_info[0].
Set processor_id to 0 for logical CPU 0 and store the physical CPU ID
for the current logical CPU in cpuid_to_hartid_map[].

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - wrap id with () inside set_processor_id().
 - code style fixes
 - update BUG_ON(id > NR_CPUS) in smp_processor_id() and drop the comment
   above BUG_ON().
 - s/__cpuid_to_hartid_map/cpuid_to_hartid_map
 - s/cpuid_to_hartid_map/cpuid_to_harti ( here cpuid_to_hartid_map is the name
   of the macros ).
 - update the commit message above the code of TP register initialization in
   start_xen().
 - s/smp_setup_processor_id/smp_setup_bootcpu_id
 - update the commit message.
 - cleanup headers which are included in <asm/processor.h>
---
Changes in V3:
 - new patch.
---
 xen/arch/riscv/Makefile                |  2 ++
 xen/arch/riscv/include/asm/processor.h | 28 ++++++++++++++++++++++++--
 xen/arch/riscv/include/asm/smp.h       | 10 +++++++++
 xen/arch/riscv/setup.c                 | 14 +++++++++++++
 xen/arch/riscv/smp.c                   |  4 ++++
 xen/arch/riscv/smpboot.c               | 12 +++++++++++
 6 files changed, 68 insertions(+), 2 deletions(-)
 create mode 100644 xen/arch/riscv/smp.c
 create mode 100644 xen/arch/riscv/smpboot.c

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 81b77b13d6..334fd24547 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -4,6 +4,8 @@ obj-y += mm.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
+obj-y += smp.o
+obj-y += smpboot.o
 obj-y += stubs.o
 obj-y += traps.o
 obj-y += vm_event.o
diff --git a/xen/arch/riscv/include/asm/processor.h b/xen/arch/riscv/include/asm/processor.h
index 3ae164c265..fd4e9b4a37 100644
--- a/xen/arch/riscv/include/asm/processor.h
+++ b/xen/arch/riscv/include/asm/processor.h
@@ -12,8 +12,32 @@
 
 #ifndef __ASSEMBLY__
 
-/* TODO: need to be implemeted */
-#define smp_processor_id() 0
+#include <xen/bug.h>
+
+register struct pcpu_info *tp asm ("tp");
+
+struct pcpu_info {
+    unsigned int processor_id;
+};
+
+/* tp points to one of these */
+extern struct pcpu_info pcpu_info[NR_CPUS];
+
+#define get_processor_id()      (tp->processor_id)
+#define set_processor_id(id)    do { \
+    tp->processor_id = (id);         \
+} while (0)
+
+static inline unsigned int smp_processor_id(void)
+{
+    unsigned int id;
+
+    id = get_processor_id();
+
+    BUG_ON(id > NR_CPUS);
+
+    return id;
+}
 
 /* On stack VCPU state */
 struct cpu_user_regs
diff --git a/xen/arch/riscv/include/asm/smp.h b/xen/arch/riscv/include/asm/smp.h
index b1ea91b1eb..9f49d2bc8b 100644
--- a/xen/arch/riscv/include/asm/smp.h
+++ b/xen/arch/riscv/include/asm/smp.h
@@ -5,6 +5,8 @@
 #include <xen/cpumask.h>
 #include <xen/percpu.h>
 
+#define INVALID_HARTID UINT_MAX
+
 DECLARE_PER_CPU(cpumask_var_t, cpu_sibling_mask);
 DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
 
@@ -14,6 +16,14 @@ DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
  */
 #define park_offline_cpus false
 
+void smp_set_bootcpu_id(unsigned long boot_cpu_hartid);
+
+/*
+ * Mapping between linux logical cpu index and hartid.
+ */
+extern unsigned long cpuid_to_hartid_map[NR_CPUS];
+#define cpuid_to_hartid(cpu) cpuid_to_hartid_map[cpu]
+
 #endif
 
 /*
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 13f0e8c77d..c9446e6038 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -8,6 +8,7 @@
 #include <public/version.h>
 
 #include <asm/early_printk.h>
+#include <asm/smp.h>
 #include <asm/traps.h>
 
 void arch_get_xen_caps(xen_capabilities_info_t *info)
@@ -40,6 +41,19 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 {
     remove_identity_mapping();
 
+    /*
+     * tp register contains an address of physical cpu information.
+     * So write physical CPU info of boot cpu to tp register
+     * It will be used later by get_processor_id() ( look at
+     * <asm/processor.h> ):
+     *   #define get_processor_id()    (tp->processor_id)
+     */
+    asm volatile ( "mv tp, %0" : : "r"((unsigned long)&pcpu_info[0]) );
+
+    set_processor_id(0);
+
+    smp_set_bootcpu_id(bootcpu_id);
+
     trap_init();
 
 #ifdef CONFIG_SELF_TESTS
diff --git a/xen/arch/riscv/smp.c b/xen/arch/riscv/smp.c
new file mode 100644
index 0000000000..779d955e3a
--- /dev/null
+++ b/xen/arch/riscv/smp.c
@@ -0,0 +1,4 @@
+#include <xen/smp.h>
+
+/* tp points to one of these per cpu */
+struct pcpu_info pcpu_info[NR_CPUS];
diff --git a/xen/arch/riscv/smpboot.c b/xen/arch/riscv/smpboot.c
new file mode 100644
index 0000000000..6690522a52
--- /dev/null
+++ b/xen/arch/riscv/smpboot.c
@@ -0,0 +1,12 @@
+#include <xen/init.h>
+#include <xen/sections.h>
+#include <xen/smp.h>
+
+unsigned long cpuid_to_hartid_map[NR_CPUS] __ro_after_init = {
+    [0 ... NR_CPUS - 1] = INVALID_HARTID
+};
+
+void __init smp_set_bootcpu_id(unsigned long boot_cpu_hartid)
+{
+    cpuid_to_hartid(0) = boot_cpu_hartid;
+}
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (3 preceding siblings ...)
  2024-08-09 16:19 ` [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info Oleksii Kurochko
@ 2024-08-09 16:19 ` Oleksii Kurochko
  2024-08-13  9:34   ` Jan Beulich
  2024-08-09 16:19 ` [PATCH v4 6/7] xen/riscv: page table handling Oleksii Kurochko
  2024-08-09 16:19 ` [PATCH v4 7/7] xen/riscv: introduce early_fdt_map() Oleksii Kurochko
  6 siblings, 1 reply; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce functions to work with the SBI RFENCE extension for issuing
various fence operations to remote CPUs.

Add the sbi_init() function along with auxiliary functions and macro
definitions for proper initialization and checking the availability of
SBI extensions. Currently, this is implemented only for RFENCE.

Introduce sbi_remote_sfence_vma() to send SFENCE_VMA instructions to
a set of target HARTs. This will support the implementation of
flush_xen_tlb_range_va().

Integrate __sbi_rfence_v02 from Linux kernel 6.6.0-rc4 with minimal
modifications:
 - Adapt to Xen code style.
 - Use cpuid_to_hartid() instead of cpuid_to_hartid_map[].
 - Update BIT(...) to BIT(..., UL).
 - Rename __sbi_rfence_v02_call to __sbi_rfence_v02_real and
   remove the unused arg5.
 - Handle NULL cpu_mask to execute rfence on all CPUs by calling
  __sbi_rfence_v02_real(..., 0UL, -1UL,...) instead of creating hmask.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - update the commit message.
 - code style fixes
 - update return type of sbi_has_rfence() from int to bool and drop
   conditional operator inside implementation.
 - Update mapping of SBI_ERR_FAILURE in sbi_err_map_xen_errno().
 - Update return type of sbi_spec_is_0_1() and drop conditional operator
   inside implementation.
 - s/0x%lx/%#lx
 - update the comment above declaration of sbi_remote_sfence_vma() with
   more detailed explanation what the function does.
 - update prototype of sbi_remote_sfence_vma(). Now it receives cpumask_t
   and returns int.
 - refactor __sbi_rfence_v02() take from the Linux kernel as it takes into
   account a case that hart id could be from different hbase. For example,
   the case when hart IDs are the following 0, 3, 65, 2. Or the case when
   hart IDs are unsorted: 0 3 1 2.
 - drop sbi_cpumask_to_hartmask() as it is not needed anymore
 - Update the prototype of sbi_remote_sfence_vma() and implemntation accordingly
   to the fact it returns 'int'.
 - s/flush_xen_tlb_one_local/flush_tlb_one_local
---
Changes in V3:
 - new patch.
---
 xen/arch/riscv/include/asm/sbi.h |  64 ++++++++
 xen/arch/riscv/sbi.c             | 252 ++++++++++++++++++++++++++++++-
 xen/arch/riscv/setup.c           |   3 +
 3 files changed, 318 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/include/asm/sbi.h b/xen/arch/riscv/include/asm/sbi.h
index 0e6820a4ed..931613646d 100644
--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -12,8 +12,42 @@
 #ifndef __ASM_RISCV_SBI_H__
 #define __ASM_RISCV_SBI_H__
 
+#include <xen/cpumask.h>
+
 #define SBI_EXT_0_1_CONSOLE_PUTCHAR		0x1
 
+#define SBI_EXT_BASE                    0x10
+#define SBI_EXT_RFENCE                  0x52464E43
+
+/* SBI function IDs for BASE extension */
+#define SBI_EXT_BASE_GET_SPEC_VERSION   0x0
+#define SBI_EXT_BASE_GET_IMP_ID         0x1
+#define SBI_EXT_BASE_GET_IMP_VERSION    0x2
+#define SBI_EXT_BASE_PROBE_EXT          0x3
+
+/* SBI function IDs for RFENCE extension */
+#define SBI_EXT_RFENCE_REMOTE_FENCE_I           0x0
+#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA        0x1
+#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID   0x2
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA       0x3
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID  0x4
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA       0x5
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID  0x6
+
+#define SBI_SPEC_VERSION_MAJOR_SHIFT    24
+#define SBI_SPEC_VERSION_MAJOR_MASK     0x7f
+#define SBI_SPEC_VERSION_MINOR_MASK     0xffffff
+
+/* SBI return error codes */
+#define SBI_SUCCESS             0
+#define SBI_ERR_FAILURE         (-1)
+#define SBI_ERR_NOT_SUPPORTED   (-2)
+#define SBI_ERR_INVALID_PARAM   (-3)
+#define SBI_ERR_DENIED          (-4)
+#define SBI_ERR_INVALID_ADDRESS (-5)
+
+#define SBI_SPEC_VERSION_DEFAULT        0x1
+
 struct sbiret {
     long error;
     long value;
@@ -31,4 +65,34 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
  */
 void sbi_console_putchar(int ch);
 
+/*
+ * Check underlying SBI implementation has RFENCE
+ *
+ * @return 1 for supported AND 0 for not-supported
+ */
+bool sbi_has_rfence(void);
+
+/*
+ * Instructs the remote harts to execute one or more SFENCE.VMA
+ * instructions, covering the range of virtual addresses between
+ * start_addr and start_addr + size.
+ *
+ * Returns 0 if IPI was sent to all the targeted harts successfully
+ * or negative value if start_addr or size is not valid.
+ *
+ * @hart_mask a cpu mask containing all the target harts.
+ * @param start virtual address start
+ * @param size virtual address range size
+ */
+int sbi_remote_sfence_vma(cpumask_t *cpu_mask,
+                          unsigned long start_addr,
+                          unsigned long size);
+
+/*
+ * Initialize SBI library
+ *
+ * @return 0 on success, otherwise negative errno on failure
+ */
+int sbi_init(void);
+
 #endif /* __ASM_RISCV_SBI_H__ */
diff --git a/xen/arch/riscv/sbi.c b/xen/arch/riscv/sbi.c
index 0ae166c861..39e46ef859 100644
--- a/xen/arch/riscv/sbi.c
+++ b/xen/arch/riscv/sbi.c
@@ -7,11 +7,23 @@
  * Modified by Bobby Eshleman (bobby.eshleman@gmail.com).
  *
  * Copyright (c) 2019 Western Digital Corporation or its affiliates.
- * Copyright (c) 2021-2023 Vates SAS.
+ * Copyright (c) 2021-2024 Vates SAS.
  */
 
+#include <xen/compiler.h>
+#include <xen/const.h>
+#include <xen/cpumask.h>
+#include <xen/errno.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/smp.h>
+
+#include <asm/processor.h>
 #include <asm/sbi.h>
 
+static unsigned long sbi_spec_version = SBI_SPEC_VERSION_DEFAULT;
+static unsigned long sbi_fw_id, sbi_fw_version;
+
 struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
                         unsigned long arg0, unsigned long arg1,
                         unsigned long arg2, unsigned long arg3,
@@ -38,7 +50,245 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
     return ret;
 }
 
+static int sbi_err_map_xen_errno(int err)
+{
+    switch ( err )
+    {
+    case SBI_SUCCESS:
+        return 0;
+    case SBI_ERR_DENIED:
+        return -EACCES;
+    case SBI_ERR_INVALID_PARAM:
+        return -EINVAL;
+    case SBI_ERR_INVALID_ADDRESS:
+        return -EFAULT;
+    case SBI_ERR_NOT_SUPPORTED:
+        return -EOPNOTSUPP;
+    case SBI_ERR_FAILURE:
+        fallthrough;
+    default:
+        return -ENOSYS;
+    };
+}
+
 void sbi_console_putchar(int ch)
 {
     sbi_ecall(SBI_EXT_0_1_CONSOLE_PUTCHAR, 0, ch, 0, 0, 0, 0, 0);
 }
+
+static unsigned long sbi_major_version(void)
+{
+    return (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) &
+        SBI_SPEC_VERSION_MAJOR_MASK;
+}
+
+static unsigned long sbi_minor_version(void)
+{
+    return sbi_spec_version & SBI_SPEC_VERSION_MINOR_MASK;
+}
+
+static long sbi_ext_base_func(long fid)
+{
+    struct sbiret ret;
+
+    ret = sbi_ecall(SBI_EXT_BASE, fid, 0, 0, 0, 0, 0, 0);
+    if ( !ret.error )
+        return ret.value;
+    else
+        return ret.error;
+}
+
+static int __sbi_rfence_v02_real(unsigned long fid,
+                                 unsigned long hmask, unsigned long hbase,
+                                 unsigned long start, unsigned long size,
+                                 unsigned long arg4)
+{
+    struct sbiret ret = {0};
+    int result = 0;
+
+    switch ( fid )
+    {
+    case SBI_EXT_RFENCE_REMOTE_FENCE_I:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                0, 0, 0, 0);
+        break;
+    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                start, size, 0, 0);
+        break;
+    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                start, size, arg4, 0);
+        break;
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                start, size, 0, 0);
+        break;
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                start, size, arg4, 0);
+        break;
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                start, size, 0, 0);
+        break;
+    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID:
+        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
+                start, size, arg4, 0);
+        break;
+
+    default:
+        printk("%s: unknown function ID [%lu]\n",
+               __func__, fid);
+        result = -EINVAL;
+        break;
+    };
+
+    if ( ret.error )
+    {
+        result = sbi_err_map_xen_errno(ret.error);
+        printk("%s: hbase=%lu hmask=%#lx failed (error %d)\n",
+               __func__, hbase, hmask, result);
+    }
+
+    return result;
+}
+
+static int __sbi_rfence_v02(unsigned long fid,
+                            const cpumask_t *cpu_mask,
+                            unsigned long start, unsigned long size,
+                            unsigned long arg4, unsigned long arg5)
+{
+    unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
+    int result;
+
+    /*
+     * hart_mask_base can be set to -1 to indicate that hart_mask can be
+     * ignored and all available harts must be considered.
+     */
+    if ( !cpu_mask )
+        return __sbi_rfence_v02_real(fid, 0UL, -1UL, start, size, arg4);
+
+    for_each_cpu( cpuid, cpu_mask )
+    {
+        hartid = cpuid_to_hartid(cpuid);
+        if ( hmask )
+        {
+            if ( hartid + BITS_PER_LONG <= htop ||
+                 hbase + BITS_PER_LONG <= hartid )
+            {
+                result = __sbi_rfence_v02_real(fid, hmask, hbase,
+                                               start, size, arg4);
+                if ( result )
+                    return result;
+                hmask = 0;
+            }
+            else if ( hartid < hbase )
+            {
+                /* shift the mask to fit lower hartid */
+                hmask <<= hbase - hartid;
+                hbase = hartid;
+            }
+        }
+
+        if ( !hmask )
+        {
+            hbase = hartid;
+            htop = hartid;
+        } else if ( hartid > htop )
+            htop = hartid;
+
+        hmask |= BIT(hartid - hbase, UL);
+    }
+
+    if ( hmask )
+    {
+        result = __sbi_rfence_v02_real(fid, hmask, hbase,
+                                       start, size, arg4);
+        if ( result )
+            return result;
+    }
+
+    return 0;
+}
+
+static int (*__sbi_rfence)(unsigned long fid,
+                           const cpumask_t *cpu_mask,
+                           unsigned long start, unsigned long size,
+                           unsigned long arg4, unsigned long arg5) = NULL;
+
+int sbi_remote_sfence_vma(cpumask_t *cpu_mask,
+                          unsigned long start_addr,
+                          unsigned long size)
+{
+    return __sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA,
+                        cpu_mask, start_addr, size, 0, 0);
+}
+
+#define sbi_get_spec_version()  \
+    sbi_ext_base_func(SBI_EXT_BASE_GET_SPEC_VERSION)
+
+#define sbi_get_firmware_id()   \
+    sbi_ext_base_func(SBI_EXT_BASE_GET_IMP_ID)
+
+#define sbi_get_firmware_version()  \
+    sbi_ext_base_func(SBI_EXT_BASE_GET_IMP_VERSION)
+
+int sbi_probe_extension(long extid)
+{
+    struct sbiret ret;
+
+    ret = sbi_ecall(SBI_EXT_BASE, SBI_EXT_BASE_PROBE_EXT, extid,
+                    0, 0, 0, 0, 0);
+    if ( !ret.error && ret.value )
+        return ret.value;
+
+    return -EOPNOTSUPP;
+}
+
+static bool sbi_spec_is_0_1(void)
+{
+    return (sbi_spec_version == SBI_SPEC_VERSION_DEFAULT);
+}
+
+bool sbi_has_rfence(void)
+{
+    return (__sbi_rfence != NULL);
+}
+
+int __init sbi_init(void)
+{
+    int ret;
+
+    ret = sbi_get_spec_version();
+    if ( ret > 0 )
+        sbi_spec_version = ret;
+
+    printk("SBI specification v%lu.%lu detected\n",
+            sbi_major_version(), sbi_minor_version());
+
+    if ( !sbi_spec_is_0_1() )
+    {
+        sbi_fw_id = sbi_get_firmware_id();
+        sbi_fw_version = sbi_get_firmware_version();
+
+        printk("SBI implementation ID=%#lx Version=%#lx\n",
+            sbi_fw_id, sbi_fw_version);
+
+        if ( sbi_probe_extension(SBI_EXT_RFENCE) > 0 )
+        {
+            __sbi_rfence = __sbi_rfence_v02;
+            printk("SBI v0.2 RFENCE extension detected\n");
+        }
+    } else {
+        BUG_ON("Ooops. SBI spec veriosn 0.1 detected. Need to add support");
+    }
+
+    if ( !sbi_has_rfence() )
+    {
+        BUG_ON("At the moment flush_xen_tlb_range_va() uses SBI rfence to "
+               "flush TLB for all CPUS!");
+    }
+
+    return 0;
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index c9446e6038..a49a8eee90 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -8,6 +8,7 @@
 #include <public/version.h>
 
 #include <asm/early_printk.h>
+#include <asm/sbi.h>
 #include <asm/smp.h>
 #include <asm/traps.h>
 
@@ -56,6 +57,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
     trap_init();
 
+    sbi_init();
+
 #ifdef CONFIG_SELF_TESTS
     test_macros_from_bug_h();
 #endif
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (4 preceding siblings ...)
  2024-08-09 16:19 ` [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension Oleksii Kurochko
@ 2024-08-09 16:19 ` Oleksii Kurochko
  2024-08-13 10:31   ` Jan Beulich
  2024-08-09 16:19 ` [PATCH v4 7/7] xen/riscv: introduce early_fdt_map() Oleksii Kurochko
  6 siblings, 1 reply; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Implement map_pages_to_xen() which requires several
functions to manage page tables and entries:
- pt_update()
- pt_mapping_level()
- pt_update_entry()
- pt_next_level()
- pt_check_entry()

To support these operations, add functions for creating,
mapping, and unmapping Xen tables:
- create_xen_table()
- xen_map_table()
- xen_unmap_table()

Introduce internal macros starting with PTE_* for convenience.
These macros closely resemble PTE bits, with the exception of
PTE_BLOCK, which indicates that a page larger than 4KB is
needed.
RISC-V detects superpages using `pte.x` and `pte.r`, as there
is no specific bit in the PTE for this purpose. From the RISC-V spec:
```
  ...
  4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 5.
     Otherwise, this PTE is a pointer to the next level of the page table.
     ... .
  5. A leaf PTE has been found.
     ...
  ...
```

The code doesn’t support super page shattering so 4KB pages are used as
default.
Additionaly as mentioed in RISC-V priviliged spec:
```
 After much deliberation, we have settled on a conventional page size of
 4 KiB for both RV32 and RV64. We expect this decision to ease the porting
 of low-level runtime software and device drivers.

 The TLB reach problem is ameliorated by transparent superpage support in
 modern operating systems [2]. Additionally, multi-level TLB hierarchies
 are quite inexpensive relative to the multi-level cache hierarchies whose
 address space they map.

 [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical,
     transparent operating system support for superpages.
     SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
```

In addition introduce flush_tlb_range_va() for TLB flushing across
CPUs after updating the PTE for the requested mapping.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - update the commit message.
 - drop xen_ prefix for functions: xen_pt_update(), xen_pt_mapping_level(),
   xen_pt_update_entry(), xen_pt_next_level(), xen_pt_check_entry().
 - drop 'select GENERIC_PT' for CONFIG_RISCV. There is no GENERIC_PT anymore.
 - update implementation of flush_xen_tlb_range_va and s/flush_xen_tlb_range_va/flush_tlb_range_va
 - s/pte_get_mfn/mfn_from_pte. Others similar definitions I decided not to touch as
   they were introduced before and this patter of naming such type of macros will be applied
   for newly introduced macros.
 - drop _PAGE_* definitions and use analogues of PTE_*.
 - introduce PTE_{W,X,R}_MASK and drop PAGE_{XN,W,X}_MASK. Also drop _PAGE_{*}_BIT
 - introduce PAGE_HYPERVISOR_RX.
 - drop unused now l3_table_offset.
 - drop struct pt_t as it was used only for one function. If it will be needed in the future
   pt_t will be re-introduced.
 - code styles fixes in pte_is_table(). drop level argument from t.
 - update implementation and prototype of pte_is_mapping().
 - drop level argument from pt_next_level().
 - introduce definition of SATP_PPN_MASK.
 - isolate PPN of CSR_SATP before shift by PAGE_SHIFT.
 - drop set_permission() functions as it is not used more then once.
 - update prototype of pt_check_entry(): drop level argument as it is not used.
 - pt_check_entry():
   - code style fixes
   - update the sanity check when modifying an entry
   - update the sanity check when when removing a mapping.
 - s/read_only/alloc_only.
 - code style fixes for pt_next_level().
 - pt_update_entry() changes:
   - drop arch_level variable inisde pt_update_entry()
   - drop convertion near virt to paddr_t in DECLARE_OFFSETS(offsets, virt);
   - pull out "goto out inside first 'for' cycle.
   - drop braces for 'if' cases which has only one line.
   - ident 'out' label with one blank.
   - update the comment above alloc_only and also definition to take into
     account  that if pte population was requested or not.
   - drop target variable and rename arch_target argument of the function to
     target.
 - pt_mapping_level() changes:
   - move the check if PTE_BLOCK should be mapped on the top of the function.
   - change int i to unsigned int and update 'for' cycle correspondingly.
 - update prototye of pt_update():
   - drop the comment  above nr_mfns and drop const to be consistent with other
     arguments.
   - always flush TLB at the end of the function as non-present entries can be put
     in the TLB.
   - add fence before TLB flush to ensure that PTEs are all updated before flushing.
 - s/XEN_TABLE_NORMAL_PAGE/XEN_TABLE_NORMAL
 - add a check in map_pages_to_xen() the mfn is not INVALID_MFN.
 - add the comment on top of pt_update() how mfn = INVALID_MFN is considered.
 - s/_PAGE_BLOCK/PTE_BLOCK.
 - add the comment with additional explanation for PTE_BLOCK.
 - drop defintion of FIRST_SIZE as it isn't used.
---
Changes in V3:
 - new patch. ( Technically it is reworked version of the generic approach
   which I tried to suggest in the previous version )
---

 xen/arch/riscv/Makefile                     |   1 +
 xen/arch/riscv/include/asm/flushtlb.h       |  13 +
 xen/arch/riscv/include/asm/mm.h             |   2 +
 xen/arch/riscv/include/asm/page.h           |  76 ++++
 xen/arch/riscv/include/asm/riscv_encoding.h |   1 +
 xen/arch/riscv/mm.c                         |   9 -
 xen/arch/riscv/pt.c                         | 408 ++++++++++++++++++++
 7 files changed, 501 insertions(+), 9 deletions(-)
 create mode 100644 xen/arch/riscv/pt.c

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 334fd24547..d058ea4e95 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 obj-y += entry.o
 obj-y += mm.o
+obj-y += pt.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index f4a735fd6c..d18b416a60 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -5,12 +5,25 @@
 #include <xen/bug.h>
 #include <xen/cpumask.h>
 
+#include <asm/sbi.h>
+
 /* Flush TLB of local processor for address va. */
 static inline void flush_tlb_one_local(vaddr_t va)
 {
     asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
 }
 
+/*
+ * Flush a range of VA's hypervisor mappings from the TLB of all
+ * processors in the inner-shareable domain.
+ */
+static inline void flush_tlb_range_va(vaddr_t va,
+                                      size_t size)
+{
+    BUG_ON(!sbi_has_rfence());
+    sbi_remote_sfence_vma(NULL, va, size);
+}
+
 /*
  * Filter the given set of CPUs, removing those that definitely flushed their
  * TLB since @page_timestamp.
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index a0bdc2bc3a..ce1557bb27 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -42,6 +42,8 @@ static inline void *maddr_to_virt(paddr_t ma)
 #define virt_to_mfn(va)     __virt_to_mfn(va)
 #define mfn_to_virt(mfn)    __mfn_to_virt(mfn)
 
+#define mfn_from_pte(pte) maddr_to_mfn(pte_to_paddr(pte))
+
 struct page_info
 {
     /* Each frame can be threaded onto a doubly-linked list. */
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index d96db0e322..0deb1d36aa 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -20,6 +20,11 @@
 #define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
 #define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
 
+/*
+ * PTE format:
+ * | XLEN-1  10 | 9             8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
+ *       PFN      reserved for SW   D   A   G   U   X   W   R   V
+ */
 #define PTE_VALID                   BIT(0, UL)
 #define PTE_READABLE                BIT(1, UL)
 #define PTE_WRITABLE                BIT(2, UL)
@@ -33,15 +38,72 @@
 #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
 #define PTE_TABLE                   (PTE_VALID)
 
+#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
 #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
+#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)
 
 #define PAGE_HYPERVISOR             PAGE_HYPERVISOR_RW
 
+
+/*
+ * There are no such bits in PTE format for RISC-V.
+ *
+ * The code doesn’t support super page shattering so at the moment superpages
+ * can't be used as a default so PTE_BLOCK is introduced to have ability to
+ * tell that superpage should be allocated.
+ * Additionaly as mentioed in RISC-V priviliged spec:
+ * ```
+ *  After much deliberation, we have settled on a conventional page size of
+ *  4 KiB for both RV32 and RV64. We expect this decision to ease the porting
+ *  of low-level runtime software and device drivers.
+ *
+ *  The TLB reach problem is ameliorated by transparent superpage support in
+ *  modern operating systems [2]. Additionally, multi-level TLB hierarchies
+ *  are quite inexpensive relative to the multi-level cache hierarchies whose
+ *  address space they map.
+ *
+ *  [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical,
+ *      transparent operating system support for superpages.
+ *      SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
+ * ```
+ *
+ * PTE_POPULATE is introduced to have ability to tell that page tables
+ * shoud be populated.
+ */
+#define PTE_BLOCK       BIT(10, UL)
+#define PTE_POPULATE    BIT(11, UL)
+
+#define PTE_R_MASK(x)   ((x) & PTE_READABLE)
+#define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
+#define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
+
+#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE))
+
 /* Calculate the offsets into the pagetables for a given VA */
 #define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
 
 #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) & VPN_MASK)
 
+#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U) << PAGETABLE_ORDER) - 1))
+
+#if RV_STAGE1_MODE > SATP_MODE_SV48
+#error "need to to update DECLARE_OFFSETS macros"
+#else
+
+#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
+#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
+#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
+
+/* Generate an array @var containing the offset for each level from @addr */
+#define DECLARE_OFFSETS(var, addr)          \
+    const unsigned int var[] = {            \
+        l0_table_offset(addr),              \
+        l1_table_offset(addr),              \
+        l2_table_offset(addr),              \
+    }
+
+#endif
+
 /* Page Table entry */
 typedef struct {
 #ifdef CONFIG_RISCV_64
@@ -67,6 +129,20 @@ static inline bool pte_is_valid(pte_t p)
     return p.pte & PTE_VALID;
 }
 
+inline bool pte_is_table(const pte_t p)
+{
+    return ((p.pte & (PTE_VALID |
+                      PTE_READABLE |
+                      PTE_WRITABLE |
+                      PTE_EXECUTABLE)) == PTE_VALID);
+}
+
+static inline bool pte_is_mapping(const pte_t p)
+{
+    return (p.pte & PTE_VALID) &&
+           ((p.pte & PTE_WRITABLE) || (p.pte & PTE_EXECUTABLE));
+}
+
 static inline void invalidate_icache(void)
 {
     BUG_ON("unimplemented");
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index 58abe5eccc..1a05d06597 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -164,6 +164,7 @@
 #define SSTATUS_SD			SSTATUS64_SD
 #define SATP_MODE			SATP64_MODE
 #define SATP_MODE_SHIFT			SATP64_MODE_SHIFT
+#define SATP_PPN_MASK			_ULL(0x00000FFFFFFFFFFF)
 
 #define HGATP_PPN			HGATP64_PPN
 #define HGATP_VMID_SHIFT		HGATP64_VMID_SHIFT
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index b8ff91cf4e..e8430def14 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -369,12 +369,3 @@ int destroy_xen_mappings(unsigned long s, unsigned long e)
     BUG_ON("unimplemented");
     return -1;
 }
-
-int map_pages_to_xen(unsigned long virt,
-                     mfn_t mfn,
-                     unsigned long nr_mfns,
-                     unsigned int flags)
-{
-    BUG_ON("unimplemented");
-    return -1;
-}
diff --git a/xen/arch/riscv/pt.c b/xen/arch/riscv/pt.c
new file mode 100644
index 0000000000..25f69c899b
--- /dev/null
+++ b/xen/arch/riscv/pt.c
@@ -0,0 +1,408 @@
+#include <xen/bug.h>
+#include <xen/domain_page.h>
+#include <xen/errno.h>
+#include <xen/mm.h>
+#include <xen/mm-frame.h>
+#include <xen/pmap.h>
+#include <xen/spinlock.h>
+
+#include <asm/flushtlb.h>
+#include <asm/page.h>
+
+static inline const mfn_t get_root_page(void)
+{
+    unsigned long root_maddr =
+        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
+
+    return maddr_to_mfn(root_maddr);
+}
+
+/*
+ * Sanity check of the entry
+ * mfn is not valid and we are not populating page table. This means
+ * we either modify entry or remove an entry.
+ */
+static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
+{
+    /* Sanity check when modifying an entry. */
+    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
+    {
+        /* We don't allow modifying an invalid entry. */
+        if ( !pte_is_valid(entry) )
+        {
+            printk("Modifying invalid entry is not allowed.\n");
+            return false;
+        }
+
+        /* We don't allow modifying a table entry */
+        if ( pte_is_table(entry) )
+        {
+            printk("Modifying a table entry is not allowed.\n");
+            return false;
+        }
+    }
+    /* Sanity check when inserting a mapping */
+    else if ( flags & PTE_VALID )
+    {
+        /* We should be here with a valid MFN. */
+        ASSERT(!mfn_eq(mfn, INVALID_MFN));
+
+        /*
+         * We don't allow replacing any valid entry.
+         *
+         * Note that the function xen_pt_update() relies on this
+         * assumption and will skip the TLB flush. The function will need
+         * to be updated if the check is relaxed.
+         */
+        if ( pte_is_valid(entry) )
+        {
+            if ( pte_is_mapping(entry) )
+                printk("Changing MFN for a valid entry is not allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
+                       mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
+            else
+                printk("Trying to replace a table with a mapping.\n");
+            return false;
+        }
+    }
+    /* Sanity check when removing a mapping. */
+    else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
+    {
+        /* We should be here with an invalid MFN. */
+        ASSERT(mfn_eq(mfn, INVALID_MFN));
+
+        /* We don't allow removing a table */
+        if ( pte_is_table(entry) )
+        {
+            printk("Removing a table is not allowed.\n");
+            return false;
+        }
+    }
+    /* Sanity check when populating the page-table. No check so far. */
+    else
+    {
+        ASSERT(flags & PTE_POPULATE);
+        /* We should be here with an invalid MFN */
+        ASSERT(mfn_eq(mfn, INVALID_MFN));
+    }
+
+    return true;
+}
+
+static pte_t *xen_map_table(mfn_t mfn)
+{
+    /*
+     * During early boot, map_domain_page() may be unusable. Use the
+     * PMAP to map temporarily a page-table.
+     */
+    if ( system_state == SYS_STATE_early_boot )
+        return pmap_map(mfn);
+
+    return map_domain_page(mfn);
+}
+
+static void xen_unmap_table(const pte_t *table)
+{
+    /*
+     * During early boot, xen_map_table() will not use map_domain_page()
+     * but the PMAP.
+     */
+    if ( system_state == SYS_STATE_early_boot )
+        pmap_unmap(table);
+    else
+        unmap_domain_page(table);
+}
+
+static int create_xen_table(pte_t *entry)
+{
+    mfn_t mfn;
+    void *p;
+    pte_t pte;
+
+    if ( system_state != SYS_STATE_early_boot )
+    {
+        struct page_info *pg = alloc_domheap_page(NULL, 0);
+
+        if ( pg == NULL )
+            return -ENOMEM;
+
+        mfn = page_to_mfn(pg);
+    }
+    else
+        mfn = alloc_boot_pages(1, 1);
+
+    p = xen_map_table(mfn);
+    clear_page(p);
+    xen_unmap_table(p);
+
+    pte = pte_from_mfn(mfn, PTE_TABLE);
+    write_pte(entry, pte);
+
+    return 0;
+}
+
+#define XEN_TABLE_MAP_FAILED 0
+#define XEN_TABLE_SUPER_PAGE 1
+#define XEN_TABLE_NORMAL 2
+
+/*
+ * Take the currently mapped table, find the corresponding entry,
+ * and map the next table, if available.
+ *
+ * The alloc_only parameters indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry
+ *  was empty, or allocating a new page failed.
+ *  XEN_TABLE_NORMAL: next level or leaf mapped normally
+ *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
+ */
+static int pt_next_level(bool alloc_only, pte_t **table, unsigned int offset)
+{
+    pte_t *entry;
+    int ret;
+    mfn_t mfn;
+
+    entry = *table + offset;
+
+    if ( !pte_is_valid(*entry) )
+    {
+        if ( alloc_only )
+            return XEN_TABLE_MAP_FAILED;
+
+        ret = create_xen_table(entry);
+        if ( ret )
+            return XEN_TABLE_MAP_FAILED;
+    }
+
+    if ( pte_is_mapping(*entry) )
+        return XEN_TABLE_SUPER_PAGE;
+
+    mfn = mfn_from_pte(*entry);
+
+    xen_unmap_table(*table);
+    *table = xen_map_table(mfn);
+
+    return XEN_TABLE_NORMAL;
+}
+
+/* Update an entry at the level @target. */
+static int pt_update_entry(mfn_t root, unsigned long virt,
+                           mfn_t mfn, unsigned int target,
+                           unsigned int flags)
+{
+    int rc;
+    unsigned int level = HYP_PT_ROOT_LEVEL;
+    pte_t *table;
+    /*
+     * The intermediate page tables are read-only when the MFN is not valid
+     * and we are not populating page table.
+     * This means we either modify permissions or remove an entry.
+     */
+    bool alloc_only = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE);
+    pte_t pte, *entry;
+
+    /* convenience aliases */
+    DECLARE_OFFSETS(offsets, virt);
+
+    table = xen_map_table(root);
+    for ( ; level > target; level-- )
+    {
+        rc = pt_next_level(alloc_only, &table, offsets[level]);
+        if ( rc == XEN_TABLE_MAP_FAILED )
+        {
+            rc = 0;
+
+            /*
+             * We are here because pt_next_level has failed to map
+             * the intermediate page table (e.g the table does not exist
+             * and the pt is read-only). It is a valid case when
+             * removing a mapping as it may not exist in the page table.
+             * In this case, just ignore it.
+             */
+            if ( flags & PTE_VALID )
+            {
+                printk("%s: Unable to map level %u\n", __func__, level);
+                rc = -ENOENT;
+            }
+
+            goto out;
+        }
+        else if ( rc != XEN_TABLE_NORMAL )
+            break;
+    }
+
+    if ( level != target )
+    {
+        printk("%s: Shattering superpage is not supported\n", __func__);
+        rc = -EOPNOTSUPP;
+        goto out;
+    }
+
+    entry = table + offsets[level];
+
+    rc = -EINVAL;
+    if ( !pt_check_entry(*entry, mfn, flags) )
+        goto out;
+
+    /* We are removing the page */
+    if ( !(flags & PTE_VALID) )
+        memset(&pte, 0x00, sizeof(pte));
+    else
+    {
+        /* We are inserting a mapping => Create new pte. */
+        if ( !mfn_eq(mfn, INVALID_MFN) )
+            pte = pte_from_mfn(mfn, PTE_VALID);
+        else /* We are updating the permission => Copy the current pte. */
+            pte = *entry;
+
+        /* update permission according to the flags */
+        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
+    }
+
+    write_pte(entry, pte);
+
+    rc = 0;
+
+ out:
+    xen_unmap_table(table);
+
+    return rc;
+}
+
+static DEFINE_SPINLOCK(xen_pt_lock);
+
+/* Return the level where mapping should be done */
+static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned long nr,
+                            unsigned int flags)
+{
+    unsigned int level = 0;
+    unsigned long mask;
+    unsigned int i;
+
+    /*
+     * Always use level 0 ( 4k mapping ) mapping unless the caller request
+     * block mapping.
+     */
+    if ( likely(!(flags & PTE_BLOCK)) )
+        return level;
+
+    /*
+     * Don't take into account the MFN when removing mapping (i.e
+     * MFN_INVALID) to calculate the correct target order.
+     *
+     * `vfn` and `mfn` must be both superpage aligned.
+     * They are or-ed together and then checked against the size of
+     * each level.
+     *
+     * `left` is not included and checked separately to allow
+     * superpage mapping even if it is not properly aligned (the
+     * user may have asked to map 2MB + 4k).
+     */
+    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
+    mask |= vfn;
+
+    for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- )
+    {
+        if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) &&
+             (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) )
+        {
+            level = i;
+            break;
+        }
+    }
+
+    return level;
+}
+
+/*
+ * If `mfn` equals `INVALID_MFN`, it indicates that the following page table
+ * update operation might be related to either populating the table,
+ * destroying a mapping, or modifying an existing mapping.
+ */
+static int pt_update(unsigned long virt,
+                     mfn_t mfn,
+                     unsigned long nr_mfns,
+                     unsigned int flags)
+{
+    int rc = 0;
+    unsigned long vfn = virt >> PAGE_SHIFT;
+    unsigned long left = nr_mfns;
+
+    const mfn_t root = get_root_page();
+
+    /*
+     * It is bad idea to have mapping both writeable and
+     * executable.
+     * When modifying/creating mapping (i.e PTE_VALID is set),
+     * prevent any update if this happen.
+     */
+    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) )
+    {
+        printk("Mappings should not be both Writeable and Executable.\n");
+        return -EINVAL;
+    }
+
+    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
+    {
+        printk("The virtual address is not aligned to the page-size.\n");
+        return -EINVAL;
+    }
+
+    spin_lock(&xen_pt_lock);
+
+    while ( left )
+    {
+        unsigned int order, level;
+
+        level = pt_mapping_level(vfn, mfn, left, flags);
+        order = XEN_PT_LEVEL_ORDER(level);
+
+        ASSERT(left >= BIT(order, UL));
+
+        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
+                                    flags);
+        if ( rc )
+            break;
+
+        vfn += 1U << order;
+        if ( !mfn_eq(mfn, INVALID_MFN) )
+            mfn = mfn_add(mfn, 1U << order);
+
+        left -= (1U << order);
+
+        if ( rc )
+            break;
+    }
+
+
+    /* ensure that PTEs are all updated before flushing */
+    RISCV_FENCE(rw, rw);
+
+    /*
+     * always flush TLB at the end of the function as non-present entries
+     * can be put in the TLB
+     */
+    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
+
+    spin_unlock(&xen_pt_lock);
+
+    return rc;
+}
+
+int map_pages_to_xen(unsigned long virt,
+                     mfn_t mfn,
+                     unsigned long nr_mfns,
+                     unsigned int flags)
+{
+    /*
+     * Ensure that we have a valid MFN before proceeding.
+     *
+     * If the MFN is invalid, pt_update() might misinterpret the operation,
+     * treating it as either a population, a mapping destruction,
+     * or a mapping modification.
+     */
+    ASSERT(!mfn_eq(mfn, INVALID_MFN));
+
+    return pt_update(virt, mfn, nr_mfns, flags);
+}
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 7/7] xen/riscv: introduce early_fdt_map()
  2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
                   ` (5 preceding siblings ...)
  2024-08-09 16:19 ` [PATCH v4 6/7] xen/riscv: page table handling Oleksii Kurochko
@ 2024-08-09 16:19 ` Oleksii Kurochko
  6 siblings, 0 replies; 39+ messages in thread
From: Oleksii Kurochko @ 2024-08-09 16:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Jan Beulich, Julien Grall, Stefano Stabellini

Introduce function which allows to map FDT to Xen.

Also, initialization of device_tree_flattened happens using
early_fdt_map().

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V4:
 - s/_PAGE_BLOCK/PTE_BLOCK
 - Add Acked-by: Jan Beulich <jbeulich@suse.com>
 - unwarap two lines in panic() in case when device_tree_flattened is NULL
   so  grep-ing for any part of the message line will always produce a hit.
 - slightly update the commit message.
---
Changes in V3:
 - Code style fixes
 - s/SZ_2M/MB(2)
 - fix condition to check if early_fdt_map() in setup.c return NULL or not.
---
Changes in V2:
 - rework early_fdt_map to use map_pages_to_xen()
 - move call early_fdt_map() to C code after MMU is enabled.
---
 xen/arch/riscv/include/asm/mm.h |  2 ++
 xen/arch/riscv/mm.c             | 55 +++++++++++++++++++++++++++++++++
 xen/arch/riscv/setup.c          |  7 +++++
 3 files changed, 64 insertions(+)

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index ce1557bb27..4b7b00b850 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -259,4 +259,6 @@ static inline unsigned int arch_get_dma_bitsize(void)
 
 void setup_fixmap_mappings(void);
 
+void *early_fdt_map(paddr_t fdt_paddr);
+
 #endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index e8430def14..df4d22c112 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -1,13 +1,16 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 
+#include <xen/bootfdt.h>
 #include <xen/bug.h>
 #include <xen/compiler.h>
 #include <xen/init.h>
 #include <xen/kernel.h>
+#include <xen/libfdt/libfdt.h>
 #include <xen/macros.h>
 #include <xen/mm.h>
 #include <xen/pfn.h>
 #include <xen/sections.h>
+#include <xen/sizes.h>
 
 #include <asm/early_printk.h>
 #include <asm/csr.h>
@@ -369,3 +372,55 @@ int destroy_xen_mappings(unsigned long s, unsigned long e)
     BUG_ON("unimplemented");
     return -1;
 }
+
+void * __init early_fdt_map(paddr_t fdt_paddr)
+{
+    /* We are using 2MB superpage for mapping the FDT */
+    paddr_t base_paddr = fdt_paddr & XEN_PT_LEVEL_MAP_MASK(1);
+    paddr_t offset;
+    void *fdt_virt;
+    uint32_t size;
+    int rc;
+
+    /*
+     * Check whether the physical FDT address is set and meets the minimum
+     * alignment requirement. Since we are relying on MIN_FDT_ALIGN to be at
+     * least 8 bytes so that we always access the magic and size fields
+     * of the FDT header after mapping the first chunk, double check if
+     * that is indeed the case.
+     */
+    BUILD_BUG_ON(MIN_FDT_ALIGN < 8);
+    if ( !fdt_paddr || fdt_paddr % MIN_FDT_ALIGN )
+        return NULL;
+
+    /* The FDT is mapped using 2MB superpage */
+    BUILD_BUG_ON(BOOT_FDT_VIRT_START % MB(2));
+
+    rc = map_pages_to_xen(BOOT_FDT_VIRT_START, maddr_to_mfn(base_paddr),
+                          MB(2) >> PAGE_SHIFT,
+                          PAGE_HYPERVISOR_RO | PTE_BLOCK);
+    if ( rc )
+        panic("Unable to map the device-tree.\n");
+
+    offset = fdt_paddr % XEN_PT_LEVEL_SIZE(1);
+    fdt_virt = (void *)BOOT_FDT_VIRT_START + offset;
+
+    if ( fdt_magic(fdt_virt) != FDT_MAGIC )
+        return NULL;
+
+    size = fdt_totalsize(fdt_virt);
+    if ( size > BOOT_FDT_VIRT_SIZE )
+        return NULL;
+
+    if ( (offset + size) > MB(2) )
+    {
+        rc = map_pages_to_xen(BOOT_FDT_VIRT_START + MB(2),
+                              maddr_to_mfn(base_paddr + MB(2)),
+                              MB(2) >> PAGE_SHIFT,
+                              PAGE_HYPERVISOR_RO | PTE_BLOCK);
+        if ( rc )
+            panic("Unable to map the device-tree\n");
+    }
+
+    return fdt_virt;
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index a49a8eee90..f7221a80d5 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -2,6 +2,7 @@
 
 #include <xen/bug.h>
 #include <xen/compile.h>
+#include <xen/device_tree.h>
 #include <xen/init.h>
 #include <xen/mm.h>
 
@@ -65,6 +66,12 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
     setup_fixmap_mappings();
 
+    device_tree_flattened = early_fdt_map(dtb_addr);
+    if ( !device_tree_flattened )
+        panic("Invalid device tree blob at physical address %#lx. The DTB must be 8-byte aligned and must not exceed %lld bytes in size.\n\n"
+              "Please check your bootloader.\n",
+              dtb_addr, BOOT_FDT_VIRT_SIZE);
+
     printk("All set up\n");
 
     for ( ;; )
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 2/7] xen/riscv: set up fixmap mappings
  2024-08-09 16:19 ` [PATCH v4 2/7] xen/riscv: set up fixmap mappings Oleksii Kurochko
@ 2024-08-13  8:22   ` Jan Beulich
  2024-08-14 14:21     ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-13  8:22 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 09.08.2024 18:19, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/config.h
> +++ b/xen/arch/riscv/include/asm/config.h
> @@ -74,6 +74,14 @@
>  #error "unsupported RV_STAGE1_MODE"
>  #endif
>  
> +#define XEN_VIRT_SIZE           MB(2)
> +
> +#define BOOT_FDT_VIRT_START     (XEN_VIRT_START + XEN_VIRT_SIZE)
> +#define BOOT_FDT_VIRT_SIZE      MB(4)
> +
> +#define FIXMAP_BASE             (BOOT_FDT_VIRT_START + BOOT_FDT_VIRT_SIZE)
> +#define FIXMAP_ADDR(n)          (FIXMAP_BASE + (n) * PAGE_SIZE)

Just to confirm: Did you consider leaving gaps between the regions, but
then discarded that idea for whatever reason? It's not like you're tight
on address space.

As to FIXMAP_ADDR() - wouldn't that be a better fit in fixmap.h?

> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/fixmap.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * fixmap.h: compile-time virtual memory allocation
> + */
> +#ifndef ASM_FIXMAP_H
> +#define ASM_FIXMAP_H
> +
> +#include <xen/bug.h>
> +#include <xen/page-size.h>
> +#include <xen/pmap.h>
> +
> +#include <asm/page.h>
> +
> +/* Fixmap slots */
> +#define FIX_PMAP_BEGIN (0) /* Start of PMAP */
> +#define FIX_PMAP_END (FIX_PMAP_BEGIN + NUM_FIX_PMAP - 1) /* End of PMAP */
> +#define FIX_MISC (FIX_PMAP_END + 1)  /* Ephemeral mappings of hardware */
> +
> +#define FIX_LAST (FIX_MISC + 1) /* +1 means a guard slot */

As per my comment on the earlier version: No, I don't think this arranges
for a guard slot. It merely arranges for FIX_MISC to not trigger the
BUG_ON() in virt_to_fix().

> --- a/xen/arch/riscv/include/asm/page.h
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -81,6 +81,12 @@ static inline void flush_page_to_ram(unsigned long mfn, bool sync_icache)
>      BUG_ON("unimplemented");
>  }
>  
> +/* Write a pagetable entry. */
> +static inline void write_pte(pte_t *p, pte_t pte)
> +{
> +    *p = pte;
> +}

No use of write_atomic() here? And no read_pte() counterpart right away (then
also properly using read_atomic())?

> @@ -191,6 +195,45 @@ static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
>      return is_mode_supported;
>  }
>  
> +void __init setup_fixmap_mappings(void)
> +{
> +    pte_t *pte, tmp;
> +    unsigned int i;
> +
> +    BUILD_BUG_ON(FIX_LAST >= PAGETABLE_ENTRIES);
> +
> +    pte = &stage1_pgtbl_root[pt_index(HYP_PT_ROOT_LEVEL, FIXMAP_ADDR(0))];
> +
> +    /*
> +     * In RISC-V page table levels are numbered from Lx to L0 where
> +     * x is the highest page table level for currect  MMU mode ( for example,
> +     * for Sv39 has 3 page tables so the x = 2 (L2 -> L1 -> L0) ).
> +     *
> +     * In this cycle we want to find L1 page table because as L0 page table
> +     * xen_fixmap[] will be used.
> +     */
> +    for ( i = HYP_PT_ROOT_LEVEL; i-- > 1; )
> +    {
> +        BUG_ON(!pte_is_valid(*pte));
> +
> +        pte = (pte_t *)LOAD_TO_LINK(pte_to_paddr(*pte));
> +        pte = &pte[pt_index(i, FIXMAP_ADDR(0))];
> +    }
> +
> +    BUG_ON(pte_is_valid(*pte));
> +
> +    tmp = paddr_to_pte(LINK_TO_LOAD((unsigned long)&xen_fixmap), PTE_TABLE);
> +    write_pte(pte, tmp);
> +
> +    RISCV_FENCE(rw, rw);

In the changes section you say "r,r", and I was wondering about that. I
take it that "rw,rw" is really what's needed here.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/7] xen/riscv: introduce asm/pmap.h header
  2024-08-09 16:19 ` [PATCH v4 3/7] xen/riscv: introduce asm/pmap.h header Oleksii Kurochko
@ 2024-08-13  8:41   ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2024-08-13  8:41 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 09.08.2024 18:19, Oleksii Kurochko wrote:
> Introduce arch_pmap_{un}map functions and select HAS_PMAP for CONFIG_RISCV.
> 
> Add pte_from_mfn() for use in arch_pmap_map().
> 
> Introduce flush_xen_tlb_one_local() and use it in arch_pmap_{un}map().
> 
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> ---
> Changes in V4:
>  - mark arch_pmap_{un}map() as __init: documentation purpose and
>    a necessary (but not sufficient) condition here, to validly
>    use local TLB flushes only.
>  - add flush_xen_tlb_one_local() to arch_pmap_map() as absense of
>    "negative" TLB entrues will be guaranted only in the case
>    when Svvptc extension is present.
>  - s/mfn_from_pte/pte_from_mfn
>  - drop flush_xen_tlb_range_va_local() as it isn't used in this patch

Just as a remark: This and ...

>  - drop mfn_to_xen_entry() as pte_from_mfn() does the same thing
>  - add flags argument to pte_from_mfn().
>  - update the commit message.
>  - s/flush_xen_tlb_range_va_local/flush_tlb_range_va_local

... this don't go together very well.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-09 16:19 ` [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info Oleksii Kurochko
@ 2024-08-13  8:54   ` Jan Beulich
  2024-08-14 14:45     ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-13  8:54 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 09.08.2024 18:19, Oleksii Kurochko wrote:
> Introduce struct pcpu_info to store pCPU-related information.
> Initially, it includes only processor_id, but it will be extended
> to include guest CPU information and temporary variables for
> saving/restoring vCPU registers.
> 
> Add set_processor_id() and get_processor_id() functions to set
> and retrieve the processor_id stored in pcpu_info.
> 
> Introduce cpuid_to_hartid_map[NR_CPUS] to map Xen logical CPUs to
> hart IDs (physical CPU IDs). Add auxiliary macros cpuid_to_hartid()
> for convenient access to this mapping.
> 
> Define smp_processor_id() to provide accurate information,
> replacing the previous "dummy" value of 0.
> 
> Initialize tp registers to point to pcpu_info[0].
> Set processor_id to 0 for logical CPU 0 and store the physical CPU ID
> for the current logical CPU in cpuid_to_hartid_map[].
> 
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> ---
> Changes in V4:
>  - wrap id with () inside set_processor_id().
>  - code style fixes
>  - update BUG_ON(id > NR_CPUS) in smp_processor_id() and drop the comment
>    above BUG_ON().
>  - s/__cpuid_to_hartid_map/cpuid_to_hartid_map
>  - s/cpuid_to_hartid_map/cpuid_to_harti ( here cpuid_to_hartid_map is the name
>    of the macros ).
>  - update the commit message above the code of TP register initialization in
>    start_xen().

I guess that's once again "comment", not "commit message"?

> --- a/xen/arch/riscv/include/asm/processor.h
> +++ b/xen/arch/riscv/include/asm/processor.h
> @@ -12,8 +12,32 @@
>  
>  #ifndef __ASSEMBLY__
>  
> -/* TODO: need to be implemeted */
> -#define smp_processor_id() 0
> +#include <xen/bug.h>
> +
> +register struct pcpu_info *tp asm ("tp");
> +
> +struct pcpu_info {
> +    unsigned int processor_id;
> +};
> +
> +/* tp points to one of these */
> +extern struct pcpu_info pcpu_info[NR_CPUS];
> +
> +#define get_processor_id()      (tp->processor_id)
> +#define set_processor_id(id)    do { \
> +    tp->processor_id = (id);         \
> +} while (0)
> +
> +static inline unsigned int smp_processor_id(void)
> +{
> +    unsigned int id;
> +
> +    id = get_processor_id();
> +
> +    BUG_ON(id > NR_CPUS);

>= ?

> --- a/xen/arch/riscv/include/asm/smp.h
> +++ b/xen/arch/riscv/include/asm/smp.h
> @@ -5,6 +5,8 @@
>  #include <xen/cpumask.h>
>  #include <xen/percpu.h>
>  
> +#define INVALID_HARTID UINT_MAX
> +
>  DECLARE_PER_CPU(cpumask_var_t, cpu_sibling_mask);
>  DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
>  
> @@ -14,6 +16,14 @@ DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
>   */
>  #define park_offline_cpus false
>  
> +void smp_set_bootcpu_id(unsigned long boot_cpu_hartid);
> +
> +/*
> + * Mapping between linux logical cpu index and hartid.
> + */
> +extern unsigned long cpuid_to_hartid_map[NR_CPUS];
> +#define cpuid_to_hartid(cpu) cpuid_to_hartid_map[cpu]

How wide can hart IDs be? Wider than 32 bits? If not, why unsigned long?
If so, what about RV32 (which generally you at least try cover where
easily possible)?

Is there a reason this needs to be a separate array, rather than being
part of struct pcpu_info?

> @@ -40,6 +41,19 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>  {
>      remove_identity_mapping();
>  
> +    /*
> +     * tp register contains an address of physical cpu information.
> +     * So write physical CPU info of boot cpu to tp register
> +     * It will be used later by get_processor_id() ( look at
> +     * <asm/processor.h> ):
> +     *   #define get_processor_id()    (tp->processor_id)
> +     */
> +    asm volatile ( "mv tp, %0" : : "r"((unsigned long)&pcpu_info[0]) );

Perhaps be on the safe side and also add a memory barrier?

Perhaps be on the safe side and do this absolutely first in the function,
or even in assembly (such that initializers of future variables declared
at the top of the function end up safe, too)?

Also nit: Blank please between closing quote and opening parenthesis.
(Otoh you could omit the blank between the two colons.)

> --- /dev/null
> +++ b/xen/arch/riscv/smp.c
> @@ -0,0 +1,4 @@
> +#include <xen/smp.h>
> +
> +/* tp points to one of these per cpu */
> +struct pcpu_info pcpu_info[NR_CPUS];

And they all need setting up statically? Is there a plan to make this
dynamic (which could be recorded in a "fixme" in the comment)?

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-09 16:19 ` [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension Oleksii Kurochko
@ 2024-08-13  9:34   ` Jan Beulich
  2024-08-14 15:41     ` oleksii.kurochko
  2024-08-16 12:06     ` oleksii.kurochko
  0 siblings, 2 replies; 39+ messages in thread
From: Jan Beulich @ 2024-08-13  9:34 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 09.08.2024 18:19, Oleksii Kurochko wrote:
> @@ -31,4 +65,34 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
>   */
>  void sbi_console_putchar(int ch);
>  
> +/*
> + * Check underlying SBI implementation has RFENCE
> + *
> + * @return 1 for supported AND 0 for not-supported
> + */
> +bool sbi_has_rfence(void);

Nit: With bool return type, true and false in the comment?

> +/*
> + * Instructs the remote harts to execute one or more SFENCE.VMA
> + * instructions, covering the range of virtual addresses between
> + * start_addr and start_addr + size.

It's relatively clear what is meant here, but maybe still better to
make things explicit by using a mathematical range:
[start_addr, start_addr + size).

> + * Returns 0 if IPI was sent to all the targeted harts successfully
> + * or negative value if start_addr or size is not valid.
> + *
> + * @hart_mask a cpu mask containing all the target harts.
> + * @param start virtual address start
> + * @param size virtual address range size
> + */
> +int sbi_remote_sfence_vma(cpumask_t *cpu_mask,

pointer-to-const?

> --- a/xen/arch/riscv/sbi.c
> +++ b/xen/arch/riscv/sbi.c
> @@ -7,11 +7,23 @@
>   * Modified by Bobby Eshleman (bobby.eshleman@gmail.com).
>   *
>   * Copyright (c) 2019 Western Digital Corporation or its affiliates.
> - * Copyright (c) 2021-2023 Vates SAS.
> + * Copyright (c) 2021-2024 Vates SAS.
>   */
>  
> +#include <xen/compiler.h>
> +#include <xen/const.h>
> +#include <xen/cpumask.h>
> +#include <xen/errno.h>
> +#include <xen/init.h>
> +#include <xen/lib.h>
> +#include <xen/smp.h>
> +
> +#include <asm/processor.h>
>  #include <asm/sbi.h>
>  
> +static unsigned long sbi_spec_version = SBI_SPEC_VERSION_DEFAULT;
> +static unsigned long sbi_fw_id, sbi_fw_version;

__ro_after_init for perhaps all three of them?

Considering SBI_SPEC_VERSION_{MAJOR,MINOR}_MASK, at least the first
of them also doesn't need to be unsigned long, but could be unsigned
int?

> @@ -38,7 +50,245 @@ struct sbiret sbi_ecall(unsigned long ext, unsigned long fid,
>      return ret;
>  }
>  
> +static int sbi_err_map_xen_errno(int err)
> +{
> +    switch ( err )
> +    {
> +    case SBI_SUCCESS:
> +        return 0;
> +    case SBI_ERR_DENIED:
> +        return -EACCES;
> +    case SBI_ERR_INVALID_PARAM:
> +        return -EINVAL;
> +    case SBI_ERR_INVALID_ADDRESS:
> +        return -EFAULT;
> +    case SBI_ERR_NOT_SUPPORTED:
> +        return -EOPNOTSUPP;
> +    case SBI_ERR_FAILURE:
> +        fallthrough;
> +    default:
> +        return -ENOSYS;
> +    };
> +}
> +
>  void sbi_console_putchar(int ch)
>  {
>      sbi_ecall(SBI_EXT_0_1_CONSOLE_PUTCHAR, 0, ch, 0, 0, 0, 0, 0);
>  }
> +
> +static unsigned long sbi_major_version(void)
> +{
> +    return (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) &
> +        SBI_SPEC_VERSION_MAJOR_MASK;
> +}

Can you use MASK_EXTR() here, allowing to even drop the separate
SBI_SPEC_VERSION_MAJOR_SHIFT?

> +static unsigned long sbi_minor_version(void)
> +{
> +    return sbi_spec_version & SBI_SPEC_VERSION_MINOR_MASK;
> +}

For consistency here then, too.

> +static long sbi_ext_base_func(long fid)
> +{
> +    struct sbiret ret;
> +
> +    ret = sbi_ecall(SBI_EXT_BASE, fid, 0, 0, 0, 0, 0, 0);
> +    if ( !ret.error )
> +        return ret.value;
> +    else
> +        return ret.error;

With the folding of two distinct values, how is the caller going to
tell failure from success?

> +}
> +
> +static int __sbi_rfence_v02_real(unsigned long fid,

Are the double leading underscores really necessary here? (Same again
further down.)

> +                                 unsigned long hmask, unsigned long hbase,
> +                                 unsigned long start, unsigned long size,
> +                                 unsigned long arg4)
> +{
> +    struct sbiret ret = {0};
> +    int result = 0;
> +
> +    switch ( fid )
> +    {
> +    case SBI_EXT_RFENCE_REMOTE_FENCE_I:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                0, 0, 0, 0);
> +        break;
> +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                start, size, 0, 0);
> +        break;
> +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                start, size, arg4, 0);
> +        break;
> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                start, size, 0, 0);
> +        break;

How is e.g. this different from SBI_EXT_RFENCE_REMOTE_SFENCE_VMA
handling? To ease recognition, I think you want to fold cases with
identical handling.

> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                start, size, arg4, 0);
> +        break;
> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                start, size, 0, 0);
> +        break;
> +    case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID:
> +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> +                start, size, arg4, 0);
> +        break;
> +
> +    default:

Nit: Just like you have it here, blank lines please between any non-
fallthrough case blocks.

> +        printk("%s: unknown function ID [%lu]\n",
> +               __func__, fid);
> +        result = -EINVAL;
> +        break;
> +    };
> +
> +    if ( ret.error )
> +    {
> +        result = sbi_err_map_xen_errno(ret.error);
> +        printk("%s: hbase=%lu hmask=%#lx failed (error %d)\n",
> +               __func__, hbase, hmask, result);
> +    }
> +
> +    return result;
> +}
> +
> +static int __sbi_rfence_v02(unsigned long fid,

This has its address taken for storing in a function pointer. I'd like
to suggest that from the very beginning you mark such functions cf_check.

> +                            const cpumask_t *cpu_mask,
> +                            unsigned long start, unsigned long size,
> +                            unsigned long arg4, unsigned long arg5)
> +{
> +    unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
> +    int result;
> +
> +    /*
> +     * hart_mask_base can be set to -1 to indicate that hart_mask can be
> +     * ignored and all available harts must be considered.
> +     */
> +    if ( !cpu_mask )
> +        return __sbi_rfence_v02_real(fid, 0UL, -1UL, start, size, arg4);
> +
> +    for_each_cpu( cpuid, cpu_mask )

Nit: Either

    for_each_cpu ( cpuid, cpu_mask )

(if you deem for_each_cpu a pseudo-keyword) or

    for_each_cpu(cpuid, cpu_mask)

(if you don't).

> +    {
> +        hartid = cpuid_to_hartid(cpuid);
> +        if ( hmask )
> +        {
> +            if ( hartid + BITS_PER_LONG <= htop ||
> +                 hbase + BITS_PER_LONG <= hartid )
> +            {
> +                result = __sbi_rfence_v02_real(fid, hmask, hbase,
> +                                               start, size, arg4);
> +                if ( result )
> +                    return result;
> +                hmask = 0;
> +            }
> +            else if ( hartid < hbase )
> +            {
> +                /* shift the mask to fit lower hartid */
> +                hmask <<= hbase - hartid;
> +                hbase = hartid;
> +            }
> +        }
> +
> +        if ( !hmask )
> +        {
> +            hbase = hartid;
> +            htop = hartid;
> +        } else if ( hartid > htop )

Nit: Closing brace on its own line please.

> +            htop = hartid;
> +
> +        hmask |= BIT(hartid - hbase, UL);
> +    }

I can see how you will successfully batch for certain configurations
this way. When hart ID mapping is something like (0,64,1,65,2,66,...)
you won't batch at all though. Which may be fine at least for now, but
then I think a comment wants to state what is or is not intended to be
covered. It is only this way that people will know that certain
shortcomings aren't bugs.

> +    if ( hmask )
> +    {
> +        result = __sbi_rfence_v02_real(fid, hmask, hbase,
> +                                       start, size, arg4);
> +        if ( result )
> +            return result;
> +    }
> +
> +    return 0;
> +}
> +
> +static int (*__sbi_rfence)(unsigned long fid,
> +                           const cpumask_t *cpu_mask,
> +                           unsigned long start, unsigned long size,
> +                           unsigned long arg4, unsigned long arg5) = NULL;

__ro_after_init and no need for an initializer.

> +int sbi_remote_sfence_vma(cpumask_t *cpu_mask,
> +                          unsigned long start_addr,
> +                          unsigned long size)
> +{
> +    return __sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA,
> +                        cpu_mask, start_addr, size, 0, 0);

No check (not even an assertion) here for __sbi_rfence still being NULL?

> +int __init sbi_init(void)
> +{
> +    int ret;
> +
> +    ret = sbi_get_spec_version();
> +    if ( ret > 0 )
> +        sbi_spec_version = ret;

Truncation from long to int is not an issue here?

> +    printk("SBI specification v%lu.%lu detected\n",
> +            sbi_major_version(), sbi_minor_version());
> +
> +    if ( !sbi_spec_is_0_1() )
> +    {
> +        sbi_fw_id = sbi_get_firmware_id();
> +        sbi_fw_version = sbi_get_firmware_version();

These cannot return errors?

> +        printk("SBI implementation ID=%#lx Version=%#lx\n",
> +            sbi_fw_id, sbi_fw_version);
> +
> +        if ( sbi_probe_extension(SBI_EXT_RFENCE) > 0 )
> +        {
> +            __sbi_rfence = __sbi_rfence_v02;
> +            printk("SBI v0.2 RFENCE extension detected\n");
> +        }
> +    } else {

Nit: Style (and I think I said so before).

> +        BUG_ON("Ooops. SBI spec veriosn 0.1 detected. Need to add support");

Besides the typo (version) here and ...

> +    }
> +
> +    if ( !sbi_has_rfence() )
> +    {
> +        BUG_ON("At the moment flush_xen_tlb_range_va() uses SBI rfence to "
> +               "flush TLB for all CPUS!");

... here better panic()? A call trace doesn't really add any value for these.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-09 16:19 ` [PATCH v4 6/7] xen/riscv: page table handling Oleksii Kurochko
@ 2024-08-13 10:31   ` Jan Beulich
  2024-08-14 16:50     ` oleksii.kurochko
  2024-08-20 13:18     ` oleksii.kurochko
  0 siblings, 2 replies; 39+ messages in thread
From: Jan Beulich @ 2024-08-13 10:31 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 09.08.2024 18:19, Oleksii Kurochko wrote:
> Implement map_pages_to_xen() which requires several
> functions to manage page tables and entries:
> - pt_update()
> - pt_mapping_level()
> - pt_update_entry()
> - pt_next_level()
> - pt_check_entry()
> 
> To support these operations, add functions for creating,
> mapping, and unmapping Xen tables:
> - create_xen_table()
> - xen_map_table()
> - xen_unmap_table()

I think I commented on this before: Everything is "Xen" in hypervisor
code. What I think you mean is to map/unmap Xen's own page tables.
Naming-wise that would be {,un}map_xen_table(), though. Since they
are static, just {,un}map_table() ought to be unambiguous, too.

> Introduce internal macros starting with PTE_* for convenience.
> These macros closely resemble PTE bits, with the exception of
> PTE_BLOCK, which indicates that a page larger than 4KB is
> needed.

I did comment on this too, iirc: Is there going to be any case where
large pages are going to be "needed", i.e. not just preferred? If not,
giving the caller control over things the other way around (requesting
4k mappings are needed, as we have it in x86) may be preferable.

Hmm, but then ...

> RISC-V detects superpages using `pte.x` and `pte.r`, as there
> is no specific bit in the PTE for this purpose. From the RISC-V spec:
> ```
>   ...
>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 5.
>      Otherwise, this PTE is a pointer to the next level of the page table.
>      ... .
>   5. A leaf PTE has been found.
>      ...
>   ...
> ```
> 
> The code doesn’t support super page shattering so 4KB pages are used as
> default.

... you have this. Yet still callers expecting re-mapping in the (large)
range they map can request small-page mappings right away.

> --- a/xen/arch/riscv/include/asm/flushtlb.h
> +++ b/xen/arch/riscv/include/asm/flushtlb.h
> @@ -5,12 +5,25 @@
>  #include <xen/bug.h>
>  #include <xen/cpumask.h>
>  
> +#include <asm/sbi.h>
> +
>  /* Flush TLB of local processor for address va. */
>  static inline void flush_tlb_one_local(vaddr_t va)
>  {
>      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
>  }
>  
> +/*
> + * Flush a range of VA's hypervisor mappings from the TLB of all
> + * processors in the inner-shareable domain.
> + */
> +static inline void flush_tlb_range_va(vaddr_t va,
> +                                      size_t size)

No need for line wrapping here?

> @@ -33,15 +38,72 @@
>  #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
>  #define PTE_TABLE                   (PTE_VALID)
>  
> +#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
>  #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
> +#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)
>  
>  #define PAGE_HYPERVISOR             PAGE_HYPERVISOR_RW
>  
> +
> +/*

Nit: As before, no double blank lines please.

> + * There are no such bits in PTE format for RISC-V.

This is an odd way to start a comment: There's nothing for "such" to refer
to.

> + * The code doesn’t support super page shattering so at the moment superpages
> + * can't be used as a default so PTE_BLOCK is introduced to have ability to
> + * tell that superpage should be allocated.
> + * Additionaly as mentioed in RISC-V priviliged spec:
> + * ```
> + *  After much deliberation, we have settled on a conventional page size of
> + *  4 KiB for both RV32 and RV64. We expect this decision to ease the porting
> + *  of low-level runtime software and device drivers.
> + *
> + *  The TLB reach problem is ameliorated by transparent superpage support in
> + *  modern operating systems [2]. Additionally, multi-level TLB hierarchies
> + *  are quite inexpensive relative to the multi-level cache hierarchies whose
> + *  address space they map.
> + *
> + *  [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical,
> + *      transparent operating system support for superpages.
> + *      SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
> + * ```
> + *
> + * PTE_POPULATE is introduced to have ability to tell that page tables
> + * shoud be populated.
> + */
> +#define PTE_BLOCK       BIT(10, UL)
> +#define PTE_POPULATE    BIT(11, UL)
> +
> +#define PTE_R_MASK(x)   ((x) & PTE_READABLE)
> +#define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
> +#define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
> +
> +#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE))
> +
>  /* Calculate the offsets into the pagetables for a given VA */
>  #define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
>  
>  #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) & VPN_MASK)
>  
> +#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U) << PAGETABLE_ORDER) - 1))

Not: Too long line.

> +#if RV_STAGE1_MODE > SATP_MODE_SV48

SV48? Isn't ...

> +#error "need to to update DECLARE_OFFSETS macros"
> +#else
> +
> +#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
> +#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
> +#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
> +
> +/* Generate an array @var containing the offset for each level from @addr */
> +#define DECLARE_OFFSETS(var, addr)          \
> +    const unsigned int var[] = {            \
> +        l0_table_offset(addr),              \
> +        l1_table_offset(addr),              \
> +        l2_table_offset(addr),              \
> +    }

... this for SV39?

> @@ -67,6 +129,20 @@ static inline bool pte_is_valid(pte_t p)
>      return p.pte & PTE_VALID;
>  }
>  
> +inline bool pte_is_table(const pte_t p)
> +{
> +    return ((p.pte & (PTE_VALID |
> +                      PTE_READABLE |
> +                      PTE_WRITABLE |
> +                      PTE_EXECUTABLE)) == PTE_VALID);
> +}
> +
> +static inline bool pte_is_mapping(const pte_t p)
> +{
> +    return (p.pte & PTE_VALID) &&
> +           ((p.pte & PTE_WRITABLE) || (p.pte & PTE_EXECUTABLE));

Shorter as (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)) ?

> --- /dev/null
> +++ b/xen/arch/riscv/pt.c
> @@ -0,0 +1,408 @@
> +#include <xen/bug.h>
> +#include <xen/domain_page.h>
> +#include <xen/errno.h>
> +#include <xen/mm.h>
> +#include <xen/mm-frame.h>
> +#include <xen/pmap.h>
> +#include <xen/spinlock.h>
> +
> +#include <asm/flushtlb.h>
> +#include <asm/page.h>
> +
> +static inline const mfn_t get_root_page(void)
> +{
> +    unsigned long root_maddr =

maddr_t or paddr_t?

> +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> +
> +    return maddr_to_mfn(root_maddr);
> +}
> +
> +/*
> + * Sanity check of the entry
> + * mfn is not valid and we are not populating page table. This means

How does this fit with ...

> + * we either modify entry or remove an entry.
> + */
> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
> +{
> +    /* Sanity check when modifying an entry. */
> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )

... the MFN check here? And why is (valid,INVALID_MFN) an indication
of a modification? But then ...

> +    {
> +        /* We don't allow modifying an invalid entry. */
> +        if ( !pte_is_valid(entry) )
> +        {
> +            printk("Modifying invalid entry is not allowed.\n");
> +            return false;
> +        }

... I also don't understand what this is about. IOW I'm afraid I'm
still confused about the purpose of this function as well as the
transitions you want to permit / reject. I wonder whether the
flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.

> +/* Update an entry at the level @target. */
> +static int pt_update_entry(mfn_t root, unsigned long virt,
> +                           mfn_t mfn, unsigned int target,
> +                           unsigned int flags)
> +{
> +    int rc;
> +    unsigned int level = HYP_PT_ROOT_LEVEL;
> +    pte_t *table;
> +    /*
> +     * The intermediate page tables are read-only when the MFN is not valid
> +     * and we are not populating page table.

The way flags are handled in PTEs, how can page tables be read-only?

> +     * This means we either modify permissions or remove an entry.

From all I can determine we also get here when making brand new entries.
What am I overlooking?

> +     */
> +    bool alloc_only = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE);
> +    pte_t pte, *entry;
> +
> +    /* convenience aliases */
> +    DECLARE_OFFSETS(offsets, virt);
> +
> +    table = xen_map_table(root);
> +    for ( ; level > target; level-- )
> +    {
> +        rc = pt_next_level(alloc_only, &table, offsets[level]);
> +        if ( rc == XEN_TABLE_MAP_FAILED )
> +        {
> +            rc = 0;
> +
> +            /*
> +             * We are here because pt_next_level has failed to map
> +             * the intermediate page table (e.g the table does not exist
> +             * and the pt is read-only). It is a valid case when
> +             * removing a mapping as it may not exist in the page table.
> +             * In this case, just ignore it.
> +             */
> +            if ( flags & PTE_VALID )
> +            {
> +                printk("%s: Unable to map level %u\n", __func__, level);
> +                rc = -ENOENT;
> +            }
> +
> +            goto out;
> +        }
> +        else if ( rc != XEN_TABLE_NORMAL )
> +            break;
> +    }
> +
> +    if ( level != target )
> +    {
> +        printk("%s: Shattering superpage is not supported\n", __func__);
> +        rc = -EOPNOTSUPP;
> +        goto out;
> +    }
> +
> +    entry = table + offsets[level];
> +
> +    rc = -EINVAL;
> +    if ( !pt_check_entry(*entry, mfn, flags) )
> +        goto out;
> +
> +    /* We are removing the page */
> +    if ( !(flags & PTE_VALID) )
> +        memset(&pte, 0x00, sizeof(pte));
> +    else
> +    {
> +        /* We are inserting a mapping => Create new pte. */
> +        if ( !mfn_eq(mfn, INVALID_MFN) )
> +            pte = pte_from_mfn(mfn, PTE_VALID);
> +        else /* We are updating the permission => Copy the current pte. */
> +            pte = *entry;
> +
> +        /* update permission according to the flags */
> +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
> +    }
> +
> +    write_pte(entry, pte);
> +
> +    rc = 0;
> +
> + out:
> +    xen_unmap_table(table);
> +
> +    return rc;
> +}
> +
> +static DEFINE_SPINLOCK(xen_pt_lock);

If you put this in the middle of the file (which is fine), I think it
wants putting immediately ahead of the (first) function using it, not
at some seemingly random place.

> +/*
> + * If `mfn` equals `INVALID_MFN`, it indicates that the following page table
> + * update operation might be related to either populating the table,
> + * destroying a mapping, or modifying an existing mapping.
> + */
> +static int pt_update(unsigned long virt,
> +                     mfn_t mfn,
> +                     unsigned long nr_mfns,
> +                     unsigned int flags)
> +{
> +    int rc = 0;
> +    unsigned long vfn = virt >> PAGE_SHIFT;
> +    unsigned long left = nr_mfns;
> +
> +    const mfn_t root = get_root_page();
> +
> +    /*
> +     * It is bad idea to have mapping both writeable and
> +     * executable.
> +     * When modifying/creating mapping (i.e PTE_VALID is set),
> +     * prevent any update if this happen.
> +     */
> +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) )
> +    {
> +        printk("Mappings should not be both Writeable and Executable.\n");
> +        return -EINVAL;
> +    }
> +
> +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> +    {
> +        printk("The virtual address is not aligned to the page-size.\n");
> +        return -EINVAL;
> +    }
> +
> +    spin_lock(&xen_pt_lock);
> +
> +    while ( left )
> +    {
> +        unsigned int order, level;
> +
> +        level = pt_mapping_level(vfn, mfn, left, flags);
> +        order = XEN_PT_LEVEL_ORDER(level);
> +
> +        ASSERT(left >= BIT(order, UL));
> +
> +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
> +                                    flags);

Indentation.

> +        if ( rc )
> +            break;
> +
> +        vfn += 1U << order;
> +        if ( !mfn_eq(mfn, INVALID_MFN) )
> +            mfn = mfn_add(mfn, 1U << order);
> +
> +        left -= (1U << order);

To be on thje safe side, 1UL everywhere?

> +        if ( rc )
> +            break;

There was such a check already a few lines up from here.

> +    }
> +
> +
> +    /* ensure that PTEs are all updated before flushing */

Again: No double blank lines please.

> +    RISCV_FENCE(rw, rw);
> +
> +    /*
> +     * always flush TLB at the end of the function as non-present entries
> +     * can be put in the TLB
> +     */
> +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);

Coming back to "negative" TLB entries: Assuming RISC-V, just like other
architectures, also permits intermediate page table entries to be cached,
the affected VA range may be much larger than the original request, if
intermediate page tables needed creating.

> +    spin_unlock(&xen_pt_lock);

Does this really need to come after fence and flush?

> +    return rc;
> +}
> +
> +int map_pages_to_xen(unsigned long virt,
> +                     mfn_t mfn,
> +                     unsigned long nr_mfns,
> +                     unsigned int flags)
> +{
> +    /*
> +     * Ensure that we have a valid MFN before proceeding.
> +     *
> +     * If the MFN is invalid, pt_update() might misinterpret the operation,
> +     * treating it as either a population, a mapping destruction,
> +     * or a mapping modification.
> +     */
> +    ASSERT(!mfn_eq(mfn, INVALID_MFN));

But flags with PTE_VALID not set are fine to come into here?

> +    return pt_update(virt, mfn, nr_mfns, flags);
> +}

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 2/7] xen/riscv: set up fixmap mappings
  2024-08-13  8:22   ` Jan Beulich
@ 2024-08-14 14:21     ` oleksii.kurochko
  2024-08-14 15:08       ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-14 14:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-13 at 10:22 +0200, Jan Beulich wrote:
> On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > --- a/xen/arch/riscv/include/asm/config.h
> > +++ b/xen/arch/riscv/include/asm/config.h
> > @@ -74,6 +74,14 @@
> >  #error "unsupported RV_STAGE1_MODE"
> >  #endif
> >  
> > +#define XEN_VIRT_SIZE           MB(2)
> > +
> > +#define BOOT_FDT_VIRT_START     (XEN_VIRT_START + XEN_VIRT_SIZE)
> > +#define BOOT_FDT_VIRT_SIZE      MB(4)
> > +
> > +#define FIXMAP_BASE             (BOOT_FDT_VIRT_START +
> > BOOT_FDT_VIRT_SIZE)
> > +#define FIXMAP_ADDR(n)          (FIXMAP_BASE + (n) * PAGE_SIZE)
> 
> Just to confirm: Did you consider leaving gaps between the regions,
> but
> then discarded that idea for whatever reason? It's not like you're
> tight
> on address space.
No, I would like to leave gaps between the regions. I have a slot gap
where it is possible, inside a slot I decided just not to add this gap
without any specific reason TBH. I can add some gap ( for example,
0x100000 ) if it would be better and consistent.

> 
> As to FIXMAP_ADDR() - wouldn't that be a better fit in fixmap.h?
Sure, it would more correct place for this macros.

> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/fixmap.h
> > @@ -0,0 +1,44 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * fixmap.h: compile-time virtual memory allocation
> > + */
> > +#ifndef ASM_FIXMAP_H
> > +#define ASM_FIXMAP_H
> > +
> > +#include <xen/bug.h>
> > +#include <xen/page-size.h>
> > +#include <xen/pmap.h>
> > +
> > +#include <asm/page.h>
> > +
> > +/* Fixmap slots */
> > +#define FIX_PMAP_BEGIN (0) /* Start of PMAP */
> > +#define FIX_PMAP_END (FIX_PMAP_BEGIN + NUM_FIX_PMAP - 1) /* End of
> > PMAP */
> > +#define FIX_MISC (FIX_PMAP_END + 1)  /* Ephemeral mappings of
> > hardware */
> > +
> > +#define FIX_LAST (FIX_MISC + 1) /* +1 means a guard slot */
> 
> As per my comment on the earlier version: No, I don't think this
> arranges
> for a guard slot. It merely arranges for FIX_MISC to not trigger the
> BUG_ON() in virt_to_fix().
> 
> > --- a/xen/arch/riscv/include/asm/page.h
> > +++ b/xen/arch/riscv/include/asm/page.h
> > @@ -81,6 +81,12 @@ static inline void flush_page_to_ram(unsigned
> > long mfn, bool sync_icache)
> >      BUG_ON("unimplemented");
> >  }
> >  
> > +/* Write a pagetable entry. */
> > +static inline void write_pte(pte_t *p, pte_t pte)
> > +{
> > +    *p = pte;
> > +}
> 
> No use of write_atomic() here? And no read_pte() counterpart right
> away (then
> also properly using read_atomic())?
I wanted to avoid the fence before "*p=pte" which in case of
write_atomic() will be present. Won't it be enough to use here
WRITE_ONCE()?


> 
> > @@ -191,6 +195,45 @@ static bool __init
> > check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
> >      return is_mode_supported;
> >  }
> >  
> > +void __init setup_fixmap_mappings(void)
> > +{
> > +    pte_t *pte, tmp;
> > +    unsigned int i;
> > +
> > +    BUILD_BUG_ON(FIX_LAST >= PAGETABLE_ENTRIES);
> > +
> > +    pte = &stage1_pgtbl_root[pt_index(HYP_PT_ROOT_LEVEL,
> > FIXMAP_ADDR(0))];
> > +
> > +    /*
> > +     * In RISC-V page table levels are numbered from Lx to L0
> > where
> > +     * x is the highest page table level for currect  MMU mode (
> > for example,
> > +     * for Sv39 has 3 page tables so the x = 2 (L2 -> L1 -> L0) ).
> > +     *
> > +     * In this cycle we want to find L1 page table because as L0
> > page table
> > +     * xen_fixmap[] will be used.
> > +     */
> > +    for ( i = HYP_PT_ROOT_LEVEL; i-- > 1; )
> > +    {
> > +        BUG_ON(!pte_is_valid(*pte));
> > +
> > +        pte = (pte_t *)LOAD_TO_LINK(pte_to_paddr(*pte));
> > +        pte = &pte[pt_index(i, FIXMAP_ADDR(0))];
> > +    }
> > +
> > +    BUG_ON(pte_is_valid(*pte));
> > +
> > +    tmp = paddr_to_pte(LINK_TO_LOAD((unsigned long)&xen_fixmap),
> > PTE_TABLE);
> > +    write_pte(pte, tmp);
> > +
> > +    RISCV_FENCE(rw, rw);
> 
> In the changes section you say "r,r", and I was wondering about that.
> I
> take it that "rw,rw" is really what's needed here.
It was typo. I will update the change log.

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-13  8:54   ` Jan Beulich
@ 2024-08-14 14:45     ` oleksii.kurochko
  2024-08-14 15:22       ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-14 14:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-13 at 10:54 +0200, Jan Beulich wrote:
> On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > Introduce struct pcpu_info to store pCPU-related information.
> > Initially, it includes only processor_id, but it will be extended
> > to include guest CPU information and temporary variables for
> > saving/restoring vCPU registers.
> > 
> > Add set_processor_id() and get_processor_id() functions to set
> > and retrieve the processor_id stored in pcpu_info.
> > 
> > Introduce cpuid_to_hartid_map[NR_CPUS] to map Xen logical CPUs to
> > hart IDs (physical CPU IDs). Add auxiliary macros cpuid_to_hartid()
> > for convenient access to this mapping.
> > 
> > Define smp_processor_id() to provide accurate information,
> > replacing the previous "dummy" value of 0.
> > 
> > Initialize tp registers to point to pcpu_info[0].
> > Set processor_id to 0 for logical CPU 0 and store the physical CPU
> > ID
> > for the current logical CPU in cpuid_to_hartid_map[].
> > 
> > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> > ---
> > Changes in V4:
> >  - wrap id with () inside set_processor_id().
> >  - code style fixes
> >  - update BUG_ON(id > NR_CPUS) in smp_processor_id() and drop the
> > comment
> >    above BUG_ON().
> >  - s/__cpuid_to_hartid_map/cpuid_to_hartid_map
> >  - s/cpuid_to_hartid_map/cpuid_to_harti ( here cpuid_to_hartid_map
> > is the name
> >    of the macros ).
> >  - update the commit message above the code of TP register
> > initialization in
> >    start_xen().
> 
> I guess that's once again "comment", not "commit message"?
Yes, sorry for confusion. It should be comment.

> 
> > --- a/xen/arch/riscv/include/asm/processor.h
> > +++ b/xen/arch/riscv/include/asm/processor.h
> > @@ -12,8 +12,32 @@
> >  
> >  #ifndef __ASSEMBLY__
> >  
> > -/* TODO: need to be implemeted */
> > -#define smp_processor_id() 0
> > +#include <xen/bug.h>
> > +
> > +register struct pcpu_info *tp asm ("tp");
> > +
> > +struct pcpu_info {
> > +    unsigned int processor_id;
> > +};
> > +
> > +/* tp points to one of these */
> > +extern struct pcpu_info pcpu_info[NR_CPUS];
> > +
> > +#define get_processor_id()      (tp->processor_id)
> > +#define set_processor_id(id)    do { \
> > +    tp->processor_id = (id);         \
> > +} while (0)
> > +
> > +static inline unsigned int smp_processor_id(void)
> > +{
> > +    unsigned int id;
> > +
> > +    id = get_processor_id();
> > +
> > +    BUG_ON(id > NR_CPUS);
> 
> > = ?
> 
> > --- a/xen/arch/riscv/include/asm/smp.h
> > +++ b/xen/arch/riscv/include/asm/smp.h
> > @@ -5,6 +5,8 @@
> >  #include <xen/cpumask.h>
> >  #include <xen/percpu.h>
> >  
> > +#define INVALID_HARTID UINT_MAX
> > +
> >  DECLARE_PER_CPU(cpumask_var_t, cpu_sibling_mask);
> >  DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
> >  
> > @@ -14,6 +16,14 @@ DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
> >   */
> >  #define park_offline_cpus false
> >  
> > +void smp_set_bootcpu_id(unsigned long boot_cpu_hartid);
> > +
> > +/*
> > + * Mapping between linux logical cpu index and hartid.
> > + */
> > +extern unsigned long cpuid_to_hartid_map[NR_CPUS];
> > +#define cpuid_to_hartid(cpu) cpuid_to_hartid_map[cpu]
> 
> How wide can hart IDs be? Wider than 32 bits? If not, why unsigned
> long?
According to the spec:
```
The mhartid CSR is an MXLEN-bit read-only register containing the
integer ID of the hardware thread running the code
```
where MXLEN can bit 32 and 64.

> If so, what about RV32 (which generally you at least try cover where
> easily possible)?
On RV32 MXLEN will be 32 and unsigned long will be 32-bit too.

> 
> Is there a reason this needs to be a separate array, rather than
> being
> part of struct pcpu_info?
> 
> > @@ -40,6 +41,19 @@ void __init noreturn start_xen(unsigned long
> > bootcpu_id,
> >  {
> >      remove_identity_mapping();
> >  
> > +    /*
> > +     * tp register contains an address of physical cpu
> > information.
> > +     * So write physical CPU info of boot cpu to tp register
> > +     * It will be used later by get_processor_id() ( look at
> > +     * <asm/processor.h> ):
> > +     *   #define get_processor_id()    (tp->processor_id)
> > +     */
> > +    asm volatile ( "mv tp, %0" : : "r"((unsigned
> > long)&pcpu_info[0]) );
> 
> Perhaps be on the safe side and also add a memory barrier?
Do you mean compiler barrier? ( asm volatile ( "..." :: ... : "memory"
)? )

> 
> Perhaps be on the safe side and do this absolutely first in the
> function,
> or even in assembly (such that initializers of future variables
> declared
> at the top of the function end up safe, too)?
I am not sure that I am following your part related to put this code in
assembler ( instructions in assembly code still code be re-ordered what
can affect a time when tp will be set with correct value ) and what do
you mean by "initializers of future variables declared at the top of
the function end up safe"
> 
> Also nit: Blank please between closing quote and opening parenthesis.
> (Otoh you could omit the blank between the two colons.)
> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/smp.c
> > @@ -0,0 +1,4 @@
> > +#include <xen/smp.h>
> > +
> > +/* tp points to one of these per cpu */
> > +struct pcpu_info pcpu_info[NR_CPUS];
> 
> And they all need setting up statically? Is there a plan to make this
> dynamic (which could be recorded in a "fixme" in the comment)?
I didn't plan to make allocation of this array dynamic. I don't expect
that NR_CPUS will be big. I can add "fixme" but I am not really
understand what will be advantages if pcpu_info[] will be allocated
dynamically.

~ Oleksii


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 2/7] xen/riscv: set up fixmap mappings
  2024-08-14 14:21     ` oleksii.kurochko
@ 2024-08-14 15:08       ` Jan Beulich
  2024-08-15  8:16         ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-14 15:08 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 14.08.2024 16:21, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 10:22 +0200, Jan Beulich wrote:
>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/page.h
>>> +++ b/xen/arch/riscv/include/asm/page.h
>>> @@ -81,6 +81,12 @@ static inline void flush_page_to_ram(unsigned
>>> long mfn, bool sync_icache)
>>>      BUG_ON("unimplemented");
>>>  }
>>>  
>>> +/* Write a pagetable entry. */
>>> +static inline void write_pte(pte_t *p, pte_t pte)
>>> +{
>>> +    *p = pte;
>>> +}
>>
>> No use of write_atomic() here? And no read_pte() counterpart right
>> away (then
>> also properly using read_atomic())?
> I wanted to avoid the fence before "*p=pte" which in case of
> write_atomic() will be present.

Well, this goes back to write_atomic() resolving to write{b,w,l,q}() for
unclear reasons; even the comment in our atomic.h says "For some reason".
The fence there is of course getting in the way here. I keep forgetting
about that aspect, because neither x86 nor Arm has anything similar
afaics.

> Won't it be enough to use here WRITE_ONCE()?

Maybe. I'm not entirely sure.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-14 14:45     ` oleksii.kurochko
@ 2024-08-14 15:22       ` Jan Beulich
  2024-08-15  8:55         ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-14 15:22 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 14.08.2024 16:45, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 10:54 +0200, Jan Beulich wrote:
>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/smp.h
>>> +++ b/xen/arch/riscv/include/asm/smp.h
>>> @@ -5,6 +5,8 @@
>>>  #include <xen/cpumask.h>
>>>  #include <xen/percpu.h>
>>>  
>>> +#define INVALID_HARTID UINT_MAX
>>> +
>>>  DECLARE_PER_CPU(cpumask_var_t, cpu_sibling_mask);
>>>  DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
>>>  
>>> @@ -14,6 +16,14 @@ DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
>>>   */
>>>  #define park_offline_cpus false
>>>  
>>> +void smp_set_bootcpu_id(unsigned long boot_cpu_hartid);
>>> +
>>> +/*
>>> + * Mapping between linux logical cpu index and hartid.
>>> + */
>>> +extern unsigned long cpuid_to_hartid_map[NR_CPUS];
>>> +#define cpuid_to_hartid(cpu) cpuid_to_hartid_map[cpu]
>>
>> How wide can hart IDs be? Wider than 32 bits? If not, why unsigned
>> long?
> According to the spec:
> ```
> The mhartid CSR is an MXLEN-bit read-only register containing the
> integer ID of the hardware thread running the code
> ```
> where MXLEN can bit 32 and 64.

Hmm, okay. If the full MXLEN bits can be used, then using unsigned long
is indeed the right thing here.

>>> @@ -40,6 +41,19 @@ void __init noreturn start_xen(unsigned long
>>> bootcpu_id,
>>>  {
>>>      remove_identity_mapping();
>>>  
>>> +    /*
>>> +     * tp register contains an address of physical cpu
>>> information.
>>> +     * So write physical CPU info of boot cpu to tp register
>>> +     * It will be used later by get_processor_id() ( look at
>>> +     * <asm/processor.h> ):
>>> +     *   #define get_processor_id()    (tp->processor_id)
>>> +     */
>>> +    asm volatile ( "mv tp, %0" : : "r"((unsigned
>>> long)&pcpu_info[0]) );
>>
>> Perhaps be on the safe side and also add a memory barrier?
> Do you mean compiler barrier? ( asm volatile ( "..." :: ... : "memory"
> )? )

Yes.

>> Perhaps be on the safe side and do this absolutely first in the
>> function,
>> or even in assembly (such that initializers of future variables
>> declared
>> at the top of the function end up safe, too)?
> I am not sure that I am following your part related to put this code in
> assembler ( instructions in assembly code still code be re-ordered what
> can affect a time when tp will be set with correct value )

I'm afraid I don't understand this. Instructions can be re-ordered, sure,
but later instructions are guaranteed to observe the effects on register
state that earlier instructions caused.

> and what do
> you mean by "initializers of future variables declared at the top of
> the function end up safe"

With

void start_xen()
{
    int var = func();

    asm volatile ( "mv tp, %0" :: ...);
    ...

you end up with the requirement that func() must not use anything that
depends on tp being set. In this simple example it may be easy to say
"don't use an initializer and call the function afterwards". But that is
going to be a risky game to play. Look at x86'es __start_xen(). An
insertion into such a big set of declarations may not pay attention to
tp not being set yet, when _all_ other C code may reasonably assume tp
is set.

>>> --- /dev/null
>>> +++ b/xen/arch/riscv/smp.c
>>> @@ -0,0 +1,4 @@
>>> +#include <xen/smp.h>
>>> +
>>> +/* tp points to one of these per cpu */
>>> +struct pcpu_info pcpu_info[NR_CPUS];
>>
>> And they all need setting up statically? Is there a plan to make this
>> dynamic (which could be recorded in a "fixme" in the comment)?
> I didn't plan to make allocation of this array dynamic. I don't expect
> that NR_CPUS will be big.

What is this expectation of yours based on? Other architectures permit
systems with hundreds or even thousands of CPUs; why would RISC-V be
different there?

> I can add "fixme" but I am not really
> understand what will be advantages if pcpu_info[] will be allocated
> dynamically.

Where possible it's better to avoid static allocations, of which on
some systems only a very small part may be used. Even if you put yourself
on the position that many take - memory being cheap - you then still
waste cache and TLB bandwidth. Furthermore as long as struct pcpu_info
isn't big enough (and properly aligned) for two successive array entries
to not share cache lines, you may end up playing cacheline ping-pong
when a CPU writes to its own array slot.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-13  9:34   ` Jan Beulich
@ 2024-08-14 15:41     ` oleksii.kurochko
  2024-08-14 15:53       ` Jan Beulich
  2024-08-16 12:06     ` oleksii.kurochko
  1 sibling, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-14 15:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-13 at 11:34 +0200, Jan Beulich wrote:
> On 09.08.2024 18:19, Oleksii Kurochko wrote:
> 
> > --- a/xen/arch/riscv/sbi.c
> > +++ b/xen/arch/riscv/sbi.c
> > @@ -7,11 +7,23 @@
> >   * Modified by Bobby Eshleman (bobby.eshleman@gmail.com).
> >   *
> >   * Copyright (c) 2019 Western Digital Corporation or its
> > affiliates.
> > - * Copyright (c) 2021-2023 Vates SAS.
> > + * Copyright (c) 2021-2024 Vates SAS.
> >   */
> >  
> > +#include <xen/compiler.h>
> > +#include <xen/const.h>
> > +#include <xen/cpumask.h>
> > +#include <xen/errno.h>
> > +#include <xen/init.h>
> > +#include <xen/lib.h>
> > +#include <xen/smp.h>
> > +
> > +#include <asm/processor.h>
> >  #include <asm/sbi.h>
> >  
> > +static unsigned long sbi_spec_version = SBI_SPEC_VERSION_DEFAULT;
> > +static unsigned long sbi_fw_id, sbi_fw_version;
> 
> __ro_after_init for perhaps all three of them?
> 
> Considering SBI_SPEC_VERSION_{MAJOR,MINOR}_MASK, at least the first
> of them also doesn't need to be unsigned long, but could be unsigned
> int?
It could be unsigned int, this part is declared as unsigned long in
Linux kernel so was taken as is. But based on the possible values for
these variables we can go with unsigned long.

> 
> > @@ -38,7 +50,245 @@ struct sbiret sbi_ecall(unsigned long ext,
> > unsigned long fid,
> >      return ret;
> >  }
> >  
> > +static int sbi_err_map_xen_errno(int err)
> > +{
> > +    switch ( err )
> > +    {
> > +    case SBI_SUCCESS:
> > +        return 0;
> > +    case SBI_ERR_DENIED:
> > +        return -EACCES;
> > +    case SBI_ERR_INVALID_PARAM:
> > +        return -EINVAL;
> > +    case SBI_ERR_INVALID_ADDRESS:
> > +        return -EFAULT;
> > +    case SBI_ERR_NOT_SUPPORTED:
> > +        return -EOPNOTSUPP;
> > +    case SBI_ERR_FAILURE:
> > +        fallthrough;
> > +    default:
> > +        return -ENOSYS;
> > +    };
> > +}
> > +
> >  void sbi_console_putchar(int ch)
> >  {
> >      sbi_ecall(SBI_EXT_0_1_CONSOLE_PUTCHAR, 0, ch, 0, 0, 0, 0, 0);
> >  }
> > +
> > +static unsigned long sbi_major_version(void)
> > +{
> > +    return (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) &
> > +        SBI_SPEC_VERSION_MAJOR_MASK;
> > +}
> 
> Can you use MASK_EXTR() here, allowing to even drop the separate
> SBI_SPEC_VERSION_MAJOR_SHIFT?
I am not sure that:
(sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) & 
SBI_SPEC_VERSION_MAJOR_MASK == MASK_EXTR(sbi_spec_version,
SBI_SPEC_VERSION_MAJOR_MASK)

> 
> > +static unsigned long sbi_minor_version(void)
> > +{
> > +    return sbi_spec_version & SBI_SPEC_VERSION_MINOR_MASK;
> > +}
> 
> For consistency here then, too.
Here we can use MASK_EXTR.

> 
> > +static long sbi_ext_base_func(long fid)
> > +{
> > +    struct sbiret ret;
> > +
> > +    ret = sbi_ecall(SBI_EXT_BASE, fid, 0, 0, 0, 0, 0, 0);
> > +    if ( !ret.error )
> > +        return ret.value;
> > +    else
> > +        return ret.error;
> 
> With the folding of two distinct values, how is the caller going to
> tell failure from success?
By checking if the return value is < 0.
According to the SBI spec [
https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/binary-encoding.adoc?plain=1#L32
] ret.error can be or 0 ( which means success ) or something < 0 if it
was a failure.

> 
> > +}
> > +
> > +static int __sbi_rfence_v02_real(unsigned long fid,
> 
> Are the double leading underscores really necessary here? (Same again
> further down.)
No at all, I'll drop it.

> 
> > +                                 unsigned long hmask, unsigned
> > long hbase,
> > +                                 unsigned long start, unsigned
> > long size,
> > +                                 unsigned long arg4)
> > +{
> > +    struct sbiret ret = {0};
> > +    int result = 0;
> > +
> > +    switch ( fid )
> > +    {
> > +    case SBI_EXT_RFENCE_REMOTE_FENCE_I:
> > +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> > +                0, 0, 0, 0);
> > +        break;
> > +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
> > +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> > +                start, size, 0, 0);
> > +        break;
> > +    case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
> > +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> > +                start, size, arg4, 0);
> > +        break;
> > +    case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
> > +        ret = sbi_ecall(SBI_EXT_RFENCE, fid, hmask, hbase,
> > +                start, size, 0, 0);
> > +        break;
> 
> How is e.g. this different from SBI_EXT_RFENCE_REMOTE_SFENCE_VMA
> handling? To ease recognition, I think you want to fold cases with
> identical handling.
Agree, it could be folded for some cases.

> 
> > +int sbi_remote_sfence_vma(cpumask_t *cpu_mask,
> > +                          unsigned long start_addr,
> > +                          unsigned long size)
> > +{
> > +    return __sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA,
> > +                        cpu_mask, start_addr, size, 0, 0);
> 
> No check (not even an assertion) here for __sbi_rfence still being
> NULL?
I thought that it would be enough to have BUG_ON() in sbi_init() but
then probably would be better to update the message inside BUG_ON():
   int __init sbi_init(void)
   {
   ....
   
       if ( !sbi_has_rfence() )
       {
           BUG_ON("At the moment flush_xen_tlb_range_va() uses SBI rfence
   to "
                  "flush TLB for all CPUS!");
       }
   
       return 0;
   }
> 
> > +int __init sbi_init(void)
> > +{
> > +    int ret;
> > +
> > +    ret = sbi_get_spec_version();
> > +    if ( ret > 0 )
> > +        sbi_spec_version = ret;
> 
> Truncation from long to int is not an issue here?
No, spec_version doesn't have such big numbers, the latest version is
v2.0.

> 
> > +    printk("SBI specification v%lu.%lu detected\n",
> > +            sbi_major_version(), sbi_minor_version());
> > +
> > +    if ( !sbi_spec_is_0_1() )
> > +    {
> > +        sbi_fw_id = sbi_get_firmware_id();
> > +        sbi_fw_version = sbi_get_firmware_version();
> 
> These cannot return errors?
At least, SBI specs don't tell that directly:
https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-base.adoc?plain=1#L31

~ Oleksii



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-14 15:41     ` oleksii.kurochko
@ 2024-08-14 15:53       ` Jan Beulich
  2024-08-15 10:00         ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-14 15:53 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 14.08.2024 17:41, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 11:34 +0200, Jan Beulich wrote:
>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>> +static unsigned long sbi_major_version(void)
>>> +{
>>> +    return (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) &
>>> +        SBI_SPEC_VERSION_MAJOR_MASK;
>>> +}
>>
>> Can you use MASK_EXTR() here, allowing to even drop the separate
>> SBI_SPEC_VERSION_MAJOR_SHIFT?
> I am not sure that:
> (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) & 
> SBI_SPEC_VERSION_MAJOR_MASK == MASK_EXTR(sbi_spec_version,
> SBI_SPEC_VERSION_MAJOR_MASK)

How come you're not sure? That's precisely what MASK_EXTR() is for.

>>> +static long sbi_ext_base_func(long fid)
>>> +{
>>> +    struct sbiret ret;
>>> +
>>> +    ret = sbi_ecall(SBI_EXT_BASE, fid, 0, 0, 0, 0, 0, 0);
>>> +    if ( !ret.error )
>>> +        return ret.value;
>>> +    else
>>> +        return ret.error;
>>
>> With the folding of two distinct values, how is the caller going to
>> tell failure from success?
> By checking if the return value is < 0.
> According to the SBI spec [
> https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/binary-encoding.adoc?plain=1#L32
> ] ret.error can be or 0 ( which means success ) or something < 0 if it
> was a failure.

Right. And what if ret.error was 0, but ret.value was negative?

>>> +int sbi_remote_sfence_vma(cpumask_t *cpu_mask,
>>> +                          unsigned long start_addr,
>>> +                          unsigned long size)
>>> +{
>>> +    return __sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA,
>>> +                        cpu_mask, start_addr, size, 0, 0);
>>
>> No check (not even an assertion) here for __sbi_rfence still being
>> NULL?
> I thought that it would be enough to have BUG_ON() in sbi_init() but
> then probably would be better to update the message inside BUG_ON():
>    int __init sbi_init(void)
>    {
>    ....
>    
>        if ( !sbi_has_rfence() )
>        {
>            BUG_ON("At the moment flush_xen_tlb_range_va() uses SBI rfence
>    to "
>                   "flush TLB for all CPUS!");

I never really liked this kind of BUG_ON(). I leave it uncommented in
code which clearly is only temporary. Plus imo such BUG_ON()s want to
be next to where the risk is, i.e. in this case ahead of the possible
NULL deref. Then again, without PV guests and without anything mapped
at VA 0, you'll crash cleanly anyway. So perhaps my request to add a
check went too far.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-13 10:31   ` Jan Beulich
@ 2024-08-14 16:50     ` oleksii.kurochko
  2024-08-15  8:09       ` Jan Beulich
  2024-08-20 13:18     ` oleksii.kurochko
  1 sibling, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-14 16:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > Implement map_pages_to_xen() which requires several
> > functions to manage page tables and entries:
> > - pt_update()
> > - pt_mapping_level()
> > - pt_update_entry()
> > - pt_next_level()
> > - pt_check_entry()
> > 
> > To support these operations, add functions for creating,
> > mapping, and unmapping Xen tables:
> > - create_xen_table()
> > - xen_map_table()
> > - xen_unmap_table()
> 
> I think I commented on this before: Everything is "Xen" in hypervisor
> code. What I think you mean is to map/unmap Xen's own page tables.
> Naming-wise that would be {,un}map_xen_table(), though. Since they
> are static, just {,un}map_table() ought to be unambiguous, too.
I thought that your comment was about pt_*() functions but thanks for
explanation again.

> 
> > Introduce internal macros starting with PTE_* for convenience.
> > These macros closely resemble PTE bits, with the exception of
> > PTE_BLOCK, which indicates that a page larger than 4KB is
> > needed.
> 
> I did comment on this too, iirc: Is there going to be any case where
> large pages are going to be "needed", i.e. not just preferred? If
> not,
> giving the caller control over things the other way around
> (requesting
> 4k mappings are needed, as we have it in x86) may be preferable.
Yes, you did the comment but I thought that it will be enough to
mention that shattering isn't supported now and  also since
pt_update_entry()is used to unmap as well, there could be a need to
unmap e.g. 4K page from 2M block mapping what will a little bit harder
then just having 4k by default.

> 
> Hmm, but then ...
> 
> > RISC-V detects superpages using `pte.x` and `pte.r`, as there
> > is no specific bit in the PTE for this purpose. From the RISC-V
> > spec:
> > ```
> >   ...
> >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
> > step 5.
> >      Otherwise, this PTE is a pointer to the next level of the page
> > table.
> >      ... .
> >   5. A leaf PTE has been found.
> >      ...
> >   ...
> > ```
> > 
> > The code doesn’t support super page shattering so 4KB pages are
> > used as
> > default.
> 
> ... you have this. Yet still callers expecting re-mapping in the
> (large)
> range they map can request small-page mappings right away.
I am not sure that I fully understand what do you mean by "expcting re-
mapping".

> 
> > --- a/xen/arch/riscv/include/asm/flushtlb.h
> > +++ b/xen/arch/riscv/include/asm/flushtlb.h
> > @@ -5,12 +5,25 @@
> >  #include <xen/bug.h>
> >  #include <xen/cpumask.h>
> >  
> > +#include <asm/sbi.h>
> > +
> >  /* Flush TLB of local processor for address va. */
> >  static inline void flush_tlb_one_local(vaddr_t va)
> >  {
> >      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
> >  }
> >  
> > +/*
> > + * Flush a range of VA's hypervisor mappings from the TLB of all
> > + * processors in the inner-shareable domain.
> > + */
> > +static inline void flush_tlb_range_va(vaddr_t va,
> > +                                      size_t size)
> 
> No need for line wrapping here?
What is line wrapping here? Do you mean that size_t size should be on
the previous line?

> 
> > @@ -33,15 +38,72 @@
> >  #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> >  #define PTE_TABLE                   (PTE_VALID)
> >  
> > +#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
> >  #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> > +#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE |
> > PTE_EXECUTABLE)
> >  
> >  #define PAGE_HYPERVISOR             PAGE_HYPERVISOR_RW
> >  
> > +
> > +/*
> 
> Nit: As before, no double blank lines please.
> 
> > + * There are no such bits in PTE format for RISC-V.
> 
> This is an odd way to start a comment: There's nothing for "such" to
> refer
> to.
> 
> > + * The code doesn’t support super page shattering so at the moment
> > superpages
> > + * can't be used as a default so PTE_BLOCK is introduced to have
> > ability to
> > + * tell that superpage should be allocated.
> > + * Additionaly as mentioed in RISC-V priviliged spec:
> > + * ```
> > + *  After much deliberation, we have settled on a conventional
> > page size of
> > + *  4 KiB for both RV32 and RV64. We expect this decision to ease
> > the porting
> > + *  of low-level runtime software and device drivers.
> > + *
> > + *  The TLB reach problem is ameliorated by transparent superpage
> > support in
> > + *  modern operating systems [2]. Additionally, multi-level TLB
> > hierarchies
> > + *  are quite inexpensive relative to the multi-level cache
> > hierarchies whose
> > + *  address space they map.
> > + *
> > + *  [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox.
> > Practical,
> > + *      transparent operating system support for superpages.
> > + *      SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
> > + * ```
> > + *
> > + * PTE_POPULATE is introduced to have ability to tell that page
> > tables
> > + * shoud be populated.
> > + */
> > +#define PTE_BLOCK       BIT(10, UL)
> > +#define PTE_POPULATE    BIT(11, UL)
> > +
> > +#define PTE_R_MASK(x)   ((x) & PTE_READABLE)
> > +#define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
> > +#define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
> > +
> > +#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
> > PTE_EXECUTABLE))
> > +
> >  /* Calculate the offsets into the pagetables for a given VA */
> >  #define pt_linear_offset(lvl, va)   ((va) >>
> > XEN_PT_LEVEL_SHIFT(lvl))
> >  
> >  #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) &
> > VPN_MASK)
> >  
> > +#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U)
> > << PAGETABLE_ORDER) - 1))
> 
> Not: Too long line.
> 
> > +#if RV_STAGE1_MODE > SATP_MODE_SV48
> 
> SV48? Isn't ...
> 
> > +#error "need to to update DECLARE_OFFSETS macros"
> > +#else
> > +
> > +#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
> > +#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
> > +#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
> > +
> > +/* Generate an array @var containing the offset for each level
> > from @addr */
> > +#define DECLARE_OFFSETS(var, addr)          \
> > +    const unsigned int var[] = {            \
> > +        l0_table_offset(addr),              \
> > +        l1_table_offset(addr),              \
> > +        l2_table_offset(addr),              \
> > +    }
> 
> ... this for SV39?
Agree, the check above isn't correct. It should be "RV_STAGE1_MODE >
SATP_MODE_SV39".


> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/pt.c
> > @@ -0,0 +1,408 @@
> > +#include <xen/bug.h>
> > +#include <xen/domain_page.h>
> > +#include <xen/errno.h>
> > +#include <xen/mm.h>
> > +#include <xen/mm-frame.h>
> > +#include <xen/pmap.h>
> > +#include <xen/spinlock.h>
> > +
> > +#include <asm/flushtlb.h>
> > +#include <asm/page.h>
> > +
> > +static inline const mfn_t get_root_page(void)
> > +{
> > +    unsigned long root_maddr =
> 
> maddr_t or paddr_t?
> 
> > +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> > +
> > +    return maddr_to_mfn(root_maddr);
> > +}
> > +
> > +/*
> > + * Sanity check of the entry
> > + * mfn is not valid and we are not populating page table. This
> > means
> 
> How does this fit with ...
> 
> > + * we either modify entry or remove an entry.
> > + */
> > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
> > flags)
> > +{
> > +    /* Sanity check when modifying an entry. */
> > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> 
> ... the MFN check here?
The comment is incorrect and should be corrected:
" if mfn is valid or ... "

>  And why is (valid,INVALID_MFN) an indication
> of a modification? But then ...
the explanation is in the comment to pt_update():
   /*
    * If `mfn` equals `INVALID_MFN`, it indicates that the following page
   table
    * update operation might be related to either populating the table,
    * destroying a mapping, or modifying an existing mapping.
    */
   static int pt_update(unsigned long virt,

> 
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> > +            return false;
> > +        }
> 
> ... I also don't understand what this is about. IOW I'm afraid I'm
> still confused about the purpose of this function as well as the
> transitions you want to permit / reject. I wonder whether the
> flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
> 
> > +/* Update an entry at the level @target. */
> > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > +                           mfn_t mfn, unsigned int target,
> > +                           unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page tables are read-only when the MFN is
> > not valid
> > +     * and we are not populating page table.
> 
> The way flags are handled in PTEs, how can page tables be read-only?
I started to be confused. Probably I have to re-write some code and
also drop almost the whole function xen_pt_check_entry().

> 
> > +     * This means we either modify permissions or remove an entry.
> 
> From all I can determine we also get here when making brand new
> entries.
> What am I overlooking?
Nothing. then it means intermidiate page table won't be read-only.

> 
> > +     */
> > +    bool alloc_only = mfn_eq(mfn, INVALID_MFN) && !(flags &
> > PTE_POPULATE);
> > +    pte_t pte, *entry;
> > +
> > +    /* convenience aliases */
> > +    DECLARE_OFFSETS(offsets, virt);
> > +
> > +    table = xen_map_table(root);
> > +    for ( ; level > target; level-- )
> > +    {
> > +        rc = pt_next_level(alloc_only, &table, offsets[level]);
> > +        if ( rc == XEN_TABLE_MAP_FAILED )
> > +        {
> > +            rc = 0;
> > +
> > +            /*
> > +             * We are here because pt_next_level has failed to map
> > +             * the intermediate page table (e.g the table does not
> > exist
> > +             * and the pt is read-only). It is a valid case when
> > +             * removing a mapping as it may not exist in the page
> > table.
> > +             * In this case, just ignore it.
> > +             */
> > +            if ( flags & PTE_VALID )
> > +            {
> > +                printk("%s: Unable to map level %u\n", __func__,
> > level);
> > +                rc = -ENOENT;
> > +            }
> > +
> > +            goto out;
> > +        }
> > +        else if ( rc != XEN_TABLE_NORMAL )
> > +            break;
> > +    }
> > +
> > +    if ( level != target )
> > +    {
> > +        printk("%s: Shattering superpage is not supported\n",
> > __func__);
> > +        rc = -EOPNOTSUPP;
> > +        goto out;
> > +    }
> > +
> > +    entry = table + offsets[level];
> > +
> > +    rc = -EINVAL;
> > +    if ( !pt_check_entry(*entry, mfn, flags) )
> > +        goto out;
> > +
> > +    /* We are removing the page */
> > +    if ( !(flags & PTE_VALID) )
> > +        memset(&pte, 0x00, sizeof(pte));
> > +    else
> > +    {
> > +        /* We are inserting a mapping => Create new pte. */
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            pte = pte_from_mfn(mfn, PTE_VALID);
> > +        else /* We are updating the permission => Copy the current
> > pte. */
> > +            pte = *entry;
> > +
> > +        /* update permission according to the flags */
> > +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
> > +    }
> > +
> > +    write_pte(entry, pte);
> > +
> > +    rc = 0;
> > +
> > + out:
> > +    xen_unmap_table(table);
> > +
> > +    return rc;
> > +}
> > +
> > +static DEFINE_SPINLOCK(xen_pt_lock);
> 
> If you put this in the middle of the file (which is fine), I think it
> wants putting immediately ahead of the (first) function using it, not
> at some seemingly random place.
> 
> > +/*
> > + * If `mfn` equals `INVALID_MFN`, it indicates that the following
> > page table
> > + * update operation might be related to either populating the
> > table,
> > + * destroying a mapping, or modifying an existing mapping.
> > + */
> > +static int pt_update(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    int rc = 0;
> > +    unsigned long vfn = virt >> PAGE_SHIFT;
> > +    unsigned long left = nr_mfns;
> > +
> > +    const mfn_t root = get_root_page();
> > +
> > +    /*
> > +     * It is bad idea to have mapping both writeable and
> > +     * executable.
> > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > +     * prevent any update if this happen.
> > +     */
> > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
> > PTE_X_MASK(flags) )
> > +    {
> > +        printk("Mappings should not be both Writeable and
> > Executable.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> > +    {
> > +        printk("The virtual address is not aligned to the page-
> > size.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    spin_lock(&xen_pt_lock);
> > +
> > +    while ( left )
> > +    {
> > +        unsigned int order, level;
> > +
> > +        level = pt_mapping_level(vfn, mfn, left, flags);
> > +        order = XEN_PT_LEVEL_ORDER(level);
> > +
> > +        ASSERT(left >= BIT(order, UL));
> > +
> > +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
> > +                                    flags);
> 
> Indentation.
> 
> > +        if ( rc )
> > +            break;
> > +
> > +        vfn += 1U << order;
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            mfn = mfn_add(mfn, 1U << order);
> > +
> > +        left -= (1U << order);
> 
> To be on thje safe side, 1UL everywhere?
> 
> > +        if ( rc )
> > +            break;
> 
> There was such a check already a few lines up from here.
> 
> > +    }
> > +
> > +
> > +    /* ensure that PTEs are all updated before flushing */
> 
> Again: No double blank lines please.
> 
> > +    RISCV_FENCE(rw, rw);
> > +
> > +    /*
> > +     * always flush TLB at the end of the function as non-present
> > entries
> > +     * can be put in the TLB
> > +     */
> > +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
> 
> Coming back to "negative" TLB entries: Assuming RISC-V, just like
> other
> architectures, also permits intermediate page table entries to be
> cached,
> the affected VA range may be much larger than the original request,
> if
> intermediate page tables needed creating.
It could be an issue. Could we some how  to calculate the proper range
or the only option we have is to flush all.
   And for some reason it isn't an issue for Arm:
   
       /*
        * The TLBs flush can be safely skipped when a mapping is
   inserted
        * as we don't allow mapping replacement (see
   xen_pt_check_entry()).
        * Although we still need an ISB to ensure any DSB in
        * write_pte() will complete because the mapping may be used
   soon
        * after.
        *
        * For all the other cases, the TLBs will be flushed
   unconditionally
        * even if the mapping has failed. This is because we may have
        * partially modified the PT. This will prevent any unexpected
        * behavior afterwards.
        */
       if ( !((flags & _PAGE_PRESENT) && !mfn_eq(mfn, INVALID_MFN)) )
           flush_xen_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
       else
           isb();
   

> 
> > +    spin_unlock(&xen_pt_lock);
> 
> Does this really need to come after fence and flush?
I think yes, as page table should be updated only by 1 CPU at the same
time. And before give ability to other CPU to update page table we have
to finish a work on current CPU.

> 
> > +    return rc;
> > +}
> > +
> > +int map_pages_to_xen(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    /*
> > +     * Ensure that we have a valid MFN before proceeding.
> > +     *
> > +     * If the MFN is invalid, pt_update() might misinterpret the
> > operation,
> > +     * treating it as either a population, a mapping destruction,
> > +     * or a mapping modification.
> > +     */
> > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> 
> But flags with PTE_VALID not set are fine to come into here?
It is fine for pt_update() but I don't know if it is fine for
map_pages_to_xen(). I see that other architectures don't check that.

~ Oleksii

> 
> > +    return pt_update(virt, mfn, nr_mfns, flags);
> > +}
> 
> Jan



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-14 16:50     ` oleksii.kurochko
@ 2024-08-15  8:09       ` Jan Beulich
  2024-08-15 11:21         ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-15  8:09 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>> Introduce internal macros starting with PTE_* for convenience.
>>> These macros closely resemble PTE bits, with the exception of
>>> PTE_BLOCK, which indicates that a page larger than 4KB is
>>> needed.
>>
>> I did comment on this too, iirc: Is there going to be any case where
>> large pages are going to be "needed", i.e. not just preferred? If
>> not,
>> giving the caller control over things the other way around
>> (requesting
>> 4k mappings are needed, as we have it in x86) may be preferable.
> Yes, you did the comment but I thought that it will be enough to
> mention that shattering isn't supported now and  also since
> pt_update_entry()is used to unmap as well, there could be a need to
> unmap e.g. 4K page from 2M block mapping what will a little bit harder
> then just having 4k by default.

Shattering isn't supported now, but that's going to change at some point,
I suppose. Where possible the long-term behavior wants taking into account
right away, to avoid having to e.g. touch all callers again later on.

>> Hmm, but then ...
>>
>>> RISC-V detects superpages using `pte.x` and `pte.r`, as there
>>> is no specific bit in the PTE for this purpose. From the RISC-V
>>> spec:
>>> ```
>>>   ...
>>>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
>>> step 5.
>>>      Otherwise, this PTE is a pointer to the next level of the page
>>> table.
>>>      ... .
>>>   5. A leaf PTE has been found.
>>>      ...
>>>   ...
>>> ```
>>>
>>> The code doesn’t support super page shattering so 4KB pages are
>>> used as
>>> default.
>>
>> ... you have this. Yet still callers expecting re-mapping in the
>> (large)
>> range they map can request small-page mappings right away.
> I am not sure that I fully understand what do you mean by "expcting re-
> mapping".

Right now you have callers pass PTE_BLOCK when they know that no small
page re-mappings are going to occur for an area. What I'm suggesting is
that you invert this logic: Have callers pass PTE_SMALL when there is
a possibility that re-mapping requests may be issued later. Then,
later, by simply grep-ing for PTE_SMALL you'll be able to easily find
all candidates that possibly can be relaxed when super-page shattering
starts being supported. That's going to be easier than finding all
instances where PTE_BLOCK is _not_used.

>>> --- a/xen/arch/riscv/include/asm/flushtlb.h
>>> +++ b/xen/arch/riscv/include/asm/flushtlb.h
>>> @@ -5,12 +5,25 @@
>>>  #include <xen/bug.h>
>>>  #include <xen/cpumask.h>
>>>  
>>> +#include <asm/sbi.h>
>>> +
>>>  /* Flush TLB of local processor for address va. */
>>>  static inline void flush_tlb_one_local(vaddr_t va)
>>>  {
>>>      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
>>>  }
>>>  
>>> +/*
>>> + * Flush a range of VA's hypervisor mappings from the TLB of all
>>> + * processors in the inner-shareable domain.
>>> + */
>>> +static inline void flush_tlb_range_va(vaddr_t va,
>>> +                                      size_t size)
>>
>> No need for line wrapping here?
> What is line wrapping here? Do you mean that size_t size should be on
> the previous line?

Yes. Everything will fit on one line quite nicely.

>>> --- /dev/null
>>> +++ b/xen/arch/riscv/pt.c
>>> @@ -0,0 +1,408 @@
>>> +#include <xen/bug.h>
>>> +#include <xen/domain_page.h>
>>> +#include <xen/errno.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/mm-frame.h>
>>> +#include <xen/pmap.h>
>>> +#include <xen/spinlock.h>
>>> +
>>> +#include <asm/flushtlb.h>
>>> +#include <asm/page.h>
>>> +
>>> +static inline const mfn_t get_root_page(void)
>>> +{
>>> +    unsigned long root_maddr =
>>
>> maddr_t or paddr_t?
>>
>>> +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
>>> +
>>> +    return maddr_to_mfn(root_maddr);
>>> +}
>>> +
>>> +/*
>>> + * Sanity check of the entry
>>> + * mfn is not valid and we are not populating page table. This
>>> means
>>
>> How does this fit with ...
>>
>>> + * we either modify entry or remove an entry.
>>> + */
>>> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
>>> flags)
>>> +{
>>> +    /* Sanity check when modifying an entry. */
>>> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>>
>> ... the MFN check here?
> The comment is incorrect and should be corrected:
> " if mfn is valid or ... "
> 
>>  And why is (valid,INVALID_MFN) an indication
>> of a modification? But then ...
> the explanation is in the comment to pt_update():
>    /*
>     * If `mfn` equals `INVALID_MFN`, it indicates that the following page
>    table
>     * update operation might be related to either populating the table,
>     * destroying a mapping, or modifying an existing mapping.
>     */
>    static int pt_update(unsigned long virt,

And how do readers know that comments in pt_update() are crucial for
understanding what pt_check_entry() does? You certainly don't need to
have the same comment in two places, but you at least want to refer
to a relevant comment when that lives elsewhere.

>>> +static int pt_update(unsigned long virt,
>>> +                     mfn_t mfn,
>>> +                     unsigned long nr_mfns,
>>> +                     unsigned int flags)
>>> +{
>>> +    int rc = 0;
>>> +    unsigned long vfn = virt >> PAGE_SHIFT;
>>> +    unsigned long left = nr_mfns;
>>> +
>>> +    const mfn_t root = get_root_page();
>>> +
>>> +    /*
>>> +     * It is bad idea to have mapping both writeable and
>>> +     * executable.
>>> +     * When modifying/creating mapping (i.e PTE_VALID is set),
>>> +     * prevent any update if this happen.
>>> +     */
>>> +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
>>> PTE_X_MASK(flags) )
>>> +    {
>>> +        printk("Mappings should not be both Writeable and
>>> Executable.\n");
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
>>> +    {
>>> +        printk("The virtual address is not aligned to the page-
>>> size.\n");
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    spin_lock(&xen_pt_lock);
>>> +
>>> +    while ( left )
>>> +    {
>>> +        unsigned int order, level;
>>> +
>>> +        level = pt_mapping_level(vfn, mfn, left, flags);
>>> +        order = XEN_PT_LEVEL_ORDER(level);
>>> +
>>> +        ASSERT(left >= BIT(order, UL));
>>> +
>>> +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
>>> +                                    flags);
>>
>> Indentation.
>>
>>> +        if ( rc )
>>> +            break;
>>> +
>>> +        vfn += 1U << order;
>>> +        if ( !mfn_eq(mfn, INVALID_MFN) )
>>> +            mfn = mfn_add(mfn, 1U << order);
>>> +
>>> +        left -= (1U << order);
>>
>> To be on thje safe side, 1UL everywhere?
>>
>>> +        if ( rc )
>>> +            break;
>>
>> There was such a check already a few lines up from here.
>>
>>> +    }
>>> +
>>> +
>>> +    /* ensure that PTEs are all updated before flushing */
>>
>> Again: No double blank lines please.
>>
>>> +    RISCV_FENCE(rw, rw);
>>> +
>>> +    /*
>>> +     * always flush TLB at the end of the function as non-present
>>> entries
>>> +     * can be put in the TLB
>>> +     */
>>> +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
>>
>> Coming back to "negative" TLB entries: Assuming RISC-V, just like
>> other
>> architectures, also permits intermediate page table entries to be
>> cached,
>> the affected VA range may be much larger than the original request,
>> if
>> intermediate page tables needed creating.
> It could be an issue. Could we some how  to calculate the proper range
> or the only option we have is to flush all.

Right - either you maintain state to know the biggest possible range
that can be affected, or you flush all when a new intermediate page
table needed inserting.

>>> +    spin_unlock(&xen_pt_lock);
>>
>> Does this really need to come after fence and flush?
> I think yes, as page table should be updated only by 1 CPU at the same
> time. And before give ability to other CPU to update page table we have
> to finish a work on current CPU.

Can you then explain to me, perhaps by way of an example, what will go
wrong if the unlock is ahead of the flush? (I'm less certain about the
fence, and that's also less expensive.)

>>> +int map_pages_to_xen(unsigned long virt,
>>> +                     mfn_t mfn,
>>> +                     unsigned long nr_mfns,
>>> +                     unsigned int flags)
>>> +{
>>> +    /*
>>> +     * Ensure that we have a valid MFN before proceeding.
>>> +     *
>>> +     * If the MFN is invalid, pt_update() might misinterpret the
>>> operation,
>>> +     * treating it as either a population, a mapping destruction,
>>> +     * or a mapping modification.
>>> +     */
>>> +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
>>
>> But flags with PTE_VALID not set are fine to come into here?
> It is fine for pt_update() but I don't know if it is fine for
> map_pages_to_xen(). I see that other architectures don't check that.

That's not my point here. It's rather along the lines of an earlier
that I gave here: Since pte_update() is a pretty generic function, will
flags not having PTE_VALID set perhaps result in pte_update() doing
something that's not what the caller might expect?

And yes, there's a special case of map_pages_to_xen() being called with
_PAGE_NONE (if an arch defines such). That special case plays into here:
If an arch doesn't define it, unmap requests ought to exclusively come
through destroy_xen_mappings().

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 2/7] xen/riscv: set up fixmap mappings
  2024-08-14 15:08       ` Jan Beulich
@ 2024-08-15  8:16         ` oleksii.kurochko
  0 siblings, 0 replies; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-15  8:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Wed, 2024-08-14 at 17:08 +0200, Jan Beulich wrote:
> On 14.08.2024 16:21, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-13 at 10:22 +0200, Jan Beulich wrote:
> > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > --- a/xen/arch/riscv/include/asm/page.h
> > > > +++ b/xen/arch/riscv/include/asm/page.h
> > > > @@ -81,6 +81,12 @@ static inline void
> > > > flush_page_to_ram(unsigned
> > > > long mfn, bool sync_icache)
> > > >      BUG_ON("unimplemented");
> > > >  }
> > > >  
> > > > +/* Write a pagetable entry. */
> > > > +static inline void write_pte(pte_t *p, pte_t pte)
> > > > +{
> > > > +    *p = pte;
> > > > +}
> > > 
> > > No use of write_atomic() here? And no read_pte() counterpart
> > > right
> > > away (then
> > > also properly using read_atomic())?
> > I wanted to avoid the fence before "*p=pte" which in case of
> > write_atomic() will be present.
> 
> Well, this goes back to write_atomic() resolving to write{b,w,l,q}()
> for
> unclear reasons; even the comment in our atomic.h says "For some
> reason".
> The fence there is of course getting in the way here. I keep
> forgetting
> about that aspect, because neither x86 nor Arm has anything similar
> afaics.
Good point. I overlooked something here ( and misinterpreted your
comments regarding that write_atomic() implementation ) but I think it
would be better to use write{b,w,l,q}_cpu() for write_atomic.

~ Oleksii

> 
> > Won't it be enough to use here WRITE_ONCE()?
> 
> Maybe. I'm not entirely sure.
> 
> Jan



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-14 15:22       ` Jan Beulich
@ 2024-08-15  8:55         ` oleksii.kurochko
  2024-08-15  9:02           ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-15  8:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Wed, 2024-08-14 at 17:22 +0200, Jan Beulich wrote:
> On 14.08.2024 16:45, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-13 at 10:54 +0200, Jan Beulich wrote:
> > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > --- a/xen/arch/riscv/include/asm/smp.h
> > > > +++ b/xen/arch/riscv/include/asm/smp.h
> > > > @@ -5,6 +5,8 @@
> > > >  #include <xen/cpumask.h>
> > > >  #include <xen/percpu.h>
> > > >  
> > > > +#define INVALID_HARTID UINT_MAX
> > > > +
> > > >  DECLARE_PER_CPU(cpumask_var_t, cpu_sibling_mask);
> > > >  DECLARE_PER_CPU(cpumask_var_t, cpu_core_mask);
> > > >  
> > > > @@ -14,6 +16,14 @@ DECLARE_PER_CPU(cpumask_var_t,
> > > > cpu_core_mask);
> > > >   */
> > > >  #define park_offline_cpus false
> > > >  
> > > > +void smp_set_bootcpu_id(unsigned long boot_cpu_hartid);
> > > > +
> > > > +/*
> > > > + * Mapping between linux logical cpu index and hartid.
> > > > + */
> > > > +extern unsigned long cpuid_to_hartid_map[NR_CPUS];
> > > > +#define cpuid_to_hartid(cpu) cpuid_to_hartid_map[cpu]
> > > 
> > > How wide can hart IDs be? Wider than 32 bits? If not, why
> > > unsigned
> > > long?
> > According to the spec:
> > ```
> > The mhartid CSR is an MXLEN-bit read-only register containing the
> > integer ID of the hardware thread running the code
> > ```
> > where MXLEN can bit 32 and 64.
> 
> Hmm, okay. If the full MXLEN bits can be used, then using unsigned
> long
> is indeed the right thing here.
> 
> > > > @@ -40,6 +41,19 @@ void __init noreturn start_xen(unsigned long
> > > > bootcpu_id,
> > > >  {
> > > >      remove_identity_mapping();
> > > >  
> > > > +    /*
> > > > +     * tp register contains an address of physical cpu
> > > > information.
> > > > +     * So write physical CPU info of boot cpu to tp register
> > > > +     * It will be used later by get_processor_id() ( look at
> > > > +     * <asm/processor.h> ):
> > > > +     *   #define get_processor_id()    (tp->processor_id)
> > > > +     */
> > > > +    asm volatile ( "mv tp, %0" : : "r"((unsigned
> > > > long)&pcpu_info[0]) );
> > > 
> > > Perhaps be on the safe side and also add a memory barrier?
> > Do you mean compiler barrier? ( asm volatile ( "..." :: ... :
> > "memory"
> > )? )
> 
> Yes.
> 
> > > Perhaps be on the safe side and do this absolutely first in the
> > > function,
> > > or even in assembly (such that initializers of future variables
> > > declared
> > > at the top of the function end up safe, too)?
> > I am not sure that I am following your part related to put this
> > code in
> > assembler ( instructions in assembly code still code be re-ordered
> > what
> > can affect a time when tp will be set with correct value )
> 
> I'm afraid I don't understand this. Instructions can be re-ordered,
> sure,
> but later instructions are guaranteed to observe the effects on
> register
> state that earlier instructions caused.
> 
> > and what do
> > you mean by "initializers of future variables declared at the top
> > of
> > the function end up safe"
> 
> With
> 
> void start_xen()
> {
>     int var = func();
> 
>     asm volatile ( "mv tp, %0" :: ...);
>     ...
> 
> you end up with the requirement that func() must not use anything
> that
> depends on tp being set. In this simple example it may be easy to say
> "don't use an initializer and call the function afterwards". But that
> is
> going to be a risky game to play. Look at x86'es __start_xen(). An
> insertion into such a big set of declarations may not pay attention
> to
> tp not being set yet, when _all_ other C code may reasonably assume
> tp
> is set.
Thanks for the clarification. I missed the possibility that someone
might accidentally use tp before it is set. In this case, I agree that
it would be better to create a setup_tp() function and call it before
start_xen().

> 
> > > > --- /dev/null
> > > > +++ b/xen/arch/riscv/smp.c
> > > > @@ -0,0 +1,4 @@
> > > > +#include <xen/smp.h>
> > > > +
> > > > +/* tp points to one of these per cpu */
> > > > +struct pcpu_info pcpu_info[NR_CPUS];
> > > 
> > > And they all need setting up statically? Is there a plan to make
> > > this
> > > dynamic (which could be recorded in a "fixme" in the comment)?
> > I didn't plan to make allocation of this array dynamic. I don't
> > expect
> > that NR_CPUS will be big.
> 
> What is this expectation of yours based on? Other architectures
> permit
> systems with hundreds or even thousands of CPUs; why would RISC-V be
> different there?
Based on available dev boards. ( what isn't really strong argument )

I checked other architectures and they are using static allocation too:
   struct cpuinfo_x86 cpu_data[NR_CPUS];
   
   u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
   	{ [0 ... NR_CPUS-1] = BAD_APICID };
   
   ... /* Arm */
   
   struct cpuinfo_arm cpu_data[NR_CPUS];

I wanted to check to understand which one API should be used to
allocate this array dynamically. xmalloc?

And I am curious how I can use xmalloc() at this stage if page
allocator (?) should be initialized what isn't true now.
Or just to allocate pcpu_info only for boot cpu and for other then use
xmalloc()?

> 
> > I can add "fixme" but I am not really
> > understand what will be advantages if pcpu_info[] will be allocated
> > dynamically.
> 
> Where possible it's better to avoid static allocations, of which on
> some systems only a very small part may be used. Even if you put
> yourself
> on the position that many take - memory being cheap - you then still
> waste cache and TLB bandwidth. Furthermore as long as struct
> pcpu_info
> isn't big enough (and properly aligned) for two successive array
> entries
> to not share cache lines, you may end up playing cacheline ping-pong
> when a CPU writes to its own array slot.
Why the mentioned issues aren't work for dynamic memory? We still
allocating memory for sizeof(pcpu_info) * NR_CPUS and when this
allocated memory access it will go to cache, CPU/MMU still will use TLB
for address translation for this region and without a proper alignment
of pcpu_info size it still could be an issue with cache line sharing.

~ Oleksii



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-15  8:55         ` oleksii.kurochko
@ 2024-08-15  9:02           ` Jan Beulich
  2024-08-15 13:29             ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-15  9:02 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 15.08.2024 10:55, oleksii.kurochko@gmail.com wrote:
> On Wed, 2024-08-14 at 17:22 +0200, Jan Beulich wrote:
>> On 14.08.2024 16:45, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-13 at 10:54 +0200, Jan Beulich wrote:
>>>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>>>> --- /dev/null
>>>>> +++ b/xen/arch/riscv/smp.c
>>>>> @@ -0,0 +1,4 @@
>>>>> +#include <xen/smp.h>
>>>>> +
>>>>> +/* tp points to one of these per cpu */
>>>>> +struct pcpu_info pcpu_info[NR_CPUS];
>>>>
>>>> And they all need setting up statically? Is there a plan to make
>>>> this
>>>> dynamic (which could be recorded in a "fixme" in the comment)?
>>> I didn't plan to make allocation of this array dynamic. I don't
>>> expect
>>> that NR_CPUS will be big.
>>
>> What is this expectation of yours based on? Other architectures
>> permit
>> systems with hundreds or even thousands of CPUs; why would RISC-V be
>> different there?
> Based on available dev boards. ( what isn't really strong argument )
> 
> I checked other architectures and they are using static allocation too:
>    struct cpuinfo_x86 cpu_data[NR_CPUS];
>    
>    u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
>    	{ [0 ... NR_CPUS-1] = BAD_APICID };
>    
>    ... /* Arm */
>    
>    struct cpuinfo_arm cpu_data[NR_CPUS];
> 
> I wanted to check to understand which one API should be used to
> allocate this array dynamically. xmalloc?

As of a few days ago xvmalloc() (or friends thereof), as long as ...

> And I am curious how I can use xmalloc() at this stage if page
> allocator (?) should be initialized what isn't true now.

... this happens late enough in the boot process. Indeed ...

> Or just to allocate pcpu_info only for boot cpu and for other then use
> xmalloc()?

... statically allocating space for the boot CPU only is another option.

>>> I can add "fixme" but I am not really
>>> understand what will be advantages if pcpu_info[] will be allocated
>>> dynamically.
>>
>> Where possible it's better to avoid static allocations, of which on
>> some systems only a very small part may be used. Even if you put
>> yourself
>> on the position that many take - memory being cheap - you then still
>> waste cache and TLB bandwidth. Furthermore as long as struct
>> pcpu_info
>> isn't big enough (and properly aligned) for two successive array
>> entries
>> to not share cache lines, you may end up playing cacheline ping-pong
>> when a CPU writes to its own array slot.
> Why the mentioned issues aren't work for dynamic memory? We still
> allocating memory for sizeof(pcpu_info) * NR_CPUS

Why NR_CPUS? At runtime you know how may CPUs the system has you're
running on. You only need to allocate as much then. Just like e.g.
dynamically allocated CPU mask variables (cpumask_var_t) deliberately
use less than NR_CPUS bits unless on really big iron.

Jan

> and when this
> allocated memory access it will go to cache, CPU/MMU still will use TLB
> for address translation for this region and without a proper alignment
> of pcpu_info size it still could be an issue with cache line sharing.
> 
> ~ Oleksii
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-14 15:53       ` Jan Beulich
@ 2024-08-15 10:00         ` oleksii.kurochko
  2024-08-15 10:32           ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-15 10:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Wed, 2024-08-14 at 17:53 +0200, Jan Beulich wrote:
> On 14.08.2024 17:41, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-13 at 11:34 +0200, Jan Beulich wrote:
> > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > +static unsigned long sbi_major_version(void)
> > > > +{
> > > > +    return (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT)
> > > > &
> > > > +        SBI_SPEC_VERSION_MAJOR_MASK;
> > > > +}
> > > 
> > > Can you use MASK_EXTR() here, allowing to even drop the separate
> > > SBI_SPEC_VERSION_MAJOR_SHIFT?
> > I am not sure that:
> > (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) & 
> > SBI_SPEC_VERSION_MAJOR_MASK == MASK_EXTR(sbi_spec_version,
> > SBI_SPEC_VERSION_MAJOR_MASK)
> 
> How come you're not sure? That's precisely what MASK_EXTR() is for.
As the order matter here. First should be shift then applying
SBI_SPEC_VERSION_MAJOR_MASK, as SBI_SPEC_VERSION_MAJOR_MASK is defined
now as 7f. If SBI_SPEC_VERSION_MAJOR_MASK set to 0x7F000000 then it
will work.

> 
> > > > +static long sbi_ext_base_func(long fid)
> > > > +{
> > > > +    struct sbiret ret;
> > > > +
> > > > +    ret = sbi_ecall(SBI_EXT_BASE, fid, 0, 0, 0, 0, 0, 0);
> > > > +    if ( !ret.error )
> > > > +        return ret.value;
> > > > +    else
> > > > +        return ret.error;
> > > 
> > > With the folding of two distinct values, how is the caller going
> > > to
> > > tell failure from success?
> > By checking if the return value is < 0.
> > According to the SBI spec [
> > https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/binary-encoding.adoc?plain=1#L32
> > ] ret.error can be or 0 ( which means success ) or something < 0 if
> > it
> > was a failure.
> 
> Right. And what if ret.error was 0, but ret.value was negative?
I wasn't able to find a case in the SBI spec where sbiret.value could
be negative. Unfortunately, the spec does not specify the possible
values of sbiret.value, but based on the description of the SBI
function, it only returns a negative value in the case of an error.
Otherwise, ret.value is always greater than or equal to 0.

> 
> > > > +int sbi_remote_sfence_vma(cpumask_t *cpu_mask,
> > > > +                          unsigned long start_addr,
> > > > +                          unsigned long size)
> > > > +{
> > > > +    return __sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA,
> > > > +                        cpu_mask, start_addr, size, 0, 0);
> > > 
> > > No check (not even an assertion) here for __sbi_rfence still
> > > being
> > > NULL?
> > I thought that it would be enough to have BUG_ON() in sbi_init()
> > but
> > then probably would be better to update the message inside
> > BUG_ON():
> >    int __init sbi_init(void)
> >    {
> >    ....
> >    
> >        if ( !sbi_has_rfence() )
> >        {
> >            BUG_ON("At the moment flush_xen_tlb_range_va() uses SBI
> > rfence
> >    to "
> >                   "flush TLB for all CPUS!");
> 
> I never really liked this kind of BUG_ON(). I leave it uncommented in
> code which clearly is only temporary. Plus imo such BUG_ON()s want to
> be next to where the risk is, i.e. in this case ahead of the possible
> NULL deref. Then again, without PV guests and without anything mapped
> at VA 0, you'll crash cleanly anyway. So perhaps my request to add a
> check went too far.
But then there is not big difference whether BUG_ON() or ASSERT()
inside sbi_remote_sfence_vma() is present. Even none of the check
present it will be a crash anyway.

And the other reason why I thought it would be better to have BUG_ON()
in sbi_init() as it will be need only one BUG_ON() instead of having
ASSERT() in each function which will use __sbi_rfence().

And according to coding-best-practices.pandoc:
```
 * Programmers can use ASSERT(), which will cause the check to be
   executed in DEBUG builds, and cause the hypervisor to crash if it's
   violated
 * Programmers can use BUG_ON(), which will cause the check to be
   executed in both DEBUG and non-DEBUG builds, and cause the
hypervisor
   to crash if it's violated.
```
The difference between them is only the check is working only in DEBUG
build or both in DEBUG and non-DEBUG.

~ Oleksii


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-15 10:00         ` oleksii.kurochko
@ 2024-08-15 10:32           ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2024-08-15 10:32 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 15.08.2024 12:00, oleksii.kurochko@gmail.com wrote:
> On Wed, 2024-08-14 at 17:53 +0200, Jan Beulich wrote:
>> On 14.08.2024 17:41, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-13 at 11:34 +0200, Jan Beulich wrote:
>>>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>>>> +static unsigned long sbi_major_version(void)
>>>>> +{
>>>>> +    return (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT)
>>>>> &
>>>>> +        SBI_SPEC_VERSION_MAJOR_MASK;
>>>>> +}
>>>>
>>>> Can you use MASK_EXTR() here, allowing to even drop the separate
>>>> SBI_SPEC_VERSION_MAJOR_SHIFT?
>>> I am not sure that:
>>> (sbi_spec_version >> SBI_SPEC_VERSION_MAJOR_SHIFT) & 
>>> SBI_SPEC_VERSION_MAJOR_MASK == MASK_EXTR(sbi_spec_version,
>>> SBI_SPEC_VERSION_MAJOR_MASK)
>>
>> How come you're not sure? That's precisely what MASK_EXTR() is for.
> As the order matter here. First should be shift then applying
> SBI_SPEC_VERSION_MAJOR_MASK, as SBI_SPEC_VERSION_MAJOR_MASK is defined
> now as 7f. If SBI_SPEC_VERSION_MAJOR_MASK set to 0x7F000000 then it
> will work.

The assumption was of course that you'd adjust SBI_SPEC_VERSION_MAJOR_MASK.
That's the underlying idea after all: When the mask is shifted to its
appropriate position, the shift amount needed can be calculated from it.
Hence the lack of a need for SBI_SPEC_VERSION_MAJOR_SHIFT (and alike).

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-15  8:09       ` Jan Beulich
@ 2024-08-15 11:21         ` oleksii.kurochko
  2024-08-15 12:16           ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-15 11:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
> On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > Introduce internal macros starting with PTE_* for convenience.
> > > > These macros closely resemble PTE bits, with the exception of
> > > > PTE_BLOCK, which indicates that a page larger than 4KB is
> > > > needed.
> > > 
> > > I did comment on this too, iirc: Is there going to be any case
> > > where
> > > large pages are going to be "needed", i.e. not just preferred? If
> > > not,
> > > giving the caller control over things the other way around
> > > (requesting
> > > 4k mappings are needed, as we have it in x86) may be preferable.
> > Yes, you did the comment but I thought that it will be enough to
> > mention that shattering isn't supported now and  also since
> > pt_update_entry()is used to unmap as well, there could be a need to
> > unmap e.g. 4K page from 2M block mapping what will a little bit
> > harder
> > then just having 4k by default.
> 
> Shattering isn't supported now, but that's going to change at some
> point,
> I suppose. Where possible the long-term behavior wants taking into
> account
> right away, to avoid having to e.g. touch all callers again later on.
Arm still leaves without shattering support for Xen pages:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/mmu/pt.c?ref_type=heads#L454

So it can be pretty long-term behaviour.

> 
> > > Hmm, but then ...
> > > 
> > > > RISC-V detects superpages using `pte.x` and `pte.r`, as there
> > > > is no specific bit in the PTE for this purpose. From the RISC-V
> > > > spec:
> > > > ```
> > > >   ...
> > > >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go
> > > > to
> > > > step 5.
> > > >      Otherwise, this PTE is a pointer to the next level of the
> > > > page
> > > > table.
> > > >      ... .
> > > >   5. A leaf PTE has been found.
> > > >      ...
> > > >   ...
> > > > ```
> > > > 
> > > > The code doesn’t support super page shattering so 4KB pages are
> > > > used as
> > > > default.
> > > 
> > > ... you have this. Yet still callers expecting re-mapping in the
> > > (large)
> > > range they map can request small-page mappings right away.
> > I am not sure that I fully understand what do you mean by "expcting
> > re-
> > mapping".
> 
> Right now you have callers pass PTE_BLOCK when they know that no
> small
> page re-mappings are going to occur for an area. What I'm suggesting
> is
> that you invert this logic: Have callers pass PTE_SMALL when there is
> a possibility that re-mapping requests may be issued later. Then,
> later, by simply grep-ing for PTE_SMALL you'll be able to easily find
> all candidates that possibly can be relaxed when super-page
> shattering
> starts being supported. That's going to be easier than finding all
> instances where PTE_BLOCK is _not_used.
So if I understand correctly. Actually nothing will change in algorithm
of pt_update() and only PTE_SMALL should be introduced instead of
PTE_BLOCK. And if I will know that something will be better to map as
PTE_SMALL to not face shattering in case of unmap (for example) I just
can map this memory as PTE_SMALL and that is it?

> > > > --- /dev/null
> > > > +++ b/xen/arch/riscv/pt.c
> > > > @@ -0,0 +1,408 @@
> > > > +#include <xen/bug.h>
> > > > +#include <xen/domain_page.h>
> > > > +#include <xen/errno.h>
> > > > +#include <xen/mm.h>
> > > > +#include <xen/mm-frame.h>
> > > > +#include <xen/pmap.h>
> > > > +#include <xen/spinlock.h>
> > > > +
> > > > +#include <asm/flushtlb.h>
> > > > +#include <asm/page.h>
> > > > +
> > > > +static inline const mfn_t get_root_page(void)
> > > > +{
> > > > +    unsigned long root_maddr =
> > > 
> > > maddr_t or paddr_t?
> > > 
> > > > +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> > > > +
> > > > +    return maddr_to_mfn(root_maddr);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Sanity check of the entry
> > > > + * mfn is not valid and we are not populating page table. This
> > > > means
> > > 
> > > How does this fit with ...
> > > 
> > > > + * we either modify entry or remove an entry.
> > > > + */
> > > > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned
> > > > int
> > > > flags)
> > > > +{
> > > > +    /* Sanity check when modifying an entry. */
> > > > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> > > 
> > > ... the MFN check here?
> > The comment is incorrect and should be corrected:
> > " if mfn is valid or ... "
> > 
> > >  And why is (valid,INVALID_MFN) an indication
> > > of a modification? But then ...
> > the explanation is in the comment to pt_update():
> >    /*
> >     * If `mfn` equals `INVALID_MFN`, it indicates that the
> > following page
> >    table
> >     * update operation might be related to either populating the
> > table,
> >     * destroying a mapping, or modifying an existing mapping.
> >     */
> >    static int pt_update(unsigned long virt,
> 
> And how do readers know that comments in pt_update() are crucial for
> understanding what pt_check_entry() does? You certainly don't need to
> have the same comment in two places, but you at least want to refer
> to a relevant comment when that lives elsewhere.
Sure, I will update the comment in pt_check_entry() properly if this
function still makes any sense.

> 
> > > > +static int pt_update(unsigned long virt,
> > > > +                     mfn_t mfn,
> > > > +                     unsigned long nr_mfns,
> > > > +                     unsigned int flags)
> > > > +{
> > > > +    int rc = 0;
> > > > +    unsigned long vfn = virt >> PAGE_SHIFT;
> > > > +    unsigned long left = nr_mfns;
> > > > +
> > > > +    const mfn_t root = get_root_page();
> > > > +
> > > > +    /*
> > > > +     * It is bad idea to have mapping both writeable and
> > > > +     * executable.
> > > > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > > > +     * prevent any update if this happen.
> > > > +     */
> > > > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
> > > > PTE_X_MASK(flags) )
> > > > +    {
> > > > +        printk("Mappings should not be both Writeable and
> > > > Executable.\n");
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > > +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> > > > +    {
> > > > +        printk("The virtual address is not aligned to the
> > > > page-
> > > > size.\n");
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > > +    spin_lock(&xen_pt_lock);
> > > > +
> > > > +    while ( left )
> > > > +    {
> > > > +        unsigned int order, level;
> > > > +
> > > > +        level = pt_mapping_level(vfn, mfn, left, flags);
> > > > +        order = XEN_PT_LEVEL_ORDER(level);
> > > > +
> > > > +        ASSERT(left >= BIT(order, UL));
> > > > +
> > > > +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn,
> > > > level,
> > > > +                                    flags);
> > > 
> > > Indentation.
> > > 
> > > > +        if ( rc )
> > > > +            break;
> > > > +
> > > > +        vfn += 1U << order;
> > > > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > > > +            mfn = mfn_add(mfn, 1U << order);
> > > > +
> > > > +        left -= (1U << order);
> > > 
> > > To be on thje safe side, 1UL everywhere?
> > > 
> > > > +        if ( rc )
> > > > +            break;
> > > 
> > > There was such a check already a few lines up from here.
> > > 
> > > > +    }
> > > > +
> > > > +
> > > > +    /* ensure that PTEs are all updated before flushing */
> > > 
> > > Again: No double blank lines please.
> > > 
> > > > +    RISCV_FENCE(rw, rw);
> > > > +
> > > > +    /*
> > > > +     * always flush TLB at the end of the function as non-
> > > > present
> > > > entries
> > > > +     * can be put in the TLB
> > > > +     */
> > > > +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
> > > 
> > > Coming back to "negative" TLB entries: Assuming RISC-V, just like
> > > other
> > > architectures, also permits intermediate page table entries to be
> > > cached,
> > > the affected VA range may be much larger than the original
> > > request,
> > > if
> > > intermediate page tables needed creating.
> > It could be an issue. Could we some how  to calculate the proper
> > range
> > or the only option we have is to flush all.
> 
> Right - either you maintain state to know the biggest possible range
> that can be affected, or you flush all when a new intermediate page
> table needed inserting.
I think that the second one option will be easier to implement in the
current implementation. It is not issue for now as fixmap, fdt and xen
are in the same slot so no new intermediate page tables are needed.

> 
> > > > +    spin_unlock(&xen_pt_lock);
> > > 
> > > Does this really need to come after fence and flush?
> > I think yes, as page table should be updated only by 1 CPU at the
> > same
> > time. And before give ability to other CPU to update page table we
> > have
> > to finish a work on current CPU.
> 
> Can you then explain to me, perhaps by way of an example, what will
> go
> wrong if the unlock is ahead of the flush? (I'm less certain about
> the
> fence, and that's also less expensive.)
pt_update() will be called for interleaved region, for example, by
different CPUs:

                     pt_update():
CPU1:                                    CPU2:
 ...                                spin_lock(&xen_pt_lock);
RISCV_FENCE(rw, rw);                 ....

/* After this function will be
   executed the following thing
   can happen ------------------>  start to update page table
*/                                 entries which was partially      
spin_unlock(&xen_pt_lock);         created during CPU1 but CPU2       
....                               doesn't know about them yet        
....                               because flush_tlb() ( sfence.vma ) 
....                               wasn't done      
....                                                                  
flush_tlb_range_va();

And it can be an issue if I understand correctly.
> 
> > > > +int map_pages_to_xen(unsigned long virt,
> > > > +                     mfn_t mfn,
> > > > +                     unsigned long nr_mfns,
> > > > +                     unsigned int flags)
> > > > +{
> > > > +    /*
> > > > +     * Ensure that we have a valid MFN before proceeding.
> > > > +     *
> > > > +     * If the MFN is invalid, pt_update() might misinterpret
> > > > the
> > > > operation,
> > > > +     * treating it as either a population, a mapping
> > > > destruction,
> > > > +     * or a mapping modification.
> > > > +     */
> > > > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> > > 
> > > But flags with PTE_VALID not set are fine to come into here?
> > It is fine for pt_update() but I don't know if it is fine for
> > map_pages_to_xen(). I see that other architectures don't check
> > that.
> 
> That's not my point here. It's rather along the lines of an earlier
> that I gave here: Since pte_update() is a pretty generic function,
> will
> flags not having PTE_VALID set perhaps result in pte_update() doing
> something that's not what the caller might expect?
I think that everything will be okay, if PTE_VALID is set then it means
that pt_update() should update ( modify/remove/insert ) page table
entry and all the cases which isn't expected by the logic should be
covered by pt_check_entry().

and the case if when page table couldn't be mapped:
```
           rc = pt_next_level(alloc_only, &table, offsets[level]);
           if ( rc == XEN_TABLE_MAP_FAILED )
           {
               rc = 0;
   
               /*
                * We are here because pt_next_level has failed to map
                * the intermediate page table (e.g the table does not
   exist
                * and the pt is read-only). It is a valid case when
                * removing a mapping as it may not exist in the page
   table.
                * In this case, just ignore it.
                */
               if ( flags & PTE_VALID )
               {
                   printk("%s: Unable to map level %u\n", __func__,
   level);
                   rc = -ENOENT;
               }
```
> 
> And yes, there's a special case of map_pages_to_xen() being called
> with
> _PAGE_NONE (if an arch defines such). That special case plays into
> here:
> If an arch doesn't define it, unmap requests ought to exclusively
> come
> through destroy_xen_mappings().
I thought that it should always done through destroy_xen_mappings().

Arm doesn't introduce _PAGE_NONE and pt_update() is based on Arm's
version of xen_pt_update() so this special case should be covered
properly.

And it seems to me (if I am not confusing something ) that if it is
necessary to unmap pages mapped by map_pages_to_xen() they are using
destroy_xen_mappings() which is defined using xen_pt_update():
   int modify_xen_mappings(unsigned long s, unsigned long e, unsigned
   int nf)
   {
       ASSERT(IS_ALIGNED(s, PAGE_SIZE));
       ASSERT(IS_ALIGNED(e, PAGE_SIZE));
       ASSERT(s <= e);
       return xen_pt_update(s, INVALID_MFN, (e - s) >> PAGE_SHIFT, nf);
   }

~ Oleksii


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-15 11:21         ` oleksii.kurochko
@ 2024-08-15 12:16           ` Jan Beulich
  2024-08-15 13:34             ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-15 12:16 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
>> On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>>>> RISC-V detects superpages using `pte.x` and `pte.r`, as there
>>>>> is no specific bit in the PTE for this purpose. From the RISC-V
>>>>> spec:
>>>>> ```
>>>>>   ...
>>>>>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go
>>>>> to
>>>>> step 5.
>>>>>      Otherwise, this PTE is a pointer to the next level of the
>>>>> page
>>>>> table.
>>>>>      ... .
>>>>>   5. A leaf PTE has been found.
>>>>>      ...
>>>>>   ...
>>>>> ```
>>>>>
>>>>> The code doesn’t support super page shattering so 4KB pages are
>>>>> used as
>>>>> default.
>>>>
>>>> ... you have this. Yet still callers expecting re-mapping in the
>>>> (large)
>>>> range they map can request small-page mappings right away.
>>> I am not sure that I fully understand what do you mean by "expcting
>>> re-
>>> mapping".
>>
>> Right now you have callers pass PTE_BLOCK when they know that no
>> small
>> page re-mappings are going to occur for an area. What I'm suggesting
>> is
>> that you invert this logic: Have callers pass PTE_SMALL when there is
>> a possibility that re-mapping requests may be issued later. Then,
>> later, by simply grep-ing for PTE_SMALL you'll be able to easily find
>> all candidates that possibly can be relaxed when super-page
>> shattering
>> starts being supported. That's going to be easier than finding all
>> instances where PTE_BLOCK is _not_used.
> So if I understand correctly. Actually nothing will change in algorithm
> of pt_update() and only PTE_SMALL should be introduced instead of
> PTE_BLOCK. And if I will know that something will be better to map as
> PTE_SMALL to not face shattering in case of unmap (for example) I just
> can map this memory as PTE_SMALL and that is it?

That is it.

>>>>> +    spin_unlock(&xen_pt_lock);
>>>>
>>>> Does this really need to come after fence and flush?
>>> I think yes, as page table should be updated only by 1 CPU at the
>>> same
>>> time. And before give ability to other CPU to update page table we
>>> have
>>> to finish a work on current CPU.
>>
>> Can you then explain to me, perhaps by way of an example, what will
>> go
>> wrong if the unlock is ahead of the flush? (I'm less certain about
>> the
>> fence, and that's also less expensive.)
> pt_update() will be called for interleaved region, for example, by
> different CPUs:
> 
>                      pt_update():
> CPU1:                                    CPU2:
>  ...                                spin_lock(&xen_pt_lock);
> RISCV_FENCE(rw, rw);                 ....
> 
> /* After this function will be
>    executed the following thing
>    can happen ------------------>  start to update page table
> */                                 entries which was partially      
> spin_unlock(&xen_pt_lock);         created during CPU1 but CPU2       
> ....                               doesn't know about them yet        
> ....                               because flush_tlb() ( sfence.vma ) 
> ....                               wasn't done      
> ....                                                                  
> flush_tlb_range_va();

Not exactly: CPU2 knows about them as far as the memory used / modified
goes, and that's all that matters for further page table modifications.
CPU2 only doesn't know about the new page table entries yet when it comes
to using them for a translation (by the hardware page walker). Yet this
aspect is irrelevant here, if I'm not mistaken.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-15  9:02           ` Jan Beulich
@ 2024-08-15 13:29             ` oleksii.kurochko
  2024-08-15 15:24               ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-15 13:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Thu, 2024-08-15 at 11:02 +0200, Jan Beulich wrote:
> On 15.08.2024 10:55, oleksii.kurochko@gmail.com wrote:
> > On Wed, 2024-08-14 at 17:22 +0200, Jan Beulich wrote:
> > > On 14.08.2024 16:45, oleksii.kurochko@gmail.com wrote:
> > > > On Tue, 2024-08-13 at 10:54 +0200, Jan Beulich wrote:
> > > > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > > > --- /dev/null
> > > > > > +++ b/xen/arch/riscv/smp.c
> > > > > > @@ -0,0 +1,4 @@
> > > > > > +#include <xen/smp.h>
> > > > > > +
> > > > > > +/* tp points to one of these per cpu */
> > > > > > +struct pcpu_info pcpu_info[NR_CPUS];
> > > > > 
> > > > > And they all need setting up statically? Is there a plan to
> > > > > make
> > > > > this
> > > > > dynamic (which could be recorded in a "fixme" in the
> > > > > comment)?
> > > > I didn't plan to make allocation of this array dynamic. I don't
> > > > expect
> > > > that NR_CPUS will be big.
> > > 
> > > What is this expectation of yours based on? Other architectures
> > > permit
> > > systems with hundreds or even thousands of CPUs; why would RISC-V
> > > be
> > > different there?
> > Based on available dev boards. ( what isn't really strong argument
> > )
> > 
> > I checked other architectures and they are using static allocation
> > too:
> >    struct cpuinfo_x86 cpu_data[NR_CPUS];
> >    
> >    u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
> >    	{ [0 ... NR_CPUS-1] = BAD_APICID };
> >    
> >    ... /* Arm */
> >    
> >    struct cpuinfo_arm cpu_data[NR_CPUS];
> > 
> > I wanted to check to understand which one API should be used to
> > allocate this array dynamically. xmalloc?
> 
> As of a few days ago xvmalloc() (or friends thereof), as long as ...
> 
> > And I am curious how I can use xmalloc() at this stage if page
> > allocator (?) should be initialized what isn't true now.
> 
> ... this happens late enough in the boot process. Indeed ...
> 
> > Or just to allocate pcpu_info only for boot cpu and for other then
> > use
> > xmalloc()?
> 
> ... statically allocating space for the boot CPU only is another
> option.
> 
> > > > I can add "fixme" but I am not really
> > > > understand what will be advantages if pcpu_info[] will be
> > > > allocated
> > > > dynamically.
> > > 
> > > Where possible it's better to avoid static allocations, of which
> > > on
> > > some systems only a very small part may be used. Even if you put
> > > yourself
> > > on the position that many take - memory being cheap - you then
> > > still
> > > waste cache and TLB bandwidth. Furthermore as long as struct
> > > pcpu_info
> > > isn't big enough (and properly aligned) for two successive array
> > > entries
> > > to not share cache lines, you may end up playing cacheline ping-
> > > pong
> > > when a CPU writes to its own array slot.
> > Why the mentioned issues aren't work for dynamic memory? We still
> > allocating memory for sizeof(pcpu_info) * NR_CPUS
> 
> Why NR_CPUS? At runtime you know how may CPUs the system has you're
> running on. You only need to allocate as much then. Just like e.g.
> dynamically allocated CPU mask variables (cpumask_var_t) deliberately
> use less than NR_CPUS bits unless on really big iron.
I thought that NR_CPUS tells me how many CPU the system has.

The other option is to read that from DTB but then pcpu_info[] can be
allocated only after DTB will be mapped.

~ Oleksii

> 
> > and when this
> > allocated memory access it will go to cache, CPU/MMU still will use
> > TLB
> > for address translation for this region and without a proper
> > alignment
> > of pcpu_info size it still could be an issue with cache line
> > sharing.
> > 
> > ~ Oleksii
> > 
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-15 12:16           ` Jan Beulich
@ 2024-08-15 13:34             ` oleksii.kurochko
  2024-08-15 15:26               ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-15 13:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Thu, 2024-08-15 at 14:16 +0200, Jan Beulich wrote:
> On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
> > On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
> > > On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> > > > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > > > RISC-V detects superpages using `pte.x` and `pte.r`, as
> > > > > > there
> > > > > > is no specific bit in the PTE for this purpose. From the
> > > > > > RISC-V
> > > > > > spec:
> > > > > > ```
> > > > > >   ...
> > > > > >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x =
> > > > > > 1, go
> > > > > > to
> > > > > > step 5.
> > > > > >      Otherwise, this PTE is a pointer to the next level of
> > > > > > the
> > > > > > page
> > > > > > table.
> > > > > >      ... .
> > > > > >   5. A leaf PTE has been found.
> > > > > >      ...
> > > > > >   ...
> > > > > > ```
> > > > > > 
> > > > > > The code doesn’t support super page shattering so 4KB pages
> > > > > > are
> > > > > > used as
> > > > > > default.
> > > > > 
> > > > > ... you have this. Yet still callers expecting re-mapping in
> > > > > the
> > > > > (large)
> > > > > range they map can request small-page mappings right away.
> > > > I am not sure that I fully understand what do you mean by
> > > > "expcting
> > > > re-
> > > > mapping".
> > > 
> > > Right now you have callers pass PTE_BLOCK when they know that no
> > > small
> > > page re-mappings are going to occur for an area. What I'm
> > > suggesting
> > > is
> > > that you invert this logic: Have callers pass PTE_SMALL when
> > > there is
> > > a possibility that re-mapping requests may be issued later. Then,
> > > later, by simply grep-ing for PTE_SMALL you'll be able to easily
> > > find
> > > all candidates that possibly can be relaxed when super-page
> > > shattering
> > > starts being supported. That's going to be easier than finding
> > > all
> > > instances where PTE_BLOCK is _not_used.
> > So if I understand correctly. Actually nothing will change in
> > algorithm
> > of pt_update() and only PTE_SMALL should be introduced instead of
> > PTE_BLOCK. And if I will know that something will be better to map
> > as
> > PTE_SMALL to not face shattering in case of unmap (for example) I
> > just
> > can map this memory as PTE_SMALL and that is it?
> 
> That is it.
> 
> > > > > > +    spin_unlock(&xen_pt_lock);
> > > > > 
> > > > > Does this really need to come after fence and flush?
> > > > I think yes, as page table should be updated only by 1 CPU at
> > > > the
> > > > same
> > > > time. And before give ability to other CPU to update page table
> > > > we
> > > > have
> > > > to finish a work on current CPU.
> > > 
> > > Can you then explain to me, perhaps by way of an example, what
> > > will
> > > go
> > > wrong if the unlock is ahead of the flush? (I'm less certain
> > > about
> > > the
> > > fence, and that's also less expensive.)
> > pt_update() will be called for interleaved region, for example, by
> > different CPUs:
> > 
> >                      pt_update():
> > CPU1:                                    CPU2:
> >  ...                                spin_lock(&xen_pt_lock);
> > RISCV_FENCE(rw, rw);                 ....
> > 
> > /* After this function will be
> >    executed the following thing
> >    can happen ------------------>  start to update page table
> > */                                 entries which was partially     
> > spin_unlock(&xen_pt_lock);         created during CPU1 but
> > CPU2       
> > ....                               doesn't know about them
> > yet        
> > ....                               because flush_tlb() ( sfence.vma
> > ) 
> > ....                               wasn't done      
> > ....                                                               
> >    
> > flush_tlb_range_va();
> 
> Not exactly: CPU2 knows about them as far as the memory used /
> modified
> goes, and that's all that matters for further page table
> modifications.
> CPU2 only doesn't know about the new page table entries yet when it
> comes
> to using them for a translation (by the hardware page walker). Yet
> this
> aspect is irrelevant here, if I'm not mistaken.
And it isn't an issue that CPU2 will add these new page table entries
again during execution of CPU2's pt_update()?

~ Oleksii


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info
  2024-08-15 13:29             ` oleksii.kurochko
@ 2024-08-15 15:24               ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2024-08-15 15:24 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 15.08.2024 15:29, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-08-15 at 11:02 +0200, Jan Beulich wrote:
>> On 15.08.2024 10:55, oleksii.kurochko@gmail.com wrote:
>>> On Wed, 2024-08-14 at 17:22 +0200, Jan Beulich wrote:
>>>> On 14.08.2024 16:45, oleksii.kurochko@gmail.com wrote:
>>>>> On Tue, 2024-08-13 at 10:54 +0200, Jan Beulich wrote:
>>>>>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>>>>>> --- /dev/null
>>>>>>> +++ b/xen/arch/riscv/smp.c
>>>>>>> @@ -0,0 +1,4 @@
>>>>>>> +#include <xen/smp.h>
>>>>>>> +
>>>>>>> +/* tp points to one of these per cpu */
>>>>>>> +struct pcpu_info pcpu_info[NR_CPUS];
>>>>>>
>>>>>> And they all need setting up statically? Is there a plan to
>>>>>> make
>>>>>> this
>>>>>> dynamic (which could be recorded in a "fixme" in the
>>>>>> comment)?
>>>>> I didn't plan to make allocation of this array dynamic. I don't
>>>>> expect
>>>>> that NR_CPUS will be big.
>>>>
>>>> What is this expectation of yours based on? Other architectures
>>>> permit
>>>> systems with hundreds or even thousands of CPUs; why would RISC-V
>>>> be
>>>> different there?
>>> Based on available dev boards. ( what isn't really strong argument
>>> )
>>>
>>> I checked other architectures and they are using static allocation
>>> too:
>>>    struct cpuinfo_x86 cpu_data[NR_CPUS];
>>>    
>>>    u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
>>>    	{ [0 ... NR_CPUS-1] = BAD_APICID };
>>>    
>>>    ... /* Arm */
>>>    
>>>    struct cpuinfo_arm cpu_data[NR_CPUS];
>>>
>>> I wanted to check to understand which one API should be used to
>>> allocate this array dynamically. xmalloc?
>>
>> As of a few days ago xvmalloc() (or friends thereof), as long as ...
>>
>>> And I am curious how I can use xmalloc() at this stage if page
>>> allocator (?) should be initialized what isn't true now.
>>
>> ... this happens late enough in the boot process. Indeed ...
>>
>>> Or just to allocate pcpu_info only for boot cpu and for other then
>>> use
>>> xmalloc()?
>>
>> ... statically allocating space for the boot CPU only is another
>> option.
>>
>>>>> I can add "fixme" but I am not really
>>>>> understand what will be advantages if pcpu_info[] will be
>>>>> allocated
>>>>> dynamically.
>>>>
>>>> Where possible it's better to avoid static allocations, of which
>>>> on
>>>> some systems only a very small part may be used. Even if you put
>>>> yourself
>>>> on the position that many take - memory being cheap - you then
>>>> still
>>>> waste cache and TLB bandwidth. Furthermore as long as struct
>>>> pcpu_info
>>>> isn't big enough (and properly aligned) for two successive array
>>>> entries
>>>> to not share cache lines, you may end up playing cacheline ping-
>>>> pong
>>>> when a CPU writes to its own array slot.
>>> Why the mentioned issues aren't work for dynamic memory? We still
>>> allocating memory for sizeof(pcpu_info) * NR_CPUS
>>
>> Why NR_CPUS? At runtime you know how may CPUs the system has you're
>> running on. You only need to allocate as much then. Just like e.g.
>> dynamically allocated CPU mask variables (cpumask_var_t) deliberately
>> use less than NR_CPUS bits unless on really big iron.
> I thought that NR_CPUS tells me how many CPU the system has.

Oh, no, that not what it says (and it really can't, being a compile time
constant) - it says how many CPUs the specific build of Xen is going to
support at most.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-15 13:34             ` oleksii.kurochko
@ 2024-08-15 15:26               ` Jan Beulich
  2024-08-16  9:09                 ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-15 15:26 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 15.08.2024 15:34, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-08-15 at 14:16 +0200, Jan Beulich wrote:
>> On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
>>> On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
>>>> On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
>>>>> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>>>>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>>>>>> RISC-V detects superpages using `pte.x` and `pte.r`, as
>>>>>>> there
>>>>>>> is no specific bit in the PTE for this purpose. From the
>>>>>>> RISC-V
>>>>>>> spec:
>>>>>>> ```
>>>>>>>   ...
>>>>>>>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x =
>>>>>>> 1, go
>>>>>>> to
>>>>>>> step 5.
>>>>>>>      Otherwise, this PTE is a pointer to the next level of
>>>>>>> the
>>>>>>> page
>>>>>>> table.
>>>>>>>      ... .
>>>>>>>   5. A leaf PTE has been found.
>>>>>>>      ...
>>>>>>>   ...
>>>>>>> ```
>>>>>>>
>>>>>>> The code doesn’t support super page shattering so 4KB pages
>>>>>>> are
>>>>>>> used as
>>>>>>> default.
>>>>>>
>>>>>> ... you have this. Yet still callers expecting re-mapping in
>>>>>> the
>>>>>> (large)
>>>>>> range they map can request small-page mappings right away.
>>>>> I am not sure that I fully understand what do you mean by
>>>>> "expcting
>>>>> re-
>>>>> mapping".
>>>>
>>>> Right now you have callers pass PTE_BLOCK when they know that no
>>>> small
>>>> page re-mappings are going to occur for an area. What I'm
>>>> suggesting
>>>> is
>>>> that you invert this logic: Have callers pass PTE_SMALL when
>>>> there is
>>>> a possibility that re-mapping requests may be issued later. Then,
>>>> later, by simply grep-ing for PTE_SMALL you'll be able to easily
>>>> find
>>>> all candidates that possibly can be relaxed when super-page
>>>> shattering
>>>> starts being supported. That's going to be easier than finding
>>>> all
>>>> instances where PTE_BLOCK is _not_used.
>>> So if I understand correctly. Actually nothing will change in
>>> algorithm
>>> of pt_update() and only PTE_SMALL should be introduced instead of
>>> PTE_BLOCK. And if I will know that something will be better to map
>>> as
>>> PTE_SMALL to not face shattering in case of unmap (for example) I
>>> just
>>> can map this memory as PTE_SMALL and that is it?
>>
>> That is it.
>>
>>>>>>> +    spin_unlock(&xen_pt_lock);
>>>>>>
>>>>>> Does this really need to come after fence and flush?
>>>>> I think yes, as page table should be updated only by 1 CPU at
>>>>> the
>>>>> same
>>>>> time. And before give ability to other CPU to update page table
>>>>> we
>>>>> have
>>>>> to finish a work on current CPU.
>>>>
>>>> Can you then explain to me, perhaps by way of an example, what
>>>> will
>>>> go
>>>> wrong if the unlock is ahead of the flush? (I'm less certain
>>>> about
>>>> the
>>>> fence, and that's also less expensive.)
>>> pt_update() will be called for interleaved region, for example, by
>>> different CPUs:
>>>
>>>                      pt_update():
>>> CPU1:                                    CPU2:
>>>  ...                                spin_lock(&xen_pt_lock);
>>> RISCV_FENCE(rw, rw);                 ....
>>>
>>> /* After this function will be
>>>    executed the following thing
>>>    can happen ------------------>  start to update page table
>>> */                                 entries which was partially     
>>> spin_unlock(&xen_pt_lock);         created during CPU1 but
>>> CPU2       
>>> ....                               doesn't know about them
>>> yet        
>>> ....                               because flush_tlb() ( sfence.vma
>>> ) 
>>> ....                               wasn't done      
>>> ....                                                               
>>>    
>>> flush_tlb_range_va();
>>
>> Not exactly: CPU2 knows about them as far as the memory used /
>> modified
>> goes, and that's all that matters for further page table
>> modifications.
>> CPU2 only doesn't know about the new page table entries yet when it
>> comes
>> to using them for a translation (by the hardware page walker). Yet
>> this
>> aspect is irrelevant here, if I'm not mistaken.
> And it isn't an issue that CPU2 will add these new page table entries
> again during execution of CPU2's pt_update()?

Add these page table entries again? That's only going to happen due to
another bug somewhere, I suppose. And it would be as much (or as little)
of an issue if that happened right after dropping the lock.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-15 15:26               ` Jan Beulich
@ 2024-08-16  9:09                 ` oleksii.kurochko
  0 siblings, 0 replies; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-16  9:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Thu, 2024-08-15 at 17:26 +0200, Jan Beulich wrote:
> On 15.08.2024 15:34, oleksii.kurochko@gmail.com wrote:
> > On Thu, 2024-08-15 at 14:16 +0200, Jan Beulich wrote:
> > > On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
> > > > On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
> > > > > On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> > > > > > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > > > > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > > > > > RISC-V detects superpages using `pte.x` and `pte.r`, as
> > > > > > > > there
> > > > > > > > is no specific bit in the PTE for this purpose. From
> > > > > > > > the
> > > > > > > > RISC-V
> > > > > > > > spec:
> > > > > > > > ```
> > > > > > > >   ...
> > > > > > > >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x
> > > > > > > > =
> > > > > > > > 1, go
> > > > > > > > to
> > > > > > > > step 5.
> > > > > > > >      Otherwise, this PTE is a pointer to the next level
> > > > > > > > of
> > > > > > > > the
> > > > > > > > page
> > > > > > > > table.
> > > > > > > >      ... .
> > > > > > > >   5. A leaf PTE has been found.
> > > > > > > >      ...
> > > > > > > >   ...
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > The code doesn’t support super page shattering so 4KB
> > > > > > > > pages
> > > > > > > > are
> > > > > > > > used as
> > > > > > > > default.
> > > > > > > 
> > > > > > > ... you have this. Yet still callers expecting re-mapping
> > > > > > > in
> > > > > > > the
> > > > > > > (large)
> > > > > > > range they map can request small-page mappings right
> > > > > > > away.
> > > > > > I am not sure that I fully understand what do you mean by
> > > > > > "expcting
> > > > > > re-
> > > > > > mapping".
> > > > > 
> > > > > Right now you have callers pass PTE_BLOCK when they know that
> > > > > no
> > > > > small
> > > > > page re-mappings are going to occur for an area. What I'm
> > > > > suggesting
> > > > > is
> > > > > that you invert this logic: Have callers pass PTE_SMALL when
> > > > > there is
> > > > > a possibility that re-mapping requests may be issued later.
> > > > > Then,
> > > > > later, by simply grep-ing for PTE_SMALL you'll be able to
> > > > > easily
> > > > > find
> > > > > all candidates that possibly can be relaxed when super-page
> > > > > shattering
> > > > > starts being supported. That's going to be easier than
> > > > > finding
> > > > > all
> > > > > instances where PTE_BLOCK is _not_used.
> > > > So if I understand correctly. Actually nothing will change in
> > > > algorithm
> > > > of pt_update() and only PTE_SMALL should be introduced instead
> > > > of
> > > > PTE_BLOCK. And if I will know that something will be better to
> > > > map
> > > > as
> > > > PTE_SMALL to not face shattering in case of unmap (for example)
> > > > I
> > > > just
> > > > can map this memory as PTE_SMALL and that is it?
> > > 
> > > That is it.
> > > 
> > > > > > > > +    spin_unlock(&xen_pt_lock);
> > > > > > > 
> > > > > > > Does this really need to come after fence and flush?
> > > > > > I think yes, as page table should be updated only by 1 CPU
> > > > > > at
> > > > > > the
> > > > > > same
> > > > > > time. And before give ability to other CPU to update page
> > > > > > table
> > > > > > we
> > > > > > have
> > > > > > to finish a work on current CPU.
> > > > > 
> > > > > Can you then explain to me, perhaps by way of an example,
> > > > > what
> > > > > will
> > > > > go
> > > > > wrong if the unlock is ahead of the flush? (I'm less certain
> > > > > about
> > > > > the
> > > > > fence, and that's also less expensive.)
> > > > pt_update() will be called for interleaved region, for example,
> > > > by
> > > > different CPUs:
> > > > 
> > > >                      pt_update():
> > > > CPU1:                                    CPU2:
> > > >  ...                                spin_lock(&xen_pt_lock);
> > > > RISCV_FENCE(rw, rw);                 ....
> > > > 
> > > > /* After this function will be
> > > >    executed the following thing
> > > >    can happen ------------------>  start to update page table
> > > > */                                 entries which was
> > > > partially     
> > > > spin_unlock(&xen_pt_lock);         created during CPU1 but
> > > > CPU2       
> > > > ....                               doesn't know about them
> > > > yet        
> > > > ....                               because flush_tlb() (
> > > > sfence.vma
> > > > ) 
> > > > ....                               wasn't done      
> > > > ....                                                           
> > > >     
> > > >    
> > > > flush_tlb_range_va();
> > > 
> > > Not exactly: CPU2 knows about them as far as the memory used /
> > > modified
> > > goes, and that's all that matters for further page table
> > > modifications.
> > > CPU2 only doesn't know about the new page table entries yet when
> > > it
> > > comes
> > > to using them for a translation (by the hardware page walker).
> > > Yet
> > > this
> > > aspect is irrelevant here, if I'm not mistaken.
> > And it isn't an issue that CPU2 will add these new page table
> > entries
> > again during execution of CPU2's pt_update()?
> 
> Add these page table entries again? That's only going to happen due
> to
> another bug somewhere, I suppose. And it would be as much (or as
> little)
> of an issue if that happened right after dropping the lock.
Yes, agree, it sounds more like a bug. Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-13  9:34   ` Jan Beulich
  2024-08-14 15:41     ` oleksii.kurochko
@ 2024-08-16 12:06     ` oleksii.kurochko
  2024-08-16 12:32       ` Jan Beulich
  1 sibling, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-16 12:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-13 at 11:34 +0200, Jan Beulich wrote:
> >   
> > +static unsigned long sbi_spec_version = SBI_SPEC_VERSION_DEFAULT;
> > +static unsigned long sbi_fw_id, sbi_fw_version;
> 
> __ro_after_init for perhaps all three of them?
> 
> Considering SBI_SPEC_VERSION_{MAJOR,MINOR}_MASK, at least the first
> of them also doesn't need to be unsigned long, but could be unsigned
> int?

sbi_spec_version can be really unsigned int as according to the spec
only 32 bits are used:
```
   struct sbiret sbi_get_spec_version(void);
   Returns the current SBI specification version. This function must
   always succeed. The minor number
   of the SBI specification is encoded in the low 24 bits, with the
   major number encoded in the next 7
   bits. Bit 31 must be 0 and is reserved for future expansion.
```

For sbi_fw_id, sbi_fw_version it is not mentioned the same thing, so we
can just assume ( despite of the fact that now this values are very
small and it is unlikely to be bigger the UINT_MAX ) that it will be
always fit in uint32_t.

But I think it would be better to leave unsigned long for everyone as
according to the specification this functions returns sbiret structure
which consist of 2 longs ( error and value ) and it is good idea to
follow the specification.

~ Oleksii

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  2024-08-16 12:06     ` oleksii.kurochko
@ 2024-08-16 12:32       ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2024-08-16 12:32 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 16.08.2024 14:06, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 11:34 +0200, Jan Beulich wrote:
>>>   
>>> +static unsigned long sbi_spec_version = SBI_SPEC_VERSION_DEFAULT;
>>> +static unsigned long sbi_fw_id, sbi_fw_version;
>>
>> __ro_after_init for perhaps all three of them?
>>
>> Considering SBI_SPEC_VERSION_{MAJOR,MINOR}_MASK, at least the first
>> of them also doesn't need to be unsigned long, but could be unsigned
>> int?
> 
> sbi_spec_version can be really unsigned int as according to the spec
> only 32 bits are used:
> ```
>    struct sbiret sbi_get_spec_version(void);
>    Returns the current SBI specification version. This function must
>    always succeed. The minor number
>    of the SBI specification is encoded in the low 24 bits, with the
>    major number encoded in the next 7
>    bits. Bit 31 must be 0 and is reserved for future expansion.
> ```
> 
> For sbi_fw_id, sbi_fw_version it is not mentioned the same thing, so we
> can just assume ( despite of the fact that now this values are very
> small and it is unlikely to be bigger the UINT_MAX ) that it will be
> always fit in uint32_t.
> 
> But I think it would be better to leave unsigned long for everyone as
> according to the specification this functions returns sbiret structure
> which consist of 2 longs ( error and value ) and it is good idea to
> follow the specification.

Okay with me (for these two) then.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-13 10:31   ` Jan Beulich
  2024-08-14 16:50     ` oleksii.kurochko
@ 2024-08-20 13:18     ` oleksii.kurochko
  2024-08-20 13:47       ` Jan Beulich
  1 sibling, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-20 13:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > + * Sanity check of the entry
> > + * mfn is not valid and we are not populating page table. This
> > means
> 
> How does this fit with ...
> 
> > + * we either modify entry or remove an entry.
> > + */
> > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
> > flags)
> > +{
> > +    /* Sanity check when modifying an entry. */
> > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> 
> ... the MFN check here? And why is (valid,INVALID_MFN) an indication
> of a modification?
Because as mentioned here:
```
/*
 * If `mfn` equals `INVALID_MFN`, it indicates that the following page
table
 * update operation might be related to either populating the table,
 * destroying a mapping, or modifying an existing mapping.
 */
static int pt_update(unsigned long virt,
```
And so if requested flags are PTE_VALID ( present ) and mfn=INVALID it
will mean that we are going to modify an entry.


> But then ...
> 
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> > +            return false;
> > +        }
> 
> ... I also don't understand what this is about. IOW I'm afraid I'm
> still confused about the purpose of this function as well as the
> transitions you want to permit / reject. 
In the case if the caller call modify_xen_mappings() on a region that
doesn't exist.

> I wonder whether the
> flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
I am not sure that I understand what you mean.


> 
> > +/* Update an entry at the level @target. */
> > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > +                           mfn_t mfn, unsigned int target,
> > +                           unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page tables are read-only when the MFN is
> > not valid
> > +     * and we are not populating page table.
> 
> The way flags are handled in PTEs, how can page tables be read-only?
This is not needed for everyone case. In case of entry removing an
intermediate page table should be created in case when the user is
trying to remove a mapping that doesn't exist.


> 
> > +     * This means we either modify permissions or remove an entry.
> 
> From all I can determine we also get here when making brand new
> entries.
> What am I overlooking?
Yes, but in this case an intermediate page tables should be read_only,
so alloc_only will be true and it will be allowed to create new
intermediate page table.


> > +    return rc;
> > +}
> > +
> > +int map_pages_to_xen(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    /*
> > +     * Ensure that we have a valid MFN before proceeding.
> > +     *
> > +     * If the MFN is invalid, pt_update() might misinterpret the
> > operation,
> > +     * treating it as either a population, a mapping destruction,
> > +     * or a mapping modification.
> > +     */
> > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> 
> But flags with PTE_VALID not set are fine to come into here?
Probably not, I will double check again and if it is not okay, I will
update the ASSERT.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-20 13:18     ` oleksii.kurochko
@ 2024-08-20 13:47       ` Jan Beulich
  2024-08-20 14:42         ` oleksii.kurochko
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2024-08-20 13:47 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 20.08.2024 15:18, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>> + * Sanity check of the entry
>>> + * mfn is not valid and we are not populating page table. This
>>> means
>>
>> How does this fit with ...
>>
>>> + * we either modify entry or remove an entry.
>>> + */
>>> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
>>> flags)
>>> +{
>>> +    /* Sanity check when modifying an entry. */
>>> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>>
>> ... the MFN check here? And why is (valid,INVALID_MFN) an indication
>> of a modification?
> Because as mentioned here:
> ```
> /*
>  * If `mfn` equals `INVALID_MFN`, it indicates that the following page
> table
>  * update operation might be related to either populating the table,
>  * destroying a mapping, or modifying an existing mapping.
>  */
> static int pt_update(unsigned long virt,
> ```

That's in the description of another function. How would one know that
the rules on (mfn,flags) tuples there would apply here as well, without
you saying so explicitly? It may not be necessary to repeat the other
comment, but at least you want to reference it.

> And so if requested flags are PTE_VALID ( present ) and mfn=INVALID it
> will mean that we are going to modify an entry.
> 
> 
>> But then ...
>>
>>> +    {
>>> +        /* We don't allow modifying an invalid entry. */
>>> +        if ( !pte_is_valid(entry) )
>>> +        {
>>> +            printk("Modifying invalid entry is not allowed.\n");
>>> +            return false;
>>> +        }
>>
>> ... I also don't understand what this is about. IOW I'm afraid I'm
>> still confused about the purpose of this function as well as the
>> transitions you want to permit / reject. 
> In the case if the caller call modify_xen_mappings() on a region that
> doesn't exist.

Perhaps. What I think is missing is a clear statement somewhere to describe
what the various combinations of (mfn,flags) mean, in terms of the operation
to be carried out. This may then also help with ...

>> I wonder whether the
>> flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
> I am not sure that I understand what you mean.

... this: It's hard to see what cannot be understood about my earlier
comment. In the code commented on you have a flags & PTE_VALID check and a
pte_is_valid(entry) one. I'm wondering whether the two simply are the wrong
way round.

>>> +/* Update an entry at the level @target. */
>>> +static int pt_update_entry(mfn_t root, unsigned long virt,
>>> +                           mfn_t mfn, unsigned int target,
>>> +                           unsigned int flags)
>>> +{
>>> +    int rc;
>>> +    unsigned int level = HYP_PT_ROOT_LEVEL;
>>> +    pte_t *table;
>>> +    /*
>>> +     * The intermediate page tables are read-only when the MFN is
>>> not valid
>>> +     * and we are not populating page table.
>>
>> The way flags are handled in PTEs, how can page tables be read-only?
> This is not needed for everyone case. In case of entry removing an
> intermediate page table should be created in case when the user is
> trying to remove a mapping that doesn't exist.

I don't follow: For one, how is this related to "read-only"-ness? And
then, why would any kind of removal, whether of a present or non-
present mapping, ever result in page tables being created?

>>> +     * This means we either modify permissions or remove an entry.
>>
>> From all I can determine we also get here when making brand new
>> entries.
>> What am I overlooking?
> Yes, but in this case an intermediate page tables should be read_only,
> so alloc_only will be true and it will be allowed to create new
> intermediate page table.

Hmm, so instead of "read-only" do you maybe mean page tables are not
supposed to be modified? There's a difference here: When they're
read-only, you can't write to them (or a fault will result). Whereas
when in principle they can be modified, there still may be a rule
saying "in this case they shouldn't be altered".

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-20 13:47       ` Jan Beulich
@ 2024-08-20 14:42         ` oleksii.kurochko
  2024-08-20 16:30           ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: oleksii.kurochko @ 2024-08-20 14:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, 2024-08-20 at 15:47 +0200, Jan Beulich wrote:
> On 20.08.2024 15:18, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > > + * Sanity check of the entry
> > > > + * mfn is not valid and we are not populating page table. This
> > > > means
> > > 
> > > How does this fit with ...
> > > 
> > > > + * we either modify entry or remove an entry.
> > > > + */
> > > > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned
> > > > int
> > > > flags)
> > > > +{
> > > > +    /* Sanity check when modifying an entry. */
> > > > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> > > 
> > > ... the MFN check here? And why is (valid,INVALID_MFN) an
> > > indication
> > > of a modification?
> > Because as mentioned here:
> > ```
> > /*
> >  * If `mfn` equals `INVALID_MFN`, it indicates that the following
> > page
> > table
> >  * update operation might be related to either populating the
> > table,
> >  * destroying a mapping, or modifying an existing mapping.
> >  */
> > static int pt_update(unsigned long virt,
> > ```
> 
> That's in the description of another function. How would one know
> that
> the rules on (mfn,flags) tuples there would apply here as well,
> without
> you saying so explicitly? It may not be necessary to repeat the other
> comment, but at least you want to reference it.
> 
> > And so if requested flags are PTE_VALID ( present ) and mfn=INVALID
> > it
> > will mean that we are going to modify an entry.
> > 
> > 
> > > But then ...
> > > 
> > > > +    {
> > > > +        /* We don't allow modifying an invalid entry. */
> > > > +        if ( !pte_is_valid(entry) )
> > > > +        {
> > > > +            printk("Modifying invalid entry is not
> > > > allowed.\n");
> > > > +            return false;
> > > > +        }
> > > 
> > > ... I also don't understand what this is about. IOW I'm afraid
> > > I'm
> > > still confused about the purpose of this function as well as the
> > > transitions you want to permit / reject. 
> > In the case if the caller call modify_xen_mappings() on a region
> > that
> > doesn't exist.
> 
> Perhaps. What I think is missing is a clear statement somewhere to
> describe
> what the various combinations of (mfn,flags) mean, in terms of the
> operation
> to be carried out. This may then also help with ...
> 
> > > I wonder whether the
> > > flags & PTE_VALID and pte_is_valid(entry) aren't in need of
> > > swapping.
> > I am not sure that I understand what you mean.
> 
> ... this: It's hard to see what cannot be understood about my earlier
> comment. In the code commented on you have a flags & PTE_VALID check
> and a
> pte_is_valid(entry) one. I'm wondering whether the two simply are the
> wrong
> way round.
Sure. I'll add additional comments and reference in the next patch
version to clarify that moment.

> 
> > > > +/* Update an entry at the level @target. */
> > > > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > > > +                           mfn_t mfn, unsigned int target,
> > > > +                           unsigned int flags)
> > > > +{
> > > > +    int rc;
> > > > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > > > +    pte_t *table;
> > > > +    /*
> > > > +     * The intermediate page tables are read-only when the MFN
> > > > is
> > > > not valid
> > > > +     * and we are not populating page table.
> > > 
> > > The way flags are handled in PTEs, how can page tables be read-
> > > only?
> > This is not needed for everyone case. In case of entry removing an
> > intermediate page table should be created in case when the user is
> > trying to remove a mapping that doesn't exist.
> 
> I don't follow: For one, how is this related to "read-only"-ness? And
> then, why would any kind of removal, whether of a present or non-
> present mapping, ever result in page tables being created?
If the mapping doesn't exist and it was requested ( accidentally by the
caller ) then then the logic of PT update will try to allocate the page
table what is actually a bogus behaviour... I have to double-check
that.

> 
> > > > +     * This means we either modify permissions or remove an
> > > > entry.
> > > 
> > > From all I can determine we also get here when making brand new
> > > entries.
> > > What am I overlooking?
> > Yes, but in this case an intermediate page tables should be
> > read_only,
> > so alloc_only will be true and it will be allowed to create new
> > intermediate page table.
> 
> Hmm, so instead of "read-only" do you maybe mean page tables are not
> supposed to be modified? There's a difference here: When they're
> read-only, you can't write to them (or a fault will result). Whereas
> when in principle they can be modified, there still may be a rule
> saying "in this case they shouldn't be altered".

There is such rule which checks that page tables aren't supposed to be
modified ( so that is why they are read-only ):
```
    /* Sanity check when modifying an entry. */
    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
    {
	...

        /* We don't allow modifying a table entry */
        if ( pte_is_table(entry) )
        {
            printk("Modifying a table entry is not allowed.\n");
            return false;
        }
```

~ Oleksii


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/7] xen/riscv: page table handling
  2024-08-20 14:42         ` oleksii.kurochko
@ 2024-08-20 16:30           ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2024-08-20 16:30 UTC (permalink / raw)
  To: oleksii.kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On 20.08.2024 16:42, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-20 at 15:47 +0200, Jan Beulich wrote:
>> On 20.08.2024 15:18, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>>> From all I can determine we also get here when making brand new
>>>> entries.
>>>> What am I overlooking?
>>> Yes, but in this case an intermediate page tables should be
>>> read_only,
>>> so alloc_only will be true and it will be allowed to create new
>>> intermediate page table.
>>
>> Hmm, so instead of "read-only" do you maybe mean page tables are not
>> supposed to be modified? There's a difference here: When they're
>> read-only, you can't write to them (or a fault will result). Whereas
>> when in principle they can be modified, there still may be a rule
>> saying "in this case they shouldn't be altered".
> 
> There is such rule which checks that page tables aren't supposed to be
> modified ( so that is why they are read-only ):

Hmm, you're saying "read-only" again in reply to me explaining that this
isn't the correct term here. I find this increasingly confusing.

Jan

> ```
>     /* Sanity check when modifying an entry. */
>     if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>     {
> 	...
> 
>         /* We don't allow modifying a table entry */
>         if ( pte_is_table(entry) )
>         {
>             printk("Modifying a table entry is not allowed.\n");
>             return false;
>         }
> ```
> 
> ~ Oleksii



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2024-08-20 16:30 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-09 16:19 [PATCH v4 0/7] RISCV device tree mapping Oleksii Kurochko
2024-08-09 16:19 ` [PATCH v4 1/7] xen/riscv: enable CONFIG_HAS_DEVICE_TREE Oleksii Kurochko
2024-08-09 16:19 ` [PATCH v4 2/7] xen/riscv: set up fixmap mappings Oleksii Kurochko
2024-08-13  8:22   ` Jan Beulich
2024-08-14 14:21     ` oleksii.kurochko
2024-08-14 15:08       ` Jan Beulich
2024-08-15  8:16         ` oleksii.kurochko
2024-08-09 16:19 ` [PATCH v4 3/7] xen/riscv: introduce asm/pmap.h header Oleksii Kurochko
2024-08-13  8:41   ` Jan Beulich
2024-08-09 16:19 ` [PATCH v4 4/7] xen/riscv: introduce functionality to work with CPU info Oleksii Kurochko
2024-08-13  8:54   ` Jan Beulich
2024-08-14 14:45     ` oleksii.kurochko
2024-08-14 15:22       ` Jan Beulich
2024-08-15  8:55         ` oleksii.kurochko
2024-08-15  9:02           ` Jan Beulich
2024-08-15 13:29             ` oleksii.kurochko
2024-08-15 15:24               ` Jan Beulich
2024-08-09 16:19 ` [PATCH v4 5/7] xen/riscv: introduce and initialize SBI RFENCE extension Oleksii Kurochko
2024-08-13  9:34   ` Jan Beulich
2024-08-14 15:41     ` oleksii.kurochko
2024-08-14 15:53       ` Jan Beulich
2024-08-15 10:00         ` oleksii.kurochko
2024-08-15 10:32           ` Jan Beulich
2024-08-16 12:06     ` oleksii.kurochko
2024-08-16 12:32       ` Jan Beulich
2024-08-09 16:19 ` [PATCH v4 6/7] xen/riscv: page table handling Oleksii Kurochko
2024-08-13 10:31   ` Jan Beulich
2024-08-14 16:50     ` oleksii.kurochko
2024-08-15  8:09       ` Jan Beulich
2024-08-15 11:21         ` oleksii.kurochko
2024-08-15 12:16           ` Jan Beulich
2024-08-15 13:34             ` oleksii.kurochko
2024-08-15 15:26               ` Jan Beulich
2024-08-16  9:09                 ` oleksii.kurochko
2024-08-20 13:18     ` oleksii.kurochko
2024-08-20 13:47       ` Jan Beulich
2024-08-20 14:42         ` oleksii.kurochko
2024-08-20 16:30           ` Jan Beulich
2024-08-09 16:19 ` [PATCH v4 7/7] xen/riscv: introduce early_fdt_map() Oleksii Kurochko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.