qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/6] Rework x86 page table walks
@ 2024-06-06 14:02 Don Porter
  2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Don Porter @ 2024-06-06 14:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson, philmd,
	Don Porter

This version of the 'info pg' command adopts Peter Maydell's request
to write guest-agnostic page table iterator and accessor code, along
with architecture-specific hooks.  The first patch in this series
contributes a generic page table iterator and an x86 instantiation.
As a client, we first introduce an 'info pg' monitor command, as well
as a compressing callback hook for creating succinct page table
representations.

After this, each successive patch replaces an exisitng x86 page table
walker with a use of common iterator code.

I could use advice on how to ensure this is sufficiently well tested.
I used 'make check' and 'make check-avocado', which both pass; what is
the typical standard for testing something like a page table related
change?

As far as generality, I have only attempted this on x86, but I expect
the design would work for any similar radix-tree style page table.

I am still new enough to the code base that I wasn't certain about
where to put the generic code, as well as naming conventions.

Per David Gilbert's suggestion, I was careful to ensure that monitor
calls do not perturb TLB state (see the read-only flag in some
functions).

Version 3 of this patch series moves 'info pg' into common monitor
code and implements the architecture-specific code hooks.  I did not
do this with the 'info mem' and 'info tlb' commands, since they have
implementations on other ISAs.

Don Porter (6):
  Add an "info pg" command that prints the current page tables
  Convert 'info tlb' to use generic iterator
  Convert 'info mem' to use generic iterator
  Convert x86_cpu_get_memory_mapping() to use generic iterators
  Move tcg implementation of x86 get_physical_address into common helper
    code.
  Convert x86_mmu_translate() to use common code.

 hmp-commands-info.hx                 |  13 +
 hw/core/cpu-sysemu.c                 | 140 ++++++
 include/hw/core/cpu.h                |  34 +-
 include/hw/core/sysemu-cpu-ops.h     | 169 +++++++
 include/monitor/hmp-target.h         |   1 +
 include/monitor/monitor.h            |   4 +
 monitor/hmp-cmds-target.c            | 198 ++++++++
 target/i386/arch_memory_mapping.c    | 621 ++++++++++++++---------
 target/i386/cpu.c                    |  12 +
 target/i386/cpu.h                    |  63 +++
 target/i386/helper.c                 | 523 +++++++++++++++----
 target/i386/monitor.c                | 724 +++++++++------------------
 target/i386/tcg/sysemu/excp_helper.c | 555 +-------------------
 13 files changed, 1688 insertions(+), 1369 deletions(-)

--
2.34.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-06 14:02 [PATCH v3 0/6] Rework x86 page table walks Don Porter
@ 2024-06-06 14:02 ` Don Porter
  2024-06-07  6:09   ` Philippe Mathieu-Daudé
                     ` (3 more replies)
  2024-06-06 14:02 ` [PATCH v3 2/6] Convert 'info tlb' to use generic iterator Don Porter
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 22+ messages in thread
From: Don Porter @ 2024-06-06 14:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson, philmd,
	Don Porter

The new "info pg" monitor command prints the current page table,
including virtual address ranges, flag bits, and snippets of physical
page numbers.  Completely filled regions of the page table with
compatible flags are "folded", with the result that the complete
output for a freshly booted x86-64 Linux VM can fit in a single
terminal window.  The output looks like this:

VPN range             Entry         Flags            Physical page
[7f0000000-7f0000000] PML4[0fe]     ---DA--UWP
  [7f28c0000-7f28fffff]  PDP[0a3]     ---DA--UWP
    [7f28c4600-7f28c47ff]  PDE[023]     ---DA--UWP
      [7f28c4655-7f28c4656]  PTE[055-056] X--D---U-P 0000007f14-0000007f15
      [7f28c465b-7f28c465b]  PTE[05b]     ----A--U-P 0000001cfc
...
[ff8000000-ff8000000] PML4[1ff]     ---DA--UWP
  [ffff80000-ffffbffff]  PDP[1fe]     ---DA---WP
    [ffff81000-ffff81dff]  PDE[008-00e] -GSDA---WP 0000001000-0000001dff
  [ffffc0000-fffffffff]  PDP[1ff]     ---DA--UWP
    [ffffff400-ffffff5ff]  PDE[1fa]     ---DA--UWP
      [ffffff5fb-ffffff5fc]  PTE[1fb-1fc] XG-DACT-WP 00000fec00 00000fee00
    [ffffff600-ffffff7ff]  PDE[1fb]     ---DA--UWP
      [ffffff600-ffffff600]  PTE[000]     -G-DA--U-P 0000001467

This draws heavy inspiration from Austin Clements' original patch.

This also adds a generic page table walker, which other monitor
and execution commands will be migrated to in subsequent patches.

Signed-off-by: Don Porter <porter@cs.unc.edu>
---
 hmp-commands-info.hx              |  13 ++
 hw/core/cpu-sysemu.c              | 140 ++++++++++++
 include/hw/core/cpu.h             |  34 ++-
 include/hw/core/sysemu-cpu-ops.h  | 156 +++++++++++++
 include/monitor/hmp-target.h      |   1 +
 monitor/hmp-cmds-target.c         | 198 +++++++++++++++++
 target/i386/arch_memory_mapping.c | 351 +++++++++++++++++++++++++++++-
 target/i386/cpu.c                 |  11 +
 target/i386/cpu.h                 |  15 ++
 target/i386/monitor.c             | 165 ++++++++++++++
 10 files changed, 1082 insertions(+), 2 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 20a9835ea8..a873841920 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -242,6 +242,19 @@ SRST
     Show memory tree.
 ERST
 
+    {
+        .name       = "pg",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show the page table",
+        .cmd        = hmp_info_pg,
+    },
+
+SRST
+  ``info pg``
+    Show the active page table.
+ERST
+
 #if defined(CONFIG_TCG)
     {
         .name       = "jit",
diff --git a/hw/core/cpu-sysemu.c b/hw/core/cpu-sysemu.c
index 2a9a2a4eb5..fd936fa90c 100644
--- a/hw/core/cpu-sysemu.c
+++ b/hw/core/cpu-sysemu.c
@@ -142,3 +142,143 @@ GuestPanicInformation *cpu_get_crash_info(CPUState *cpu)
     }
     return res;
 }
+
+/**
+ * _for_each_pte - recursive helper function
+ *
+ * @cs - CPU state
+ * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
+ *                                     pte.
+ *   * @cs - pass through cs
+ *   * @data - user-provided, opaque pointer
+ *   * @pte - current pte
+ *   * @vaddr_in - virtual address translated by pte
+ *   * @height - height in the tree of pte
+ * @data - user-provided, opaque pointer, passed to fn()
+ * @visit_interior_nodes - if true, call fn() on page table entries in
+ *                         interior nodes.  If false, only call fn() on page
+ *                         table entries in leaves.
+ * @visit_not_present - if true, call fn() on entries that are not present.
+ *                         if false, visit only present entries.
+ * @node - The physical address of the current page table radix tree node
+ * @vaddr_in - The virtual address bits translated in walking the page
+ *          table to node
+ * @height - The height of node in the radix tree
+ *
+ * height starts at the max and counts down.
+ * In a 4 level x86 page table, pml4e is level 4, pdpe is level 3,
+ *  pde is level 2, and pte is level 1
+ *
+ * Returns true on success, false on error.
+ */
+static bool
+_for_each_pte(CPUState *cs,
+              int (*fn)(CPUState *cs, void *data, PTE_t *pte,
+                        vaddr vaddr_in, int height, int offset),
+              void *data, bool visit_interior_nodes,
+              bool visit_not_present, hwaddr node,
+              vaddr vaddr_in, int height)
+{
+    int ptes_per_node;
+    int i;
+
+    assert(height > 0);
+
+    CPUClass *cc = CPU_GET_CLASS(cs);
+
+    if ((!cc->sysemu_ops->page_table_entries_per_node)
+        || (!cc->sysemu_ops->get_pte)
+        || (!cc->sysemu_ops->pte_present)
+        || (!cc->sysemu_ops->pte_leaf)
+        || (!cc->sysemu_ops->pte_child)) {
+        return false;
+    }
+
+    ptes_per_node = cc->sysemu_ops->page_table_entries_per_node(cs, height);
+
+    for (i = 0; i < ptes_per_node; i++) {
+        PTE_t pt_entry;
+        vaddr vaddr_i;
+        bool pte_present;
+
+        cc->sysemu_ops->get_pte(cs, node, i, height, &pt_entry, vaddr_in,
+                                &vaddr_i, NULL);
+        pte_present = cc->sysemu_ops->pte_present(cs, &pt_entry);
+
+        if (pte_present || visit_not_present) {
+            if ((!pte_present) || cc->sysemu_ops->pte_leaf(cs, height,
+                                                           &pt_entry)) {
+                if (fn(cs, data, &pt_entry, vaddr_i, height, i)) {
+                    /* Error */
+                    return false;
+                }
+            } else { /* Non-leaf */
+                if (visit_interior_nodes) {
+                    if (fn(cs, data, &pt_entry, vaddr_i, height, i)) {
+                        /* Error */
+                        return false;
+                    }
+                }
+                hwaddr child = cc->sysemu_ops->pte_child(cs, &pt_entry, height);
+                assert(height > 1);
+                if (!_for_each_pte(cs, fn, data, visit_interior_nodes,
+                                   visit_not_present, child, vaddr_i,
+                                   height - 1)) {
+                    return false;
+                }
+            }
+        }
+    }
+
+    return true;
+}
+
+/**
+ * for_each_pte - iterate over a page table, and
+ *                call fn on each entry
+ *
+ * @cs - CPU state
+ * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
+ *                                     pte.
+ *   * @cs - pass through cs
+ *   * @data - user-provided, opaque pointer
+ *   * @pte - current pte
+ *   * @vaddr - virtual address translated by pte
+ *   * @height - height in the tree of pte
+ * @data - opaque pointer; passed through to fn
+ * @visit_interior_nodes - if true, call fn() on interior entries in
+ *                         page table; if false, visit only leaf entries.
+ * @visit_not_present - if true, call fn() on entries that are not present.
+ *                         if false, visit only present entries.
+ *
+ * Returns true on success, false on error.
+ *
+ */
+bool for_each_pte(CPUState *cs,
+                  int (*fn)(CPUState *cs, void *data, PTE_t *pte,
+                            vaddr vaddr, int height, int offset),
+                  void *data, bool visit_interior_nodes,
+                  bool visit_not_present)
+{
+    int height;
+    vaddr vaddr = 0;
+    hwaddr root;
+    CPUClass *cc = CPU_GET_CLASS(cs);
+
+    if (!cpu_paging_enabled(cs)) {
+        /* paging is disabled */
+        return true;
+    }
+
+    if (!cc->sysemu_ops->page_table_root) {
+        return false;
+    }
+
+    root = cc->sysemu_ops->page_table_root(cs, &height);
+
+    assert(height > 1);
+
+    /* Recursively call a helper to walk the page table */
+    return _for_each_pte(cs, fn, data, visit_interior_nodes, visit_not_present,
+                         root, vaddr, height);
+}
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index a2c8536943..00d7162795 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -671,9 +671,41 @@ int cpu_write_elf32_qemunote(WriteCoreDumpFunction f, CPUState *cpu,
  * Caller is responsible for freeing the data.
  */
 GuestPanicInformation *cpu_get_crash_info(CPUState *cpu);
-
 #endif /* !CONFIG_USER_ONLY */
 
+/* Intended to become a generic PTE type */
+typedef union PTE {
+    uint64_t pte64_t;
+    uint32_t pte32_t;
+} PTE_t;
+
+/**
+ * for_each_pte - iterate over a page table, and
+ *                call fn on each entry
+ *
+ * @cs - CPU state
+ * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
+ *                                     pte.
+ *   * @cs - pass through cs
+ *   * @data - user-provided, opaque pointer
+ *   * @pte - current pte
+ *   * @vaddr - virtual address translated by pte
+ *   * @height - height in the tree of pte
+ * @data - opaque pointer; passed through to fn
+ * @visit_interior_nodes - if true, call fn() on interior entries in
+ *                         page table; if false, visit only leaf entries.
+ * @visit_not_present - if true, call fn() on entries that are not present.
+ *                         if false, visit only present entries.
+ *
+ * Returns true on success, false on error.
+ *
+ */
+bool for_each_pte(CPUState *cs,
+                  int (*fn)(CPUState *cs, void *data, PTE_t *pte, vaddr vaddr,
+                            int height, int offset), void *data,
+                  bool visit_interior_nodes, bool visit_not_present);
+
+
 /**
  * CPUDumpFlags:
  * @CPU_DUMP_CODE:
diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
index 24d003fe04..eb16a1c3e2 100644
--- a/include/hw/core/sysemu-cpu-ops.h
+++ b/include/hw/core/sysemu-cpu-ops.h
@@ -12,6 +12,39 @@
 
 #include "hw/core/cpu.h"
 
+/* Maximum supported page table height - currently x86 at 5 */
+#define MAX_HEIGHT 5
+
+/*
+ * struct mem_print_state: Used by monitor in walking page tables.
+ */
+struct mem_print_state {
+    Monitor *mon;
+    CPUArchState *env;
+    int vaw, paw; /* VA and PA width in characters */
+    int max_height;
+    bool (*flusher)(CPUState *cs, struct mem_print_state *state);
+    bool flush_interior; /* If false, only call flusher() on leaves */
+    bool require_physical_contiguity;
+    /*
+     * The height at which we started accumulating ranges, i.e., the
+     * next height we need to print once we hit the end of a
+     * contiguous range.
+     */
+    int start_height;
+    /*
+     * For compressing contiguous ranges, track the
+     * start and end of the range
+     */
+    hwaddr vstart[MAX_HEIGHT + 1]; /* Starting virt. addr. of open pte range */
+    hwaddr vend[MAX_HEIGHT + 1]; /* Ending virtual address of open pte range */
+    hwaddr pstart; /* Starting physical address of open pte range */
+    hwaddr pend; /* Ending physical address of open pte range */
+    int64_t ent[MAX_HEIGHT + 1]; /* PTE contents on current root->leaf path */
+    int offset[MAX_HEIGHT + 1]; /* PTE range starting offsets */
+    int last_offset[MAX_HEIGHT + 1]; /* PTE range ending offsets */
+};
+
 /*
  * struct SysemuCPUOps: System operations specific to a CPU class
  */
@@ -87,6 +120,129 @@ typedef struct SysemuCPUOps {
      */
     const VMStateDescription *legacy_vmsd;
 
+    /**
+     * page_table_root - Given a CPUState, return the physical address
+     *                    of the current page table root, as well as
+     *                    write the height of the tree into *height.
+     *
+     * @cs - CPU state
+
+     * @height - a pointer to an integer, to store the page table tree
+     *           height
+     *
+     * Returns a hardware address on success.  Should not fail (i.e.,
+     * caller is responsible to ensure that a page table is actually
+     * present).
+     */
+    hwaddr (*page_table_root)(CPUState *cs, int *height);
+
+    /**
+     * page_table_entries_per_node - Return the number of entries in a
+     *                                   page table node for the CPU
+     *                                   at a given height.
+     *
+     * @cs - CPU state
+     * @height - height of the page table tree to query, where the leaves
+     *          are 1.
+     *
+     * Returns a value greater than zero on success, -1 on error.
+     */
+    int (*page_table_entries_per_node)(CPUState *cs, int height);
+
+    /**
+     * get_pte - Copy the contents of the page table entry at node[i]
+     *           into pt_entry.  Optionally, add the relevant bits to
+     *           the virtual address in vaddr_pte.
+     *
+     * @cs - CPU state
+     * @node - physical address of the current page table node
+     * @i - index (in page table entries, not bytes) of the page table
+     *      entry, within node
+     * @height - height of node within the tree (leaves are 1, not 0)
+     * @pt_entry - Pointer to a PTE_t, stores the contents of the page
+     *             table entry
+     * @vaddr_parent - The virtual address bits already translated in
+     *                 walking the page table to node.  Optional: only
+     *                 used if vaddr_pte is set.
+     * @vaddr_pte - Optional pointer to a variable storing the virtual
+     *              address bits translated by node[i].
+     * @pte_paddr - Pointer to the physical address of the PTE within node.
+     *              Optional parameter.
+     */
+
+    void (*get_pte)(CPUState *cs, hwaddr node, int i, int height,
+                    PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
+                    hwaddr *pte_paddr);
+
+    /**
+     * pte_present - Return true if the pte is marked 'present'
+     */
+    bool (*pte_present)(CPUState *cs, PTE_t *pte);
+
+    /**
+     * pte_leaf - Return true if the pte is a page table leaf, false
+     *                if the pte points to another node in the radix
+     *                tree.
+     */
+    bool (*pte_leaf)(CPUState *cs, int height, PTE_t *pte);
+
+    /**
+     * pte_child - Returns the physical address of a radix tree node
+     *                 pointed to by pte.
+     *
+     * @cs - CPU state
+     * @pte - The page table entry
+     * @height - The height in the tree of pte
+     *
+     * Returns the physical address stored in pte on success, -1 on
+     * error.
+     */
+    hwaddr (*pte_child)(CPUState *cs, PTE_t *pte, int height);
+
+    /**
+     * pte_leaf_page_size - Return the page size of a leaf entry,
+     *                          given the height and CPU state
+     *
+     * @cs - CPU state
+     * @height - height of the page table tree to query, where the leaves
+     *          are 1.
+     *
+     * Returns a value greater than zero on success, -1 on error.
+     */
+    uint64_t (*pte_leaf_page_size)(CPUState *cs, int height);
+
+    /**
+     * pte_flags - Return the flag bits of the page table entry.
+     *
+     * @pte - the contents of the PTE, not the address.
+     *
+     * Returns pte with the non-flag bits masked out.
+     */
+    uint64_t (*pte_flags)(uint64_t pte);
+
+    /**
+     * @mon_init_page_table_iterator: Callback to configure a page table
+     * iterator for use by a monitor function.
+     * Returns true on success, false if not supported (e.g., no paging disabled
+     * or not implemented on this CPU).
+     */
+    bool (*mon_init_page_table_iterator)(Monitor *mon,
+                                         struct mem_print_state *state);
+
+    /**
+     * @mon_info_pg_print_header: Prints the header line for 'info pg'.
+     */
+    void (*mon_info_pg_print_header)(Monitor *mon,
+                                     struct mem_print_state *state);
+
+    /**
+     * @flush_page_table_iterator_state: Prints the last entry,
+     * if one is present.  Useful for iterators that aggregate information
+     * across page table entries.
+     */
+    bool (*mon_flush_page_print_state)(CPUState *cs,
+                                       struct mem_print_state *state);
+
 } SysemuCPUOps;
 
 #endif /* SYSEMU_CPU_OPS_H */
diff --git a/include/monitor/hmp-target.h b/include/monitor/hmp-target.h
index b679aaebbf..9af72ea58d 100644
--- a/include/monitor/hmp-target.h
+++ b/include/monitor/hmp-target.h
@@ -50,6 +50,7 @@ CPUState *mon_get_cpu(Monitor *mon);
 void hmp_info_mem(Monitor *mon, const QDict *qdict);
 void hmp_info_tlb(Monitor *mon, const QDict *qdict);
 void hmp_mce(Monitor *mon, const QDict *qdict);
+void hmp_info_pg(Monitor *mon, const QDict *qdict);
 void hmp_info_local_apic(Monitor *mon, const QDict *qdict);
 void hmp_info_sev(Monitor *mon, const QDict *qdict);
 void hmp_info_sgx(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index ff01cf9d8d..60a8bd0c37 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -31,6 +31,7 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qdict.h"
 #include "sysemu/hw_accel.h"
+#include "hw/core/sysemu-cpu-ops.h"
 
 /* Set the current CPU defined by the user. Callers must hold BQL. */
 int monitor_set_cpu(Monitor *mon, int cpu_index)
@@ -120,6 +121,203 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
     }
 }
 
+/* Assume only called on present entries */
+static
+int compressing_iterator(CPUState *cs, void *data, PTE_t *pte,
+                         vaddr vaddr_in, int height, int offset)
+{
+    CPUClass *cc = CPU_GET_CLASS(cs);
+    struct mem_print_state *state = (struct mem_print_state *) data;
+    hwaddr paddr = cc->sysemu_ops->pte_child(cs, pte, height);
+    target_ulong size = cc->sysemu_ops->pte_leaf_page_size(cs, height);
+    bool start_new_run = false, flush = false;
+    bool is_leaf = cc->sysemu_ops->pte_leaf(cs, height, pte);
+
+    int entries_per_node = cc->sysemu_ops->page_table_entries_per_node(cs,
+                                                                       height);
+
+    /* Prot of current pte */
+    int prot = cc->sysemu_ops->pte_flags(pte->pte64_t);
+
+    /* If there is a prior run, first try to extend it. */
+    if (state->start_height != 0) {
+
+        /*
+         * If we aren't flushing interior nodes, raise the start height.
+         * We don't need to detect non-compressible interior nodes.
+         */
+        if ((!state->flush_interior) && state->start_height < height) {
+            state->start_height = height;
+            state->vstart[height] = vaddr_in;
+            state->vend[height] = vaddr_in;
+            state->ent[height] = pte->pte64_t;
+            if (offset == 0) {
+                state->last_offset[height] = entries_per_node - 1;
+            } else {
+                state->last_offset[height] = offset - 1;
+            }
+        }
+
+        /* Detect when we are walking down the "left edge" of a range */
+        if (state->vstart[height] == -1
+            && (height + 1) <= state->start_height
+            && state->vstart[height + 1] == vaddr_in) {
+
+            state->vstart[height] = vaddr_in;
+            state->vend[height] = vaddr_in;
+            state->ent[height] = pte->pte64_t;
+            state->offset[height] = offset;
+            state->last_offset[height] = offset;
+
+            if (is_leaf) {
+                state->pstart = paddr;
+                state->pend = paddr;
+            }
+
+            /* Detect contiguous entries at same level */
+        } else if ((state->vstart[height] != -1)
+                   && (state->start_height >= height)
+                   && cc->sysemu_ops->pte_flags(state->ent[height]) == prot
+                   && (((state->last_offset[height] + 1) % entries_per_node)
+                       == offset)
+                   && ((!is_leaf)
+                       || (!state->require_physical_contiguity)
+                       || (state->pend + size == paddr))) {
+
+
+            /*
+             * If there are entries at the levels below, make sure we
+             * completed them.  We only compress interior nodes
+             * without holes in the mappings.
+             */
+            if (height != 1) {
+                for (int i = height - 1; i >= 1; i--) {
+                    int entries = cc->sysemu_ops->page_table_entries_per_node(
+                        cs, i);
+
+                    /* Stop if we hit large pages before level 1 */
+                    if (state->vstart[i] == -1) {
+                        break;
+                    }
+
+                    if ((state->last_offset[i] + 1) != entries) {
+                        flush = true;
+                        start_new_run = true;
+                        break;
+                    }
+                }
+            }
+
+
+            if (!flush) {
+
+                /* We can compress these entries */
+                state->ent[height] = pte->pte64_t;
+                state->vend[height] = vaddr_in;
+                state->last_offset[height] = offset;
+
+                /* Only update the physical range on leaves */
+                if (is_leaf) {
+                    state->pend = paddr;
+                }
+            }
+            /* Let PTEs accumulate... */
+        } else {
+            flush = true;
+        }
+
+        if (flush) {
+            /*
+             * We hit dicontiguous permissions or pages.
+             * Print the old entries, then start accumulating again
+             *
+             * Some clients only want the flusher called on a leaf.
+             * Check that too.
+             *
+             * We can infer whether the accumulated range includes a
+             * leaf based on whether pstart is -1.
+             */
+            if (state->flush_interior || (state->pstart != -1)) {
+                if (state->flusher(cs, state)) {
+                    start_new_run = true;
+                }
+            } else {
+                start_new_run = true;
+            }
+        }
+    } else {
+        start_new_run = true;
+    }
+
+    if (start_new_run) {
+        /* start a new run with this PTE */
+        for (int i = state->start_height; i > 0; i--) {
+            if (state->vstart[i] != -1) {
+                state->ent[i] = 0;
+                state->last_offset[i] = 0;
+                state->vstart[i] = -1;
+            }
+        }
+        state->pstart = -1;
+        state->vstart[height] = vaddr_in;
+        state->vend[height] = vaddr_in;
+        state->ent[height] = pte->pte64_t;
+        state->offset[height] = offset;
+        state->last_offset[height] = offset;
+        if (is_leaf) {
+            state->pstart = paddr;
+            state->pend = paddr;
+        }
+        state->start_height = height;
+    }
+
+    return 0;
+}
+
+void hmp_info_pg(Monitor *mon, const QDict *qdict)
+{
+    struct mem_print_state state;
+
+    CPUState *cs = mon_get_cpu(mon);
+    if (!cs) {
+        monitor_printf(mon, "Unable to get CPUState.  Internal error\n");
+        return;
+    }
+
+    CPUClass *cc = CPU_GET_CLASS(cs);
+
+    if ((!cc->sysemu_ops->pte_child)
+        || (!cc->sysemu_ops->pte_leaf)
+        || (!cc->sysemu_ops->pte_leaf_page_size)
+        || (!cc->sysemu_ops->page_table_entries_per_node)
+        || (!cc->sysemu_ops->pte_flags)
+        || (!cc->sysemu_ops->mon_init_page_table_iterator)
+        || (!cc->sysemu_ops->mon_info_pg_print_header)
+        || (!cc->sysemu_ops->mon_flush_page_print_state)) {
+        monitor_printf(mon, "Info pg unsupported on this ISA\n");
+        return;
+    }
+
+    if (!cc->sysemu_ops->mon_init_page_table_iterator(mon, &state)) {
+        monitor_printf(mon, "Unable to initialize page table iterator\n");
+        return;
+    }
+
+    state.flush_interior = true;
+    state.require_physical_contiguity = true;
+    state.flusher = cc->sysemu_ops->mon_flush_page_print_state;
+
+    cc->sysemu_ops->mon_info_pg_print_header(mon, &state);
+
+    /*
+     * We must visit interior entries to get the hierarchy, but
+     * can skip not present mappings
+     */
+    for_each_pte(cs, &compressing_iterator, &state, true, false);
+
+    /* Print last entry, if one present */
+    cc->sysemu_ops->mon_flush_page_print_state(cs, &state);
+}
 static void memory_dump(Monitor *mon, int count, int format, int wsize,
                         hwaddr addr, int is_physical)
 {
diff --git a/target/i386/arch_memory_mapping.c b/target/i386/arch_memory_mapping.c
index d1ff659128..562a00b5a7 100644
--- a/target/i386/arch_memory_mapping.c
+++ b/target/i386/arch_memory_mapping.c
@@ -15,6 +15,356 @@
 #include "cpu.h"
 #include "sysemu/memory_mapping.h"
 
+/**
+ ************** code hook implementations for x86 ***********
+ */
+
+#define PML4_ADDR_MASK 0xffffffffff000ULL /* selects bits 51:12 */
+
+/**
+ * x86_page_table_root - Given a CPUState, return the physical address
+ *                       of the current page table root, as well as
+ *                       write the height of the tree into *height.
+ *
+ * @cs - CPU state
+ * @height - a pointer to an integer, to store the page table tree height
+ *
+ * Returns a hardware address on success.  Should not fail (i.e., caller is
+ * responsible to ensure that a page table is actually present).
+ */
+hwaddr x86_page_table_root(CPUState *cs, int *height)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+    /*
+     * DEP 5/15/24: Some original page table walking code sets the a20
+     * mask as a 32 bit integer and checks it on each level of hte
+     * page table walk; some only checks it against the final result.
+     * For 64 bits, I think we need to sign extend in the common case
+     * it is not set (and returns -1), or we will lose bits.
+     */
+    int64_t a20_mask;
+
+    assert(cpu_paging_enabled(cs));
+    a20_mask = x86_get_a20_mask(env);
+
+    if (env->cr[4] & CR4_PAE_MASK) {
+#ifdef TARGET_X86_64
+        if (env->hflags & HF_LMA_MASK) {
+            if (env->cr[4] & CR4_LA57_MASK) {
+                *height = 5;
+            } else {
+                *height = 4;
+            }
+            return (env->cr[3] & PML4_ADDR_MASK) & a20_mask;
+        } else
+#endif
+        {
+            *height = 3;
+            return (env->cr[3] & ~0x1f) & a20_mask;
+        }
+    } else {
+        *height = 2;
+        return (env->cr[3] & ~0xfff) & a20_mask;
+    }
+}
+
+
+/**
+ * x86_page_table_entries_per_node - Return the number of entries in a
+ *                                   page table node for the CPU at a
+ *                                   given height.
+ *
+ * @cs - CPU state
+ * @height - height of the page table tree to query, where the leaves
+ *          are 1.
+ *
+ * Returns a value greater than zero on success, -1 on error.
+ */
+int x86_page_table_entries_per_node(CPUState *cs, int height)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
+
+    assert(height < 6);
+    assert(height > 0);
+
+    switch (height) {
+#ifdef TARGET_X86_64
+    case 5:
+        assert(env->cr[4] & CR4_LA57_MASK);
+    case 4:
+        assert(env->hflags & HF_LMA_MASK);
+        assert(pae_enabled);
+        return 512;
+#endif
+    case 3:
+        assert(pae_enabled);
+#ifdef TARGET_X86_64
+        if (env->hflags & HF_LMA_MASK) {
+            return 512;
+        } else
+#endif
+        {
+            return 4;
+        }
+    case 2:
+    case 1:
+        return pae_enabled ? 512 : 1024;
+    default:
+        g_assert_not_reached();
+    }
+    return -1;
+}
+
+/**
+ * x86_pte_leaf_page_size - Return the page size of a leaf entry,
+ *                          given the height and CPU state
+ *
+ * @cs - CPU state
+ * @height - height of the page table tree to query, where the leaves
+ *          are 1.
+ *
+ * Returns a value greater than zero on success, -1 on error.
+ */
+uint64_t x86_pte_leaf_page_size(CPUState *cs, int height)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
+
+    assert(height < 6);
+    assert(height > 0);
+
+    switch (height) {
+#ifdef TARGET_X86_64
+    case 5:
+        assert(pae_enabled);
+        assert(env->cr[4] & CR4_LA57_MASK);
+        assert(env->hflags & HF_LMA_MASK);
+        return 1ULL << 48;
+    case 4:
+        assert(pae_enabled);
+        assert(env->hflags & HF_LMA_MASK);
+        return 1ULL << 39;
+#endif
+    case 3:
+        assert(pae_enabled);
+        return 1 << 30;
+    case 2:
+        if (pae_enabled) {
+            return 1 << 21;
+        } else {
+            return 1 << 22;
+        }
+    case 1:
+        return 4096;
+    default:
+        g_assert_not_reached();
+    }
+    return -1;
+}
+
+/*
+ * Given a CPU state and height, return the number of bits
+ * to shift right/left in going from virtual to PTE index
+ * and vice versa, the number of useful bits.
+ */
+static void _mmu_decode_va_parameters(CPUState *cs, int height,
+                                      int *shift, int *width)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+    int _shift = 0;
+    int _width = 0;
+    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
+
+    switch (height) {
+    case 5:
+        _shift = 48;
+        _width = 9;
+        break;
+    case 4:
+        _shift = 39;
+        _width = 9;
+        break;
+    case 3:
+        _shift = 30;
+        _width = 9;
+        break;
+    case 2:
+        /* 64 bit page tables shift from 30->21 bits here */
+        if (pae_enabled) {
+            _shift = 21;
+            _width = 9;
+        } else {
+            /* 32 bit page tables shift from 32->22 bits */
+            _shift = 22;
+            _width = 10;
+        }
+        break;
+    case 1:
+        _shift = 12;
+        if (pae_enabled) {
+            _width = 9;
+        } else {
+            _width = 10;
+        }
+
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    if (shift) {
+        *shift = _shift;
+    }
+
+    if (width) {
+        *width = _width;
+    }
+}
+
+/**
+ * get_pte - Copy the contents of the page table entry at node[i] into pt_entry.
+ *           Optionally, add the relevant bits to the virtual address in
+ *           vaddr_pte.
+ *
+ * @cs - CPU state
+ * @node - physical address of the current page table node
+ * @i - index (in page table entries, not bytes) of the page table
+ *      entry, within node
+ * @height - height of node within the tree (leaves are 1, not 0)
+ * @pt_entry - Poiter to a PTE_t, stores the contents of the page table entry
+ * @vaddr_parent - The virtual address bits already translated in walking the
+ *                 page table to node.  Optional: only used if vaddr_pte is set.
+ * @vaddr_pte - Optional pointer to a variable storing the virtual address bits
+ *              translated by node[i].
+ * @pte_paddr - Pointer to the physical address of the PTE within node.
+ *              Optional parameter.
+ */
+void
+x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
+            PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
+            hwaddr *pte_paddr)
+
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+    int32_t a20_mask = x86_get_a20_mask(env);
+    hwaddr pte;
+
+    if (env->hflags & HF_LMA_MASK) {
+        /* 64 bit */
+        int pte_width = 8;
+        pte = (node + (i * pte_width)) & a20_mask;
+        pt_entry->pte64_t = address_space_ldq(cs->as, pte,
+                                              MEMTXATTRS_UNSPECIFIED, NULL);
+    } else {
+        /* 32 bit */
+        int pte_width = 4;
+        pte = (node + (i * pte_width)) & a20_mask;
+        pt_entry->pte32_t = address_space_ldl(cs->as, pte,
+                                              MEMTXATTRS_UNSPECIFIED, NULL);
+    }
+
+    if (vaddr_pte) {
+        int shift = 0;
+        _mmu_decode_va_parameters(cs, height, &shift, NULL);
+        *vaddr_pte = vaddr_parent | ((i & 0x1ffULL) << shift);
+    }
+
+    if (pte_paddr) {
+        *pte_paddr = pte;
+    }
+}
+
+
+static bool
+mmu_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+    if (env->hflags & HF_LMA_MASK) {
+        return pte->pte64_t & mask;
+    } else {
+        return pte->pte32_t & mask;
+    }
+}
+
+/**
+ * x86_pte_present - Return true if the pte is marked 'present'
+ */
+bool
+x86_pte_present(CPUState *cs, PTE_t *pte)
+{
+    return mmu_pte_check_bits(cs, pte, PG_PRESENT_MASK);
+}
+
+/**
+ * x86_pte_leaf - Return true if the pte is
+ *                a page table leaf, false if
+ *                the pte points to another
+ *                node in the radix tree.
+ */
+bool
+x86_pte_leaf(CPUState *cs, int height, PTE_t *pte)
+{
+    return height == 1 || mmu_pte_check_bits(cs, pte, PG_PSE_MASK);
+}
+
+/**
+ * x86_pte_child - Returns the physical address
+ *                 of a radix tree node pointed to by pte.
+ *
+ * @cs - CPU state
+ * @pte - The page table entry
+ * @height - The height in the tree of pte
+ *
+ * Returns the physical address stored in pte on success,
+ *     -1 on error.
+ */
+hwaddr
+x86_pte_child(CPUState *cs, PTE_t *pte, int height)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
+    int32_t a20_mask = x86_get_a20_mask(env);
+
+    switch (height) {
+#ifdef TARGET_X86_64
+    case 5:
+        assert(env->cr[4] & CR4_LA57_MASK);
+    case 4:
+        assert(env->hflags & HF_LMA_MASK);
+        /* assert(pae_enabled); */
+        /* Fall through */
+#endif
+    case 3:
+        assert(pae_enabled);
+#ifdef TARGET_X86_64
+        if (env->hflags & HF_LMA_MASK) {
+            return (pte->pte64_t & PG_ADDRESS_MASK) & a20_mask;
+        } else
+#endif
+        {
+            return (pte->pte64_t & ~0xfff) & a20_mask;
+        }
+    case 2:
+    case 1:
+        if (pae_enabled) {
+            return (pte->pte64_t & PG_ADDRESS_MASK) & a20_mask;
+        } else {
+            return (pte->pte32_t & ~0xfff) & a20_mask;
+        }
+    default:
+        g_assert_not_reached();
+    }
+    return -1;
+}
+
 /* PAE Paging or IA-32e Paging */
 static void walk_pte(MemoryMappingList *list, AddressSpace *as,
                      hwaddr pte_start_addr,
@@ -313,4 +663,3 @@ bool x86_cpu_get_memory_mapping(CPUState *cs, MemoryMappingList *list,
 
     return true;
 }
-
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 914bef442c..8bd6164b68 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -8305,6 +8305,17 @@ static const struct SysemuCPUOps i386_sysemu_ops = {
     .write_elf32_qemunote = x86_cpu_write_elf32_qemunote,
     .write_elf64_qemunote = x86_cpu_write_elf64_qemunote,
     .legacy_vmsd = &vmstate_x86_cpu,
+    .page_table_root = &x86_page_table_root,
+    .page_table_entries_per_node = &x86_page_table_entries_per_node,
+    .get_pte = &x86_get_pte,
+    .pte_present = &x86_pte_present,
+    .pte_leaf = &x86_pte_leaf,
+    .pte_child = &x86_pte_child,
+    .pte_leaf_page_size = &x86_pte_leaf_page_size,
+    .pte_flags = &x86_pte_flags,
+    .mon_init_page_table_iterator = &x86_mon_init_page_table_iterator,
+    .mon_info_pg_print_header = &x86_mon_info_pg_print_header,
+    .mon_flush_page_print_state = &x86_mon_flush_print_pg_state,
 };
 #endif
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index c64ef0c1a2..cbb6f6fc4d 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -21,6 +21,7 @@
 #define I386_CPU_H
 
 #include "sysemu/tcg.h"
+#include "hw/core/sysemu-cpu-ops.h"
 #include "cpu-qom.h"
 #include "kvm/hyperv-proto.h"
 #include "exec/cpu-defs.h"
@@ -2150,8 +2151,22 @@ int x86_cpu_write_elf64_qemunote(WriteCoreDumpFunction f, CPUState *cpu,
 int x86_cpu_write_elf32_qemunote(WriteCoreDumpFunction f, CPUState *cpu,
                                  DumpState *s);
 
+hwaddr x86_page_table_root(CPUState *cs, int *height);
+int x86_page_table_entries_per_node(CPUState *cs, int height);
+uint64_t x86_pte_leaf_page_size(CPUState *cs, int height);
+void x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
+                 PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
+                 hwaddr *pte_paddr);
+bool x86_pte_present(CPUState *cs, PTE_t *pte);
+bool x86_pte_leaf(CPUState *cs, int height, PTE_t *pte);
+hwaddr x86_pte_child(CPUState *cs, PTE_t *pte, int height);
+uint64_t x86_pte_flags(uint64_t pte);
 bool x86_cpu_get_memory_mapping(CPUState *cpu, MemoryMappingList *list,
                                 Error **errp);
+bool x86_mon_init_page_table_iterator(Monitor *mon,
+                                      struct mem_print_state *state);
+void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state);
+bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
 
 void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
 
diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index 2d766b2637..65e82e73e8 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -32,6 +32,171 @@
 #include "qapi/qapi-commands-misc-target.h"
 #include "qapi/qapi-commands-misc.h"
 
+
+/********************* x86 specific hooks for printing page table stuff ****/
+
+const char *names[7] = {(char *)NULL, "PTE", "PDE", "PDP", "PML4", "Pml5",
+                        (char *)NULL};
+static char *pg_bits(hwaddr ent)
+{
+    static char buf[32];
+    snprintf(buf, 32, "%c%c%c%c%c%c%c%c%c%c",
+            ent & PG_NX_MASK ? 'X' : '-',
+            ent & PG_GLOBAL_MASK ? 'G' : '-',
+            ent & PG_PSE_MASK ? 'S' : '-',
+            ent & PG_DIRTY_MASK ? 'D' : '-',
+            ent & PG_ACCESSED_MASK ? 'A' : '-',
+            ent & PG_PCD_MASK ? 'C' : '-',
+            ent & PG_PWT_MASK ? 'T' : '-',
+            ent & PG_USER_MASK ? 'U' : '-',
+            ent & PG_RW_MASK ? 'W' : '-',
+            ent & PG_PRESENT_MASK ? 'P' : '-');
+    return buf;
+}
+
+bool x86_mon_init_page_table_iterator(Monitor *mon,
+                                      struct mem_print_state *state)
+{
+    CPUArchState *env;
+    state->mon = mon;
+    state->flush_interior = false;
+    state->require_physical_contiguity = false;
+
+    for (int i = 0; i < MAX_HEIGHT; i++) {
+        state->vstart[i] = -1;
+        state->last_offset[i] = 0;
+    }
+    state->start_height = 0;
+
+    env = mon_get_cpu_env(mon);
+    if (!env) {
+        monitor_printf(mon, "No CPU available\n");
+        return false;
+    }
+    state->env = env;
+
+    if (!(env->cr[0] & CR0_PG_MASK)) {
+        monitor_printf(mon, "PG disabled\n");
+        return false;
+    }
+
+    /* set va and pa width */
+    if (env->cr[4] & CR4_PAE_MASK) {
+        state->paw = 13;
+#ifdef TARGET_X86_64
+        if (env->hflags & HF_LMA_MASK) {
+            if (env->cr[4] & CR4_LA57_MASK) {
+                state->vaw = 15;
+                state->max_height = 5;
+            } else {
+                state->vaw = 12;
+                state->max_height = 4;
+            }
+        } else
+#endif
+        {
+            state->vaw = 8;
+            state->max_height = 3;
+        }
+    } else {
+        state->max_height = 2;
+        state->vaw = 8;
+        state->paw = 8;
+    }
+
+    return true;
+}
+
+void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state)
+{
+    /* Header line */
+    monitor_printf(mon, "%-*s %-13s %-10s %*s%s\n",
+                   3 + 2 * (state->vaw - 3), "VPN range",
+                   "Entry", "Flags",
+                   2 * (state->max_height - 1), "", "Physical page(s)");
+}
+
+
+static void pg_print(CPUState *cs, Monitor *mon, uint64_t pt_ent,
+                     target_ulong vaddr_s, target_ulong vaddr_l,
+                     hwaddr paddr_s, hwaddr paddr_l,
+                     int offset_s, int offset_l,
+                     int height, int max_height, int vaw, int paw,
+                     bool is_leaf)
+
+{
+    char buf[128];
+    char *pos = buf, *end = buf + sizeof(buf);
+    target_ulong size = x86_pte_leaf_page_size(cs, height);
+
+    /* VFN range */
+    pos += snprintf(pos, end - pos, "%*s[%0*"PRIx64"-%0*"PRIx64"] ",
+                    (max_height - height) * 2, "",
+                    vaw - 3, (uint64_t)vaddr_s >> 12,
+                    vaw - 3, ((uint64_t)vaddr_l + size - 1) >> 12);
+
+    /* Slot */
+    if (vaddr_s == vaddr_l) {
+        pos += snprintf(pos, end - pos, "%4s[%03x]    ",
+                       names[height], offset_s);
+    } else {
+        pos += snprintf(pos, end - pos, "%4s[%03x-%03x]",
+                       names[height], offset_s, offset_l);
+    }
+
+    /* Flags */
+    pos += snprintf(pos, end - pos, " %s", pg_bits(pt_ent));
+
+
+    /* Range-compressed PFN's */
+    if (is_leaf) {
+        if (vaddr_s == vaddr_l) {
+            pos += snprintf(pos, end - pos, " %0*"PRIx64,
+                            paw - 3, (uint64_t)paddr_s >> 12);
+        } else {
+            pos += snprintf(pos, end - pos, " %0*"PRIx64"-%0*"PRIx64,
+                            paw - 3, (uint64_t)paddr_s >> 12,
+                            paw - 3, (uint64_t)paddr_l >> 12);
+        }
+        pos = MIN(pos, end);
+    }
+
+    /* Trim line to fit screen */
+    if (pos - buf > 79) {
+        strcpy(buf + 77, "..");
+    }
+
+    monitor_printf(mon, "%s\n", buf);
+}
+
+uint64_t x86_pte_flags(uint64_t pte)
+{
+    return pte & (PG_USER_MASK | PG_RW_MASK |
+                  PG_PRESENT_MASK);
+}
+
+/* Returns true if it emitted anything */
+bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state)
+{
+    bool ret = false;
+    for (int i = state->start_height; i > 0; i--) {
+        if (state->vstart[i] == -1) {
+            break;
+        }
+        PTE_t my_pte;
+        my_pte.pte64_t = state->ent[i];
+        ret = true;
+        pg_print(cs, state->mon, state->ent[i],
+                 state->vstart[i], state->vend[i],
+                 state->pstart, state->pend,
+                 state->offset[i], state->last_offset[i],
+                 i, state->max_height, state->vaw, state->paw,
+                 x86_pte_leaf(cs, i, &my_pte));
+    }
+
+    return ret;
+}
+
 /* Perform linear address sign extension */
 static hwaddr addr_canonical(CPUArchState *env, hwaddr addr)
 {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 2/6] Convert 'info tlb' to use generic iterator
  2024-06-06 14:02 [PATCH v3 0/6] Rework x86 page table walks Don Porter
  2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
@ 2024-06-06 14:02 ` Don Porter
  2024-06-07  6:02   ` Philippe Mathieu-Daudé
  2024-06-06 14:02 ` [PATCH v3 3/6] Convert 'info mem' " Don Porter
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Don Porter @ 2024-06-06 14:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson, philmd,
	Don Porter

Signed-off-by: Don Porter <porter@cs.unc.edu>
---
 include/hw/core/sysemu-cpu-ops.h |   7 +
 monitor/hmp-cmds-target.c        |   1 +
 target/i386/cpu.h                |   2 +
 target/i386/monitor.c            | 217 ++++++-------------------------
 4 files changed, 53 insertions(+), 174 deletions(-)

diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
index eb16a1c3e2..bf3de3e004 100644
--- a/include/hw/core/sysemu-cpu-ops.h
+++ b/include/hw/core/sysemu-cpu-ops.h
@@ -243,6 +243,13 @@ typedef struct SysemuCPUOps {
     bool (*mon_flush_page_print_state)(CPUState *cs,
                                        struct mem_print_state *state);
 
+    /**
+     * @mon_print_pte: Hook called by the monitor to print a page
+     * table entry at address addr, with contents pte.
+     */
+    void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
+                           hwaddr pte);
+
 } SysemuCPUOps;
 
 #endif /* SYSEMU_CPU_OPS_H */
diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 60a8bd0c37..3393e5ad0b 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -318,6 +318,7 @@ void hmp_info_pg(Monitor *mon, const QDict *qdict)
     /* Print last entry, if one present */
     cc->sysemu_ops->mon_flush_page_print_state(cs, &state);
 }
+
 static void memory_dump(Monitor *mon, int count, int format, int wsize,
                         hwaddr addr, int is_physical)
 {
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index cbb6f6fc4d..1346ec0033 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2167,6 +2167,8 @@ bool x86_mon_init_page_table_iterator(Monitor *mon,
                                       struct mem_print_state *state);
 void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state);
 bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
+void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
+                       hwaddr pte);
 
 void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
 
diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index 65e82e73e8..ecde164857 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -214,202 +214,71 @@ static hwaddr addr_canonical(CPUArchState *env, hwaddr addr)
     return addr;
 }
 
-static void print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
-                      hwaddr pte, hwaddr mask)
+void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
+                       hwaddr pte)
 {
+    char buf[128];
+    char *pos = buf, *end = buf + sizeof(buf);
+
     addr = addr_canonical(env, addr);
 
-    monitor_printf(mon, HWADDR_FMT_plx ": " HWADDR_FMT_plx
-                   " %c%c%c%c%c%c%c%c%c\n",
-                   addr,
-                   pte & mask,
-                   pte & PG_NX_MASK ? 'X' : '-',
-                   pte & PG_GLOBAL_MASK ? 'G' : '-',
-                   pte & PG_PSE_MASK ? 'P' : '-',
-                   pte & PG_DIRTY_MASK ? 'D' : '-',
-                   pte & PG_ACCESSED_MASK ? 'A' : '-',
-                   pte & PG_PCD_MASK ? 'C' : '-',
-                   pte & PG_PWT_MASK ? 'T' : '-',
-                   pte & PG_USER_MASK ? 'U' : '-',
-                   pte & PG_RW_MASK ? 'W' : '-');
-}
+    pos += snprintf(pos, end - pos, HWADDR_FMT_plx ": " HWADDR_FMT_plx " ",
+                    addr, (hwaddr) (pte & PG_ADDRESS_MASK));
 
-static void tlb_info_32(Monitor *mon, CPUArchState *env)
-{
-    unsigned int l1, l2;
-    uint32_t pgd, pde, pte;
+    pos += snprintf(pos, end - pos, " %s", pg_bits(pte));
 
-    pgd = env->cr[3] & ~0xfff;
-    for(l1 = 0; l1 < 1024; l1++) {
-        cpu_physical_memory_read(pgd + l1 * 4, &pde, 4);
-        pde = le32_to_cpu(pde);
-        if (pde & PG_PRESENT_MASK) {
-            if ((pde & PG_PSE_MASK) && (env->cr[4] & CR4_PSE_MASK)) {
-                /* 4M pages */
-                print_pte(mon, env, (l1 << 22), pde, ~((1 << 21) - 1));
-            } else {
-                for(l2 = 0; l2 < 1024; l2++) {
-                    cpu_physical_memory_read((pde & ~0xfff) + l2 * 4, &pte, 4);
-                    pte = le32_to_cpu(pte);
-                    if (pte & PG_PRESENT_MASK) {
-                        print_pte(mon, env, (l1 << 22) + (l2 << 12),
-                                  pte & ~PG_PSE_MASK,
-                                  ~0xfff);
-                    }
-                }
-            }
-        }
+    /* Trim line to fit screen */
+    if (pos - buf > 79) {
+        strcpy(buf + 77, "..");
     }
-}
 
-static void tlb_info_pae32(Monitor *mon, CPUArchState *env)
-{
-    unsigned int l1, l2, l3;
-    uint64_t pdpe, pde, pte;
-    uint64_t pdp_addr, pd_addr, pt_addr;
-
-    pdp_addr = env->cr[3] & ~0x1f;
-    for (l1 = 0; l1 < 4; l1++) {
-        cpu_physical_memory_read(pdp_addr + l1 * 8, &pdpe, 8);
-        pdpe = le64_to_cpu(pdpe);
-        if (pdpe & PG_PRESENT_MASK) {
-            pd_addr = pdpe & 0x3fffffffff000ULL;
-            for (l2 = 0; l2 < 512; l2++) {
-                cpu_physical_memory_read(pd_addr + l2 * 8, &pde, 8);
-                pde = le64_to_cpu(pde);
-                if (pde & PG_PRESENT_MASK) {
-                    if (pde & PG_PSE_MASK) {
-                        /* 2M pages with PAE, CR4.PSE is ignored */
-                        print_pte(mon, env, (l1 << 30) + (l2 << 21), pde,
-                                  ~((hwaddr)(1 << 20) - 1));
-                    } else {
-                        pt_addr = pde & 0x3fffffffff000ULL;
-                        for (l3 = 0; l3 < 512; l3++) {
-                            cpu_physical_memory_read(pt_addr + l3 * 8, &pte, 8);
-                            pte = le64_to_cpu(pte);
-                            if (pte & PG_PRESENT_MASK) {
-                                print_pte(mon, env, (l1 << 30) + (l2 << 21)
-                                          + (l3 << 12),
-                                          pte & ~PG_PSE_MASK,
-                                          ~(hwaddr)0xfff);
-                            }
-                        }
-                    }
-                }
-            }
-        }
-    }
+    monitor_printf(mon, "%s\n", buf);
 }
 
-#ifdef TARGET_X86_64
-static void tlb_info_la48(Monitor *mon, CPUArchState *env,
-        uint64_t l0, uint64_t pml4_addr)
+static
+int mem_print_tlb(CPUState *cs, void *data, PTE_t *pte,
+                  vaddr vaddr_in, int height, int offset)
 {
-    uint64_t l1, l2, l3, l4;
-    uint64_t pml4e, pdpe, pde, pte;
-    uint64_t pdp_addr, pd_addr, pt_addr;
-
-    for (l1 = 0; l1 < 512; l1++) {
-        cpu_physical_memory_read(pml4_addr + l1 * 8, &pml4e, 8);
-        pml4e = le64_to_cpu(pml4e);
-        if (!(pml4e & PG_PRESENT_MASK)) {
-            continue;
-        }
+    struct mem_print_state *state = (struct mem_print_state *) data;
+    CPUClass *cc = CPU_GET_CLASS(cs);
 
-        pdp_addr = pml4e & 0x3fffffffff000ULL;
-        for (l2 = 0; l2 < 512; l2++) {
-            cpu_physical_memory_read(pdp_addr + l2 * 8, &pdpe, 8);
-            pdpe = le64_to_cpu(pdpe);
-            if (!(pdpe & PG_PRESENT_MASK)) {
-                continue;
-            }
-
-            if (pdpe & PG_PSE_MASK) {
-                /* 1G pages, CR4.PSE is ignored */
-                print_pte(mon, env, (l0 << 48) + (l1 << 39) + (l2 << 30),
-                        pdpe, 0x3ffffc0000000ULL);
-                continue;
-            }
-
-            pd_addr = pdpe & 0x3fffffffff000ULL;
-            for (l3 = 0; l3 < 512; l3++) {
-                cpu_physical_memory_read(pd_addr + l3 * 8, &pde, 8);
-                pde = le64_to_cpu(pde);
-                if (!(pde & PG_PRESENT_MASK)) {
-                    continue;
-                }
-
-                if (pde & PG_PSE_MASK) {
-                    /* 2M pages, CR4.PSE is ignored */
-                    print_pte(mon, env, (l0 << 48) + (l1 << 39) + (l2 << 30) +
-                            (l3 << 21), pde, 0x3ffffffe00000ULL);
-                    continue;
-                }
-
-                pt_addr = pde & 0x3fffffffff000ULL;
-                for (l4 = 0; l4 < 512; l4++) {
-                    cpu_physical_memory_read(pt_addr
-                            + l4 * 8,
-                            &pte, 8);
-                    pte = le64_to_cpu(pte);
-                    if (pte & PG_PRESENT_MASK) {
-                        print_pte(mon, env, (l0 << 48) + (l1 << 39) +
-                                (l2 << 30) + (l3 << 21) + (l4 << 12),
-                                pte & ~PG_PSE_MASK, 0x3fffffffff000ULL);
-                    }
-                }
-            }
-        }
-    }
+    cc->sysemu_ops->mon_print_pte(state->mon, state->env, vaddr_in,
+                                  pte->pte64_t);
+    return 0;
 }
 
-static void tlb_info_la57(Monitor *mon, CPUArchState *env)
+void hmp_info_tlb(Monitor *mon, const QDict *qdict)
 {
-    uint64_t l0;
-    uint64_t pml5e;
-    uint64_t pml5_addr;
+    struct mem_print_state state;
 
-    pml5_addr = env->cr[3] & 0x3fffffffff000ULL;
-    for (l0 = 0; l0 < 512; l0++) {
-        cpu_physical_memory_read(pml5_addr + l0 * 8, &pml5e, 8);
-        pml5e = le64_to_cpu(pml5e);
-        if (pml5e & PG_PRESENT_MASK) {
-            tlb_info_la48(mon, env, l0, pml5e & 0x3fffffffff000ULL);
-        }
+    CPUState *cs = mon_get_cpu(mon);
+    if (!cs) {
+        monitor_printf(mon, "Unable to get CPUState.  Internal error\n");
+        return;
     }
-}
-#endif /* TARGET_X86_64 */
 
-void hmp_info_tlb(Monitor *mon, const QDict *qdict)
-{
-    CPUArchState *env;
+    CPUClass *cc = CPU_GET_CLASS(cs);
 
-    env = mon_get_cpu_env(mon);
-    if (!env) {
-        monitor_printf(mon, "No CPU available\n");
-        return;
+    if ((!cc->sysemu_ops->pte_child)
+        || (!cc->sysemu_ops->pte_leaf)
+        || (!cc->sysemu_ops->pte_leaf_page_size)
+        || (!cc->sysemu_ops->page_table_entries_per_node)
+        || (!cc->sysemu_ops->pte_flags)
+        || (!cc->sysemu_ops->mon_print_pte)
+        || (!cc->sysemu_ops->mon_init_page_table_iterator)) {
+        monitor_printf(mon, "Info tlb unsupported on this ISA\n");
     }
 
-    if (!(env->cr[0] & CR0_PG_MASK)) {
-        monitor_printf(mon, "PG disabled\n");
+    if (!cc->sysemu_ops->mon_init_page_table_iterator(mon, &state)) {
+        monitor_printf(mon, "Unable to initialize page table iterator\n");
         return;
     }
-    if (env->cr[4] & CR4_PAE_MASK) {
-#ifdef TARGET_X86_64
-        if (env->hflags & HF_LMA_MASK) {
-            if (env->cr[4] & CR4_LA57_MASK) {
-                tlb_info_la57(mon, env);
-            } else {
-                tlb_info_la48(mon, env, 0, env->cr[3] & 0x3fffffffff000ULL);
-            }
-        } else
-#endif
-        {
-            tlb_info_pae32(mon, env);
-        }
-    } else {
-        tlb_info_32(mon, env);
-    }
+
+    /**
+     * 'info tlb' visits only leaf PTEs marked present.
+     * It does not check other protection bits.
+     */
+    for_each_pte(cs, &mem_print_tlb, &state, false, false);
 }
 
 static void mem_print(Monitor *mon, CPUArchState *env,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 3/6] Convert 'info mem' to use generic iterator
  2024-06-06 14:02 [PATCH v3 0/6] Rework x86 page table walks Don Porter
  2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
  2024-06-06 14:02 ` [PATCH v3 2/6] Convert 'info tlb' to use generic iterator Don Porter
@ 2024-06-06 14:02 ` Don Porter
  2024-06-10  9:15   ` Daniel P. Berrangé
  2024-06-06 14:02 ` [PATCH v3 4/6] Convert x86_cpu_get_memory_mapping() to use generic iterators Don Porter
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Don Porter @ 2024-06-06 14:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson, philmd,
	Don Porter

Signed-off-by: Don Porter <porter@cs.unc.edu>
---
 include/hw/core/sysemu-cpu-ops.h |   6 +
 include/monitor/monitor.h        |   4 +
 monitor/hmp-cmds-target.c        |   5 +-
 target/i386/cpu.c                |   1 +
 target/i386/cpu.h                |   1 +
 target/i386/monitor.c            | 354 ++++---------------------------
 6 files changed, 60 insertions(+), 311 deletions(-)

diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
index bf3de3e004..3bef129460 100644
--- a/include/hw/core/sysemu-cpu-ops.h
+++ b/include/hw/core/sysemu-cpu-ops.h
@@ -250,6 +250,12 @@ typedef struct SysemuCPUOps {
     void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
                            hwaddr pte);
 
+    /**
+     * @mon_print_mem: Hook called by the monitor to print a range
+     * of memory mappings in 'info mem'
+     */
+    bool (*mon_print_mem)(CPUState *cs, struct mem_print_state *state);
+
 } SysemuCPUOps;
 
 #endif /* SYSEMU_CPU_OPS_H */
diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
index 965f5d5450..e954946ba0 100644
--- a/include/monitor/monitor.h
+++ b/include/monitor/monitor.h
@@ -5,6 +5,7 @@
 #include "qapi/qapi-types-misc.h"
 #include "qemu/readline.h"
 #include "exec/hwaddr.h"
+#include "hw/core/cpu.h"
 
 typedef struct MonitorHMP MonitorHMP;
 typedef struct MonitorOptions MonitorOptions;
@@ -63,4 +64,7 @@ void monitor_register_hmp_info_hrt(const char *name,
 int error_vprintf_unless_qmp(const char *fmt, va_list ap) G_GNUC_PRINTF(1, 0);
 int error_printf_unless_qmp(const char *fmt, ...) G_GNUC_PRINTF(1, 2);
 
+int compressing_iterator(CPUState *cs, void *data, PTE_t *pte, vaddr vaddr_in,
+                         int height, int offset);
+
 #endif /* MONITOR_H */
diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 3393e5ad0b..8ce37d3187 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -122,9 +122,8 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
 }
 
 /* Assume only called on present entries */
-static
-int compressing_iterator(CPUState *cs, void *data, PTE_t *pte,
-                         vaddr vaddr_in, int height, int offset)
+int compressing_iterator(CPUState *cs, void *data, PTE_t *pte, vaddr vaddr_in,
+                         int height, int offset)
 {
     CPUClass *cc = CPU_GET_CLASS(cs);
     struct mem_print_state *state = (struct mem_print_state *) data;
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 8bd6164b68..046d75f6bb 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -8316,6 +8316,7 @@ static const struct SysemuCPUOps i386_sysemu_ops = {
     .mon_init_page_table_iterator = &x86_mon_init_page_table_iterator,
     .mon_info_pg_print_header = &x86_mon_info_pg_print_header,
     .mon_flush_page_print_state = &x86_mon_flush_print_pg_state,
+    .mon_print_mem = &x86_mon_print_mem,
 };
 #endif
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 1346ec0033..1e463cc556 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2169,6 +2169,7 @@ void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state);
 bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
 void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
                        hwaddr pte);
+bool x86_mon_print_mem(CPUState *cs, struct mem_print_state *state);
 
 void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
 
diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index ecde164857..215c018d1f 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -281,332 +281,70 @@ void hmp_info_tlb(Monitor *mon, const QDict *qdict)
     for_each_pte(cs, &mem_print_tlb, &state, false, false);
 }
 
-static void mem_print(Monitor *mon, CPUArchState *env,
-                      hwaddr *pstart, int *plast_prot,
-                      hwaddr end, int prot)
+bool x86_mon_print_mem(CPUState *cs, struct mem_print_state *state)
 {
-    int prot1;
-    prot1 = *plast_prot;
-    if (prot != prot1) {
-        if (*pstart != -1) {
-            monitor_printf(mon, HWADDR_FMT_plx "-" HWADDR_FMT_plx " "
-                           HWADDR_FMT_plx " %c%c%c\n",
-                           addr_canonical(env, *pstart),
-                           addr_canonical(env, end),
-                           addr_canonical(env, end - *pstart),
-                           prot1 & PG_USER_MASK ? 'u' : '-',
-                           'r',
-                           prot1 & PG_RW_MASK ? 'w' : '-');
-        }
-        if (prot != 0)
-            *pstart = end;
-        else
-            *pstart = -1;
-        *plast_prot = prot;
-    }
-}
+    CPUArchState *env = state->env;
+    int i = 0;
 
-static void mem_info_32(Monitor *mon, CPUArchState *env)
-{
-    unsigned int l1, l2;
-    int prot, last_prot;
-    uint32_t pgd, pde, pte;
-    hwaddr start, end;
-
-    pgd = env->cr[3] & ~0xfff;
-    last_prot = 0;
-    start = -1;
-    for(l1 = 0; l1 < 1024; l1++) {
-        cpu_physical_memory_read(pgd + l1 * 4, &pde, 4);
-        pde = le32_to_cpu(pde);
-        end = l1 << 22;
-        if (pde & PG_PRESENT_MASK) {
-            if ((pde & PG_PSE_MASK) && (env->cr[4] & CR4_PSE_MASK)) {
-                prot = pde & (PG_USER_MASK | PG_RW_MASK | PG_PRESENT_MASK);
-                mem_print(mon, env, &start, &last_prot, end, prot);
-            } else {
-                for(l2 = 0; l2 < 1024; l2++) {
-                    cpu_physical_memory_read((pde & ~0xfff) + l2 * 4, &pte, 4);
-                    pte = le32_to_cpu(pte);
-                    end = (l1 << 22) + (l2 << 12);
-                    if (pte & PG_PRESENT_MASK) {
-                        prot = pte & pde &
-                            (PG_USER_MASK | PG_RW_MASK | PG_PRESENT_MASK);
-                    } else {
-                        prot = 0;
-                    }
-                    mem_print(mon, env, &start, &last_prot, end, prot);
-                }
-            }
-        } else {
-            prot = 0;
-            mem_print(mon, env, &start, &last_prot, end, prot);
+    /* We need to figure out the lowest populated level */
+    for ( ; i < state->max_height; i++) {
+        if (state->vstart[i] != -1) {
+            break;
         }
     }
-    /* Flush last range */
-    mem_print(mon, env, &start, &last_prot, (hwaddr)1 << 32, 0);
-}
 
-static void mem_info_pae32(Monitor *mon, CPUArchState *env)
-{
-    unsigned int l1, l2, l3;
-    int prot, last_prot;
-    uint64_t pdpe, pde, pte;
-    uint64_t pdp_addr, pd_addr, pt_addr;
-    hwaddr start, end;
-
-    pdp_addr = env->cr[3] & ~0x1f;
-    last_prot = 0;
-    start = -1;
-    for (l1 = 0; l1 < 4; l1++) {
-        cpu_physical_memory_read(pdp_addr + l1 * 8, &pdpe, 8);
-        pdpe = le64_to_cpu(pdpe);
-        end = l1 << 30;
-        if (pdpe & PG_PRESENT_MASK) {
-            pd_addr = pdpe & 0x3fffffffff000ULL;
-            for (l2 = 0; l2 < 512; l2++) {
-                cpu_physical_memory_read(pd_addr + l2 * 8, &pde, 8);
-                pde = le64_to_cpu(pde);
-                end = (l1 << 30) + (l2 << 21);
-                if (pde & PG_PRESENT_MASK) {
-                    if (pde & PG_PSE_MASK) {
-                        prot = pde & (PG_USER_MASK | PG_RW_MASK |
-                                      PG_PRESENT_MASK);
-                        mem_print(mon, env, &start, &last_prot, end, prot);
-                    } else {
-                        pt_addr = pde & 0x3fffffffff000ULL;
-                        for (l3 = 0; l3 < 512; l3++) {
-                            cpu_physical_memory_read(pt_addr + l3 * 8, &pte, 8);
-                            pte = le64_to_cpu(pte);
-                            end = (l1 << 30) + (l2 << 21) + (l3 << 12);
-                            if (pte & PG_PRESENT_MASK) {
-                                prot = pte & pde & (PG_USER_MASK | PG_RW_MASK |
-                                                    PG_PRESENT_MASK);
-                            } else {
-                                prot = 0;
-                            }
-                            mem_print(mon, env, &start, &last_prot, end, prot);
-                        }
-                    }
-                } else {
-                    prot = 0;
-                    mem_print(mon, env, &start, &last_prot, end, prot);
-                }
-            }
-        } else {
-            prot = 0;
-            mem_print(mon, env, &start, &last_prot, end, prot);
-        }
-    }
-    /* Flush last range */
-    mem_print(mon, env, &start, &last_prot, (hwaddr)1 << 32, 0);
-}
+    hwaddr vstart = state->vstart[i];
+    hwaddr end = state->vend[i] + x86_pte_leaf_page_size(cs, i);
+    int prot = x86_pte_flags(state->ent[i]);
 
 
-#ifdef TARGET_X86_64
-static void mem_info_la48(Monitor *mon, CPUArchState *env)
-{
-    int prot, last_prot;
-    uint64_t l1, l2, l3, l4;
-    uint64_t pml4e, pdpe, pde, pte;
-    uint64_t pml4_addr, pdp_addr, pd_addr, pt_addr, start, end;
-
-    pml4_addr = env->cr[3] & 0x3fffffffff000ULL;
-    last_prot = 0;
-    start = -1;
-    for (l1 = 0; l1 < 512; l1++) {
-        cpu_physical_memory_read(pml4_addr + l1 * 8, &pml4e, 8);
-        pml4e = le64_to_cpu(pml4e);
-        end = l1 << 39;
-        if (pml4e & PG_PRESENT_MASK) {
-            pdp_addr = pml4e & 0x3fffffffff000ULL;
-            for (l2 = 0; l2 < 512; l2++) {
-                cpu_physical_memory_read(pdp_addr + l2 * 8, &pdpe, 8);
-                pdpe = le64_to_cpu(pdpe);
-                end = (l1 << 39) + (l2 << 30);
-                if (pdpe & PG_PRESENT_MASK) {
-                    if (pdpe & PG_PSE_MASK) {
-                        prot = pdpe & (PG_USER_MASK | PG_RW_MASK |
-                                       PG_PRESENT_MASK);
-                        prot &= pml4e;
-                        mem_print(mon, env, &start, &last_prot, end, prot);
-                    } else {
-                        pd_addr = pdpe & 0x3fffffffff000ULL;
-                        for (l3 = 0; l3 < 512; l3++) {
-                            cpu_physical_memory_read(pd_addr + l3 * 8, &pde, 8);
-                            pde = le64_to_cpu(pde);
-                            end = (l1 << 39) + (l2 << 30) + (l3 << 21);
-                            if (pde & PG_PRESENT_MASK) {
-                                if (pde & PG_PSE_MASK) {
-                                    prot = pde & (PG_USER_MASK | PG_RW_MASK |
-                                                  PG_PRESENT_MASK);
-                                    prot &= pml4e & pdpe;
-                                    mem_print(mon, env, &start,
-                                              &last_prot, end, prot);
-                                } else {
-                                    pt_addr = pde & 0x3fffffffff000ULL;
-                                    for (l4 = 0; l4 < 512; l4++) {
-                                        cpu_physical_memory_read(pt_addr
-                                                                 + l4 * 8,
-                                                                 &pte, 8);
-                                        pte = le64_to_cpu(pte);
-                                        end = (l1 << 39) + (l2 << 30) +
-                                            (l3 << 21) + (l4 << 12);
-                                        if (pte & PG_PRESENT_MASK) {
-                                            prot = pte & (PG_USER_MASK | PG_RW_MASK |
-                                                          PG_PRESENT_MASK);
-                                            prot &= pml4e & pdpe & pde;
-                                        } else {
-                                            prot = 0;
-                                        }
-                                        mem_print(mon, env, &start,
-                                                  &last_prot, end, prot);
-                                    }
-                                }
-                            } else {
-                                prot = 0;
-                                mem_print(mon, env, &start,
-                                          &last_prot, end, prot);
-                            }
-                        }
-                    }
-                } else {
-                    prot = 0;
-                    mem_print(mon, env, &start, &last_prot, end, prot);
-                }
-            }
-        } else {
-            prot = 0;
-            mem_print(mon, env, &start, &last_prot, end, prot);
-        }
-    }
-    /* Flush last range */
-    mem_print(mon, env, &start, &last_prot, (hwaddr)1 << 48, 0);
+    monitor_printf(state->mon, HWADDR_FMT_plx "-" HWADDR_FMT_plx " "
+                   HWADDR_FMT_plx " %c%c%c\n",
+                   addr_canonical(env, vstart),
+                   addr_canonical(env, end),
+                   addr_canonical(env, end - vstart),
+                   prot & PG_USER_MASK ? 'u' : '-',
+                   'r',
+                   prot & PG_RW_MASK ? 'w' : '-');
+    return true;
 }
 
-static void mem_info_la57(Monitor *mon, CPUArchState *env)
+void hmp_info_mem(Monitor *mon, const QDict *qdict)
 {
-    int prot, last_prot;
-    uint64_t l0, l1, l2, l3, l4;
-    uint64_t pml5e, pml4e, pdpe, pde, pte;
-    uint64_t pml5_addr, pml4_addr, pdp_addr, pd_addr, pt_addr, start, end;
-
-    pml5_addr = env->cr[3] & 0x3fffffffff000ULL;
-    last_prot = 0;
-    start = -1;
-    for (l0 = 0; l0 < 512; l0++) {
-        cpu_physical_memory_read(pml5_addr + l0 * 8, &pml5e, 8);
-        pml5e = le64_to_cpu(pml5e);
-        end = l0 << 48;
-        if (!(pml5e & PG_PRESENT_MASK)) {
-            prot = 0;
-            mem_print(mon, env, &start, &last_prot, end, prot);
-            continue;
-        }
-
-        pml4_addr = pml5e & 0x3fffffffff000ULL;
-        for (l1 = 0; l1 < 512; l1++) {
-            cpu_physical_memory_read(pml4_addr + l1 * 8, &pml4e, 8);
-            pml4e = le64_to_cpu(pml4e);
-            end = (l0 << 48) + (l1 << 39);
-            if (!(pml4e & PG_PRESENT_MASK)) {
-                prot = 0;
-                mem_print(mon, env, &start, &last_prot, end, prot);
-                continue;
-            }
+    struct mem_print_state state;
+    CPUState *cs = mon_get_cpu(mon);
 
-            pdp_addr = pml4e & 0x3fffffffff000ULL;
-            for (l2 = 0; l2 < 512; l2++) {
-                cpu_physical_memory_read(pdp_addr + l2 * 8, &pdpe, 8);
-                pdpe = le64_to_cpu(pdpe);
-                end = (l0 << 48) + (l1 << 39) + (l2 << 30);
-                if (pdpe & PG_PRESENT_MASK) {
-                    prot = 0;
-                    mem_print(mon, env, &start, &last_prot, end, prot);
-                    continue;
-                }
-
-                if (pdpe & PG_PSE_MASK) {
-                    prot = pdpe & (PG_USER_MASK | PG_RW_MASK |
-                            PG_PRESENT_MASK);
-                    prot &= pml5e & pml4e;
-                    mem_print(mon, env, &start, &last_prot, end, prot);
-                    continue;
-                }
-
-                pd_addr = pdpe & 0x3fffffffff000ULL;
-                for (l3 = 0; l3 < 512; l3++) {
-                    cpu_physical_memory_read(pd_addr + l3 * 8, &pde, 8);
-                    pde = le64_to_cpu(pde);
-                    end = (l0 << 48) + (l1 << 39) + (l2 << 30) + (l3 << 21);
-                    if (pde & PG_PRESENT_MASK) {
-                        prot = 0;
-                        mem_print(mon, env, &start, &last_prot, end, prot);
-                        continue;
-                    }
-
-                    if (pde & PG_PSE_MASK) {
-                        prot = pde & (PG_USER_MASK | PG_RW_MASK |
-                                PG_PRESENT_MASK);
-                        prot &= pml5e & pml4e & pdpe;
-                        mem_print(mon, env, &start, &last_prot, end, prot);
-                        continue;
-                    }
-
-                    pt_addr = pde & 0x3fffffffff000ULL;
-                    for (l4 = 0; l4 < 512; l4++) {
-                        cpu_physical_memory_read(pt_addr + l4 * 8, &pte, 8);
-                        pte = le64_to_cpu(pte);
-                        end = (l0 << 48) + (l1 << 39) + (l2 << 30) +
-                            (l3 << 21) + (l4 << 12);
-                        if (pte & PG_PRESENT_MASK) {
-                            prot = pte & (PG_USER_MASK | PG_RW_MASK |
-                                    PG_PRESENT_MASK);
-                            prot &= pml5e & pml4e & pdpe & pde;
-                        } else {
-                            prot = 0;
-                        }
-                        mem_print(mon, env, &start, &last_prot, end, prot);
-                    }
-                }
-            }
-        }
+    if (!cs) {
+        monitor_printf(mon, "Unable to get CPUState.  Internal error\n");
+        return;
     }
-    /* Flush last range */
-    mem_print(mon, env, &start, &last_prot, (hwaddr)1 << 57, 0);
-}
-#endif /* TARGET_X86_64 */
 
-void hmp_info_mem(Monitor *mon, const QDict *qdict)
-{
-    CPUArchState *env;
+    CPUClass *cc = CPU_GET_CLASS(cs);
 
-    env = mon_get_cpu_env(mon);
-    if (!env) {
-        monitor_printf(mon, "No CPU available\n");
-        return;
+    if ((!cc->sysemu_ops->pte_child)
+        || (!cc->sysemu_ops->pte_leaf)
+        || (!cc->sysemu_ops->pte_leaf_page_size)
+        || (!cc->sysemu_ops->page_table_entries_per_node)
+        || (!cc->sysemu_ops->pte_flags)
+        || (!cc->sysemu_ops->mon_print_mem)
+        || (!cc->sysemu_ops->mon_init_page_table_iterator)) {
+        monitor_printf(mon, "Info tlb unsupported on this ISA\n");
     }
 
-    if (!(env->cr[0] & CR0_PG_MASK)) {
-        monitor_printf(mon, "PG disabled\n");
+    if (!cc->sysemu_ops->mon_init_page_table_iterator(mon, &state)) {
+        monitor_printf(mon, "Unable to initialize page table iterator\n");
         return;
     }
-    if (env->cr[4] & CR4_PAE_MASK) {
-#ifdef TARGET_X86_64
-        if (env->hflags & HF_LMA_MASK) {
-            if (env->cr[4] & CR4_LA57_MASK) {
-                mem_info_la57(mon, env);
-            } else {
-                mem_info_la48(mon, env);
-            }
-        } else
-#endif
-        {
-            mem_info_pae32(mon, env);
-        }
-    } else {
-        mem_info_32(mon, env);
-    }
+
+    state.flusher = cc->sysemu_ops->mon_print_mem;
+
+    /**
+     * We must visit interior entries to update prot
+     */
+    for_each_pte(cs, &compressing_iterator, &state, true, false);
+
+    /* Flush the last entry, if needed */
+    cc->sysemu_ops->mon_print_mem(cs, &state);
 }
 
 void hmp_mce(Monitor *mon, const QDict *qdict)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 4/6] Convert x86_cpu_get_memory_mapping() to use generic iterators
  2024-06-06 14:02 [PATCH v3 0/6] Rework x86 page table walks Don Porter
                   ` (2 preceding siblings ...)
  2024-06-06 14:02 ` [PATCH v3 3/6] Convert 'info mem' " Don Porter
@ 2024-06-06 14:02 ` Don Porter
  2024-06-06 14:02 ` [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code Don Porter
  2024-06-06 14:02 ` [PATCH v3 6/6] Convert x86_mmu_translate() to use common code Don Porter
  5 siblings, 0 replies; 22+ messages in thread
From: Don Porter @ 2024-06-06 14:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson, philmd,
	Don Porter

Signed-off-by: Don Porter <porter@cs.unc.edu>
---
 target/i386/arch_memory_mapping.c | 320 ++++--------------------------
 1 file changed, 43 insertions(+), 277 deletions(-)

diff --git a/target/i386/arch_memory_mapping.c b/target/i386/arch_memory_mapping.c
index 562a00b5a7..b52e98133c 100644
--- a/target/i386/arch_memory_mapping.c
+++ b/target/i386/arch_memory_mapping.c
@@ -19,6 +19,7 @@
  ************** code hook implementations for x86 ***********
  */
 
+/* PAE Paging or IA-32e Paging */
 #define PML4_ADDR_MASK 0xffffffffff000ULL /* selects bits 51:12 */
 
 /**
@@ -365,301 +366,66 @@ x86_pte_child(CPUState *cs, PTE_t *pte, int height)
     return -1;
 }
 
-/* PAE Paging or IA-32e Paging */
-static void walk_pte(MemoryMappingList *list, AddressSpace *as,
-                     hwaddr pte_start_addr,
-                     int32_t a20_mask, target_ulong start_line_addr)
-{
-    hwaddr pte_addr, start_paddr;
-    uint64_t pte;
-    target_ulong start_vaddr;
-    int i;
-
-    for (i = 0; i < 512; i++) {
-        pte_addr = (pte_start_addr + i * 8) & a20_mask;
-        pte = address_space_ldq(as, pte_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-        if (!(pte & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
-        }
-
-        start_paddr = (pte & ~0xfff) & ~(0x1ULL << 63);
-        if (cpu_physical_memory_is_io(start_paddr)) {
-            /* I/O region */
-            continue;
-        }
-
-        start_vaddr = start_line_addr | ((i & 0x1ff) << 12);
-        memory_mapping_list_add_merge_sorted(list, start_paddr,
-                                             start_vaddr, 1 << 12);
-    }
-}
-
-/* 32-bit Paging */
-static void walk_pte2(MemoryMappingList *list, AddressSpace *as,
-                      hwaddr pte_start_addr, int32_t a20_mask,
-                      target_ulong start_line_addr)
-{
-    hwaddr pte_addr, start_paddr;
-    uint32_t pte;
-    target_ulong start_vaddr;
-    int i;
-
-    for (i = 0; i < 1024; i++) {
-        pte_addr = (pte_start_addr + i * 4) & a20_mask;
-        pte = address_space_ldl(as, pte_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-        if (!(pte & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
-        }
-
-        start_paddr = pte & ~0xfff;
-        if (cpu_physical_memory_is_io(start_paddr)) {
-            /* I/O region */
-            continue;
-        }
-
-        start_vaddr = start_line_addr | ((i & 0x3ff) << 12);
-        memory_mapping_list_add_merge_sorted(list, start_paddr,
-                                             start_vaddr, 1 << 12);
-    }
-}
-
-/* PAE Paging or IA-32e Paging */
-#define PLM4_ADDR_MASK 0xffffffffff000ULL /* selects bits 51:12 */
+/**
+ * Back to x86 hooks
+ */
+struct memory_mapping_data {
+    MemoryMappingList *list;
+};
 
-static void walk_pde(MemoryMappingList *list, AddressSpace *as,
-                     hwaddr pde_start_addr,
-                     int32_t a20_mask, target_ulong start_line_addr)
+static int add_memory_mapping_to_list(CPUState *cs, void *data, PTE_t *pte,
+                                      vaddr vaddr_in, int height,
+                                      int offset)
 {
-    hwaddr pde_addr, pte_start_addr, start_paddr;
-    uint64_t pde;
-    target_ulong line_addr, start_vaddr;
-    int i;
-
-    for (i = 0; i < 512; i++) {
-        pde_addr = (pde_start_addr + i * 8) & a20_mask;
-        pde = address_space_ldq(as, pde_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-        if (!(pde & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
-        }
-
-        line_addr = start_line_addr | ((i & 0x1ff) << 21);
-        if (pde & PG_PSE_MASK) {
-            /* 2 MB page */
-            start_paddr = (pde & ~0x1fffff) & ~(0x1ULL << 63);
-            if (cpu_physical_memory_is_io(start_paddr)) {
-                /* I/O region */
-                continue;
-            }
-            start_vaddr = line_addr;
-            memory_mapping_list_add_merge_sorted(list, start_paddr,
-                                                 start_vaddr, 1 << 21);
-            continue;
-        }
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
 
-        pte_start_addr = (pde & PLM4_ADDR_MASK) & a20_mask;
-        walk_pte(list, as, pte_start_addr, a20_mask, line_addr);
-    }
-}
+    struct memory_mapping_data *mm_data = (struct memory_mapping_data *) data;
 
-/* 32-bit Paging */
-static void walk_pde2(MemoryMappingList *list, AddressSpace *as,
-                      hwaddr pde_start_addr, int32_t a20_mask,
-                      bool pse)
-{
-    hwaddr pde_addr, pte_start_addr, start_paddr, high_paddr;
-    uint32_t pde;
-    target_ulong line_addr, start_vaddr;
-    int i;
-
-    for (i = 0; i < 1024; i++) {
-        pde_addr = (pde_start_addr + i * 4) & a20_mask;
-        pde = address_space_ldl(as, pde_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-        if (!(pde & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
+    hwaddr start_paddr = 0;
+    size_t pg_size = x86_pte_leaf_page_size(cs, height);
+    switch (height) {
+    case 1:
+        start_paddr = pte->pte64_t & ~0xfff;
+        if (env->cr[4] & CR4_PAE_MASK) {
+            start_paddr &= ~(0x1ULL << 63);
         }
-
-        line_addr = (((unsigned int)i & 0x3ff) << 22);
-        if ((pde & PG_PSE_MASK) && pse) {
+        break;
+    case 2:
+        if (env->cr[4] & CR4_PAE_MASK) {
+            start_paddr = (pte->pte64_t & ~0x1fffff) & ~(0x1ULL << 63);
+        } else {
+            assert(!!(env->cr[4] & CR4_PSE_MASK));
             /*
              * 4 MB page:
              * bits 39:32 are bits 20:13 of the PDE
              * bit3 31:22 are bits 31:22 of the PDE
              */
-            high_paddr = ((hwaddr)(pde & 0x1fe000) << 19);
-            start_paddr = (pde & ~0x3fffff) | high_paddr;
-            if (cpu_physical_memory_is_io(start_paddr)) {
-                /* I/O region */
-                continue;
-            }
-            start_vaddr = line_addr;
-            memory_mapping_list_add_merge_sorted(list, start_paddr,
-                                                 start_vaddr, 1 << 22);
-            continue;
-        }
-
-        pte_start_addr = (pde & ~0xfff) & a20_mask;
-        walk_pte2(list, as, pte_start_addr, a20_mask, line_addr);
-    }
-}
-
-/* PAE Paging */
-static void walk_pdpe2(MemoryMappingList *list, AddressSpace *as,
-                       hwaddr pdpe_start_addr, int32_t a20_mask)
-{
-    hwaddr pdpe_addr, pde_start_addr;
-    uint64_t pdpe;
-    target_ulong line_addr;
-    int i;
-
-    for (i = 0; i < 4; i++) {
-        pdpe_addr = (pdpe_start_addr + i * 8) & a20_mask;
-        pdpe = address_space_ldq(as, pdpe_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-        if (!(pdpe & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
-        }
-
-        line_addr = (((unsigned int)i & 0x3) << 30);
-        pde_start_addr = (pdpe & ~0xfff) & a20_mask;
-        walk_pde(list, as, pde_start_addr, a20_mask, line_addr);
-    }
-}
-
-#ifdef TARGET_X86_64
-/* IA-32e Paging */
-static void walk_pdpe(MemoryMappingList *list, AddressSpace *as,
-                      hwaddr pdpe_start_addr, int32_t a20_mask,
-                      target_ulong start_line_addr)
-{
-    hwaddr pdpe_addr, pde_start_addr, start_paddr;
-    uint64_t pdpe;
-    target_ulong line_addr, start_vaddr;
-    int i;
-
-    for (i = 0; i < 512; i++) {
-        pdpe_addr = (pdpe_start_addr + i * 8) & a20_mask;
-        pdpe = address_space_ldq(as, pdpe_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-        if (!(pdpe & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
-        }
-
-        line_addr = start_line_addr | ((i & 0x1ffULL) << 30);
-        if (pdpe & PG_PSE_MASK) {
-            /* 1 GB page */
-            start_paddr = (pdpe & ~0x3fffffff) & ~(0x1ULL << 63);
-            if (cpu_physical_memory_is_io(start_paddr)) {
-                /* I/O region */
-                continue;
-            }
-            start_vaddr = line_addr;
-            memory_mapping_list_add_merge_sorted(list, start_paddr,
-                                                 start_vaddr, 1 << 30);
-            continue;
+            hwaddr high_paddr = ((hwaddr)(pte->pte64_t & 0x1fe000) << 19);
+            start_paddr = (pte->pte64_t & ~0x3fffff) | high_paddr;
         }
-
-        pde_start_addr = (pdpe & PLM4_ADDR_MASK) & a20_mask;
-        walk_pde(list, as, pde_start_addr, a20_mask, line_addr);
+        break;
+    case 3:
+        /* Select bits 30--51 */
+        start_paddr = (pte->pte64_t & 0xfffffc0000000);
+        break;
+    default:
+        g_assert_not_reached();
     }
-}
-
-/* IA-32e Paging */
-static void walk_pml4e(MemoryMappingList *list, AddressSpace *as,
-                       hwaddr pml4e_start_addr, int32_t a20_mask,
-                       target_ulong start_line_addr)
-{
-    hwaddr pml4e_addr, pdpe_start_addr;
-    uint64_t pml4e;
-    target_ulong line_addr;
-    int i;
-
-    for (i = 0; i < 512; i++) {
-        pml4e_addr = (pml4e_start_addr + i * 8) & a20_mask;
-        pml4e = address_space_ldq(as, pml4e_addr, MEMTXATTRS_UNSPECIFIED,
-                                  NULL);
-        if (!(pml4e & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
-        }
 
-        line_addr = start_line_addr | ((i & 0x1ffULL) << 39);
-        pdpe_start_addr = (pml4e & PLM4_ADDR_MASK) & a20_mask;
-        walk_pdpe(list, as, pdpe_start_addr, a20_mask, line_addr);
+    /* This hook skips mappings for the I/O region */
+    if (cpu_physical_memory_is_io(start_paddr)) {
+        /* I/O region */
+        return 0;
     }
-}
 
-static void walk_pml5e(MemoryMappingList *list, AddressSpace *as,
-                       hwaddr pml5e_start_addr, int32_t a20_mask)
-{
-    hwaddr pml5e_addr, pml4e_start_addr;
-    uint64_t pml5e;
-    target_ulong line_addr;
-    int i;
-
-    for (i = 0; i < 512; i++) {
-        pml5e_addr = (pml5e_start_addr + i * 8) & a20_mask;
-        pml5e = address_space_ldq(as, pml5e_addr, MEMTXATTRS_UNSPECIFIED,
-                                  NULL);
-        if (!(pml5e & PG_PRESENT_MASK)) {
-            /* not present */
-            continue;
-        }
-
-        line_addr = (0x7fULL << 57) | ((i & 0x1ffULL) << 48);
-        pml4e_start_addr = (pml5e & PLM4_ADDR_MASK) & a20_mask;
-        walk_pml4e(list, as, pml4e_start_addr, a20_mask, line_addr);
-    }
+    memory_mapping_list_add_merge_sorted(mm_data->list, start_paddr,
+                                         vaddr_in, pg_size);
+    return 0;
 }
-#endif
 
 bool x86_cpu_get_memory_mapping(CPUState *cs, MemoryMappingList *list,
                                 Error **errp)
 {
-    X86CPU *cpu = X86_CPU(cs);
-    CPUX86State *env = &cpu->env;
-    int32_t a20_mask;
-
-    if (!cpu_paging_enabled(cs)) {
-        /* paging is disabled */
-        return true;
-    }
-
-    a20_mask = x86_get_a20_mask(env);
-    if (env->cr[4] & CR4_PAE_MASK) {
-#ifdef TARGET_X86_64
-        if (env->hflags & HF_LMA_MASK) {
-            if (env->cr[4] & CR4_LA57_MASK) {
-                hwaddr pml5e_addr;
-
-                pml5e_addr = (env->cr[3] & PLM4_ADDR_MASK) & a20_mask;
-                walk_pml5e(list, cs->as, pml5e_addr, a20_mask);
-            } else {
-                hwaddr pml4e_addr;
-
-                pml4e_addr = (env->cr[3] & PLM4_ADDR_MASK) & a20_mask;
-                walk_pml4e(list, cs->as, pml4e_addr, a20_mask,
-                        0xffffULL << 48);
-            }
-        } else
-#endif
-        {
-            hwaddr pdpe_addr;
-
-            pdpe_addr = (env->cr[3] & ~0x1f) & a20_mask;
-            walk_pdpe2(list, cs->as, pdpe_addr, a20_mask);
-        }
-    } else {
-        hwaddr pde_addr;
-        bool pse;
-
-        pde_addr = (env->cr[3] & ~0xfff) & a20_mask;
-        pse = !!(env->cr[4] & CR4_PSE_MASK);
-        walk_pde2(list, cs->as, pde_addr, a20_mask, pse);
-    }
-
-    return true;
+    return for_each_pte(cs, &add_memory_mapping_to_list, list, false, false);
 }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code.
  2024-06-06 14:02 [PATCH v3 0/6] Rework x86 page table walks Don Porter
                   ` (3 preceding siblings ...)
  2024-06-06 14:02 ` [PATCH v3 4/6] Convert x86_cpu_get_memory_mapping() to use generic iterators Don Porter
@ 2024-06-06 14:02 ` Don Porter
  2024-06-07  6:12   ` Philippe Mathieu-Daudé
  2024-06-07 17:03   ` Richard Henderson
  2024-06-06 14:02 ` [PATCH v3 6/6] Convert x86_mmu_translate() to use common code Don Porter
  5 siblings, 2 replies; 22+ messages in thread
From: Don Porter @ 2024-06-06 14:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson, philmd,
	Don Porter

Signed-off-by: Don Porter <porter@cs.unc.edu>
---
 target/i386/cpu.h                    |  42 ++
 target/i386/helper.c                 | 515 +++++++++++++++++++++++++
 target/i386/tcg/sysemu/excp_helper.c | 555 +--------------------------
 3 files changed, 562 insertions(+), 550 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 1e463cc556..2c7cfe7901 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2136,6 +2136,43 @@ struct X86CPUClass {
     ResettablePhases parent_phases;
 };
 
+
+typedef struct X86TranslateParams {
+    target_ulong addr;
+    target_ulong cr3;
+    int pg_mode;
+    int mmu_idx;
+    int ptw_idx;
+    MMUAccessType access_type;
+} X86TranslateParams;
+
+typedef struct X86TranslateResult {
+    hwaddr paddr;
+    int prot;
+    int page_size;
+} X86TranslateResult;
+
+typedef enum X86TranslateFaultStage2 {
+    S2_NONE,
+    S2_GPA,
+    S2_GPT,
+} X86TranslateFaultStage2;
+
+typedef struct X86TranslateFault {
+    int exception_index;
+    int error_code;
+    target_ulong cr2;
+    X86TranslateFaultStage2 stage2;
+} X86TranslateFault;
+
+typedef struct X86PTETranslate {
+    CPUX86State *env;
+    X86TranslateFault *err;
+    int ptw_idx;
+    void *haddr;
+    hwaddr gaddr;
+} X86PTETranslate;
+
 #ifndef CONFIG_USER_ONLY
 extern const VMStateDescription vmstate_x86_cpu;
 #endif
@@ -2180,6 +2217,11 @@ void x86_cpu_list(void);
 int cpu_x86_support_mca_broadcast(CPUX86State *env);
 
 #ifndef CONFIG_USER_ONLY
+bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
+                                  MMUAccessType access_type, int mmu_idx,
+                                  X86TranslateResult *out,
+                                  X86TranslateFault *err, uint64_t ra);
+
 hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cpu, vaddr addr,
                                          MemTxAttrs *attrs);
 int cpu_get_pic_interrupt(CPUX86State *s);
diff --git a/target/i386/helper.c b/target/i386/helper.c
index f9d1381f90..746570a442 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -26,6 +26,7 @@
 #include "sysemu/hw_accel.h"
 #include "monitor/monitor.h"
 #include "kvm/kvm_i386.h"
+#include "exec/cpu_ldst.h"
 #endif
 #include "qemu/log.h"
 #ifdef CONFIG_TCG
@@ -231,6 +232,520 @@ void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
 }
 
 #if !defined(CONFIG_USER_ONLY)
+
+static inline uint32_t ptw_ldl(const X86PTETranslate *in, uint64_t ra)
+{
+    if (likely(in->haddr)) {
+        return ldl_p(in->haddr);
+    }
+    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
+}
+
+static inline uint64_t ptw_ldq(const X86PTETranslate *in, uint64_t ra)
+{
+    if (likely(in->haddr)) {
+        return ldq_p(in->haddr);
+    }
+    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
+}
+/*
+ * Note that we can use a 32-bit cmpxchg for all page table entries,
+ * even 64-bit ones, because PG_PRESENT_MASK, PG_ACCESSED_MASK and
+ * PG_DIRTY_MASK are all in the low 32 bits.
+ */
+static bool ptw_setl_slow(const X86PTETranslate *in, uint32_t old, uint32_t new)
+{
+    uint32_t cmp;
+
+    /* Does x86 really perform a rmw cycle on mmio for ptw? */
+    start_exclusive();
+    cmp = cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+    if (cmp == old) {
+        cpu_stl_mmuidx_ra(in->env, in->gaddr, new, in->ptw_idx, 0);
+    }
+    end_exclusive();
+    return cmp == old;
+}
+
+static inline bool ptw_setl(const X86PTETranslate *in, uint32_t old,
+                            uint32_t set)
+{
+    if (set & ~old) {
+        uint32_t new = old | set;
+        if (likely(in->haddr)) {
+            old = cpu_to_le32(old);
+            new = cpu_to_le32(new);
+            return qatomic_cmpxchg((uint32_t *)in->haddr, old, new) == old;
+        }
+        return ptw_setl_slow(in, old, new);
+    }
+    return true;
+}
+
+
+static bool ptw_translate(X86PTETranslate *inout, hwaddr addr, uint64_t ra)
+{
+    CPUTLBEntryFull *full;
+    int flags;
+
+    inout->gaddr = addr;
+    flags = probe_access_full(inout->env, addr, 0, MMU_DATA_STORE,
+                              inout->ptw_idx, true, &inout->haddr, &full, ra);
+
+    if (unlikely(flags & TLB_INVALID_MASK)) {
+        X86TranslateFault *err = inout->err;
+
+        assert(inout->ptw_idx == MMU_NESTED_IDX);
+        *err = (X86TranslateFault){
+            .error_code = inout->env->error_code,
+            .cr2 = addr,
+            .stage2 = S2_GPT,
+        };
+        return false;
+    }
+    return true;
+}
+
+static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
+                              X86TranslateResult *out,
+                              X86TranslateFault *err, uint64_t ra)
+{
+    const target_ulong addr = in->addr;
+    const int pg_mode = in->pg_mode;
+    const bool is_user = is_mmu_index_user(in->mmu_idx);
+    const MMUAccessType access_type = in->access_type;
+    uint64_t ptep, pte, rsvd_mask;
+    X86PTETranslate pte_trans = {
+        .env = env,
+        .err = err,
+        .ptw_idx = in->ptw_idx,
+    };
+    hwaddr pte_addr, paddr;
+    uint32_t pkr;
+    int page_size;
+    int error_code;
+
+ restart_all:
+    rsvd_mask = ~MAKE_64BIT_MASK(0, env_archcpu(env)->phys_bits);
+    rsvd_mask &= PG_ADDRESS_MASK;
+    if (!(pg_mode & PG_MODE_NXE)) {
+        rsvd_mask |= PG_NX_MASK;
+    }
+
+    if (pg_mode & PG_MODE_PAE) {
+#ifdef TARGET_X86_64
+        if (pg_mode & PG_MODE_LMA) {
+            if (pg_mode & PG_MODE_LA57) {
+                /*
+                 * Page table level 5
+                 */
+                pte_addr = (in->cr3 & ~0xfff) + (((addr >> 48) & 0x1ff) << 3);
+                if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+                    return false;
+                }
+            restart_5:
+                pte = ptw_ldq(&pte_trans, ra);
+                if (!(pte & PG_PRESENT_MASK)) {
+                    goto do_fault;
+                }
+                if (pte & (rsvd_mask | PG_PSE_MASK)) {
+                    goto do_fault_rsvd;
+                }
+                if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
+                    goto restart_5;
+                }
+                ptep = pte ^ PG_NX_MASK;
+            } else {
+                pte = in->cr3;
+                ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
+            }
+
+            /*
+             * Page table level 4
+             */
+            pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 39) & 0x1ff) << 3);
+            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+                return false;
+            }
+        restart_4:
+            pte = ptw_ldq(&pte_trans, ra);
+            if (!(pte & PG_PRESENT_MASK)) {
+                goto do_fault;
+            }
+            if (pte & (rsvd_mask | PG_PSE_MASK)) {
+                goto do_fault_rsvd;
+            }
+            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
+                goto restart_4;
+            }
+            ptep &= pte ^ PG_NX_MASK;
+
+            /*
+             * Page table level 3
+             */
+            pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 30) & 0x1ff) << 3);
+            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+                return false;
+            }
+        restart_3_lma:
+            pte = ptw_ldq(&pte_trans, ra);
+            if (!(pte & PG_PRESENT_MASK)) {
+                goto do_fault;
+            }
+            if (pte & rsvd_mask) {
+                goto do_fault_rsvd;
+            }
+            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
+                goto restart_3_lma;
+            }
+            ptep &= pte ^ PG_NX_MASK;
+            if (pte & PG_PSE_MASK) {
+                /* 1 GB page */
+                page_size = 1024 * 1024 * 1024;
+                goto do_check_protect;
+            }
+        } else
+#endif
+        {
+            /*
+             * Page table level 3
+             */
+            pte_addr = (in->cr3 & 0xffffffe0ULL) + ((addr >> 27) & 0x18);
+            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+                return false;
+            }
+            rsvd_mask |= PG_HI_USER_MASK;
+        restart_3_nolma:
+            pte = ptw_ldq(&pte_trans, ra);
+            if (!(pte & PG_PRESENT_MASK)) {
+                goto do_fault;
+            }
+            if (pte & (rsvd_mask | PG_NX_MASK)) {
+                goto do_fault_rsvd;
+            }
+            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
+                goto restart_3_nolma;
+            }
+            ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
+        }
+
+        /*
+         * Page table level 2
+         */
+        pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 21) & 0x1ff) << 3);
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+            return false;
+        }
+    restart_2_pae:
+        pte = ptw_ldq(&pte_trans, ra);
+        if (!(pte & PG_PRESENT_MASK)) {
+            goto do_fault;
+        }
+        if (pte & rsvd_mask) {
+            goto do_fault_rsvd;
+        }
+        if (pte & PG_PSE_MASK) {
+            /* 2 MB page */
+            page_size = 2048 * 1024;
+            ptep &= pte ^ PG_NX_MASK;
+            goto do_check_protect;
+        }
+        if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
+            goto restart_2_pae;
+        }
+        ptep &= pte ^ PG_NX_MASK;
+
+        /*
+         * Page table level 1
+         */
+        pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 12) & 0x1ff) << 3);
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+            return false;
+        }
+        pte = ptw_ldq(&pte_trans, ra);
+        if (!(pte & PG_PRESENT_MASK)) {
+            goto do_fault;
+        }
+        if (pte & rsvd_mask) {
+            goto do_fault_rsvd;
+        }
+        /* combine pde and pte nx, user and rw protections */
+        ptep &= pte ^ PG_NX_MASK;
+        page_size = 4096;
+    } else {
+        /*
+         * Page table level 2
+         */
+        pte_addr = (in->cr3 & 0xfffff000ULL) + ((addr >> 20) & 0xffc);
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+            return false;
+        }
+    restart_2_nopae:
+        pte = ptw_ldl(&pte_trans, ra);
+        if (!(pte & PG_PRESENT_MASK)) {
+            goto do_fault;
+        }
+        ptep = pte | PG_NX_MASK;
+
+        /* if PSE bit is set, then we use a 4MB page */
+        if ((pte & PG_PSE_MASK) && (pg_mode & PG_MODE_PSE)) {
+            page_size = 4096 * 1024;
+            /*
+             * Bits 20-13 provide bits 39-32 of the address, bit 21 is reserved.
+             * Leave bits 20-13 in place for setting accessed/dirty bits below.
+             */
+            pte = (uint32_t)pte | ((pte & 0x1fe000LL) << (32 - 13));
+            rsvd_mask = 0x200000;
+            goto do_check_protect_pse36;
+        }
+        if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
+            goto restart_2_nopae;
+        }
+
+        /*
+         * Page table level 1
+         */
+        pte_addr = (pte & ~0xfffu) + ((addr >> 10) & 0xffc);
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
+            return false;
+        }
+        pte = ptw_ldl(&pte_trans, ra);
+        if (!(pte & PG_PRESENT_MASK)) {
+            goto do_fault;
+        }
+        /* combine pde and pte user and rw protections */
+        ptep &= pte | PG_NX_MASK;
+        page_size = 4096;
+        rsvd_mask = 0;
+    }
+
+do_check_protect:
+    rsvd_mask |= (page_size - 1) & PG_ADDRESS_MASK & ~PG_PSE_PAT_MASK;
+do_check_protect_pse36:
+    if (pte & rsvd_mask) {
+        goto do_fault_rsvd;
+    }
+    ptep ^= PG_NX_MASK;
+
+    /* can the page can be put in the TLB?  prot will tell us */
+    if (is_user && !(ptep & PG_USER_MASK)) {
+        goto do_fault_protect;
+    }
+
+    int prot = 0;
+    if (!is_mmu_index_smap(in->mmu_idx) || !(ptep & PG_USER_MASK)) {
+        prot |= PAGE_READ;
+        if ((ptep & PG_RW_MASK) || !(is_user || (pg_mode & PG_MODE_WP))) {
+            prot |= PAGE_WRITE;
+        }
+    }
+    if (!(ptep & PG_NX_MASK) &&
+        (is_user ||
+         !((pg_mode & PG_MODE_SMEP) && (ptep & PG_USER_MASK)))) {
+        prot |= PAGE_EXEC;
+    }
+
+    if (ptep & PG_USER_MASK) {
+        pkr = pg_mode & PG_MODE_PKE ? env->pkru : 0;
+    } else {
+        pkr = pg_mode & PG_MODE_PKS ? env->pkrs : 0;
+    }
+    if (pkr) {
+        uint32_t pk = (pte & PG_PKRU_MASK) >> PG_PKRU_BIT;
+        uint32_t pkr_ad = (pkr >> pk * 2) & 1;
+        uint32_t pkr_wd = (pkr >> pk * 2) & 2;
+        uint32_t pkr_prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
+
+        if (pkr_ad) {
+            pkr_prot &= ~(PAGE_READ | PAGE_WRITE);
+        } else if (pkr_wd && (is_user || (pg_mode & PG_MODE_WP))) {
+            pkr_prot &= ~PAGE_WRITE;
+        }
+        if ((pkr_prot & (1 << access_type)) == 0) {
+            goto do_fault_pk_protect;
+        }
+        prot &= pkr_prot;
+    }
+
+    if ((prot & (1 << access_type)) == 0) {
+        goto do_fault_protect;
+    }
+
+    /* yes, it can! */
+    {
+        uint32_t set = PG_ACCESSED_MASK;
+        if (access_type == MMU_DATA_STORE) {
+            set |= PG_DIRTY_MASK;
+        } else if (!(pte & PG_DIRTY_MASK)) {
+            /*
+             * Only set write access if already dirty...
+             * otherwise wait for dirty access.
+             */
+            prot &= ~PAGE_WRITE;
+        }
+        if (!ptw_setl(&pte_trans, pte, set)) {
+            /*
+             * We can arrive here from any of 3 levels and 2 formats.
+             * The only safe thing is to restart the entire lookup.
+             */
+            goto restart_all;
+        }
+    }
+
+    /* merge offset within page */
+    paddr = (pte & PG_ADDRESS_MASK & ~(page_size - 1)) |
+        (addr & (page_size - 1));
+
+    /*
+     * Note that NPT is walked (for both paging structures and final guest
+     * addresses) using the address with the A20 bit set.
+     */
+    if (in->ptw_idx == MMU_NESTED_IDX) {
+        CPUTLBEntryFull *full;
+        int flags, nested_page_size;
+
+        flags = probe_access_full(env, paddr, 0, access_type,
+                                  MMU_NESTED_IDX, true,
+                                  &pte_trans.haddr, &full, 0);
+        if (unlikely(flags & TLB_INVALID_MASK)) {
+            *err = (X86TranslateFault){
+                .error_code = env->error_code,
+                .cr2 = paddr,
+                .stage2 = S2_GPA,
+            };
+            return false;
+        }
+
+        /* Merge stage1 & stage2 protection bits. */
+        prot &= full->prot;
+
+        /* Re-verify resulting protection. */
+        if ((prot & (1 << access_type)) == 0) {
+            goto do_fault_protect;
+        }
+
+        /* Merge stage1 & stage2 addresses to final physical address. */
+        nested_page_size = 1 << full->lg_page_size;
+        paddr = (full->phys_addr & ~(nested_page_size - 1))
+              | (paddr & (nested_page_size - 1));
+
+        /*
+         * Use the larger of stage1 & stage2 page sizes, so that
+         * invalidation works.
+         */
+        if (nested_page_size > page_size) {
+            page_size = nested_page_size;
+        }
+    }
+
+    out->paddr = paddr & x86_get_a20_mask(env);
+    out->prot = prot;
+    out->page_size = page_size;
+    return true;
+
+ do_fault_rsvd:
+    error_code = PG_ERROR_RSVD_MASK;
+    goto do_fault_cont;
+ do_fault_protect:
+    error_code = PG_ERROR_P_MASK;
+    goto do_fault_cont;
+ do_fault_pk_protect:
+    assert(access_type != MMU_INST_FETCH);
+    error_code = PG_ERROR_PK_MASK | PG_ERROR_P_MASK;
+    goto do_fault_cont;
+ do_fault:
+    error_code = 0;
+ do_fault_cont:
+    if (is_user) {
+        error_code |= PG_ERROR_U_MASK;
+    }
+    switch (access_type) {
+    case MMU_DATA_LOAD:
+        break;
+    case MMU_DATA_STORE:
+        error_code |= PG_ERROR_W_MASK;
+        break;
+    case MMU_INST_FETCH:
+        if (pg_mode & (PG_MODE_NXE | PG_MODE_SMEP)) {
+            error_code |= PG_ERROR_I_D_MASK;
+        }
+        break;
+    }
+    *err = (X86TranslateFault){
+        .exception_index = EXCP0E_PAGE,
+        .error_code = error_code,
+        .cr2 = addr,
+    };
+    return false;
+}
+
+bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
+                                  MMUAccessType access_type, int mmu_idx,
+                                  X86TranslateResult *out,
+                                  X86TranslateFault *err, uint64_t ra)
+{
+    X86TranslateParams in;
+    bool use_stage2 = env->hflags2 & HF2_NPT_MASK;
+
+    in.addr = addr;
+    in.access_type = access_type;
+
+    switch (mmu_idx) {
+    case MMU_PHYS_IDX:
+        break;
+
+    case MMU_NESTED_IDX:
+        if (likely(use_stage2)) {
+            in.cr3 = env->nested_cr3;
+            in.pg_mode = env->nested_pg_mode;
+            in.mmu_idx =
+                env->nested_pg_mode & PG_MODE_LMA ?
+                MMU_USER64_IDX : MMU_USER32_IDX;
+            in.ptw_idx = MMU_PHYS_IDX;
+
+            if (!x86_mmu_translate(env, &in, out, err, ra)) {
+                err->stage2 = S2_GPA;
+                return false;
+            }
+            return true;
+        }
+        break;
+
+    default:
+        if (is_mmu_index_32(mmu_idx)) {
+            addr = (uint32_t)addr;
+        }
+
+        if (likely(env->cr[0] & CR0_PG_MASK)) {
+            in.cr3 = env->cr[3];
+            in.mmu_idx = mmu_idx;
+            in.ptw_idx = use_stage2 ? MMU_NESTED_IDX : MMU_PHYS_IDX;
+            in.pg_mode = get_pg_mode(env);
+
+            if (in.pg_mode & PG_MODE_LMA) {
+                /* test virtual address sign extension */
+                int shift = in.pg_mode & PG_MODE_LA57 ? 56 : 47;
+                int64_t sext = (int64_t)addr >> shift;
+                if (sext != 0 && sext != -1) {
+                    *err = (X86TranslateFault){
+                        .exception_index = EXCP0D_GPF,
+                        .cr2 = addr,
+                    };
+                    return false;
+                }
+            }
+            return x86_mmu_translate(env, &in, out, err, ra);
+        }
+        break;
+    }
+
+    /* No translation needed. */
+    out->paddr = addr & x86_get_a20_mask(env);
+    out->prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
+    out->page_size = TARGET_PAGE_SIZE;
+    return true;
+}
+
 hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cs, vaddr addr,
                                          MemTxAttrs *attrs)
 {
diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index 8fb05b1f53..4c48e5a68b 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -24,487 +24,7 @@
 #include "exec/page-protection.h"
 #include "tcg/helper-tcg.h"
 
-typedef struct TranslateParams {
-    target_ulong addr;
-    target_ulong cr3;
-    int pg_mode;
-    int mmu_idx;
-    int ptw_idx;
-    MMUAccessType access_type;
-} TranslateParams;
-
-typedef struct TranslateResult {
-    hwaddr paddr;
-    int prot;
-    int page_size;
-} TranslateResult;
-
-typedef enum TranslateFaultStage2 {
-    S2_NONE,
-    S2_GPA,
-    S2_GPT,
-} TranslateFaultStage2;
-
-typedef struct TranslateFault {
-    int exception_index;
-    int error_code;
-    target_ulong cr2;
-    TranslateFaultStage2 stage2;
-} TranslateFault;
-
-typedef struct PTETranslate {
-    CPUX86State *env;
-    TranslateFault *err;
-    int ptw_idx;
-    void *haddr;
-    hwaddr gaddr;
-} PTETranslate;
-
-static bool ptw_translate(PTETranslate *inout, hwaddr addr, uint64_t ra)
-{
-    CPUTLBEntryFull *full;
-    int flags;
-
-    inout->gaddr = addr;
-    flags = probe_access_full(inout->env, addr, 0, MMU_DATA_STORE,
-                              inout->ptw_idx, true, &inout->haddr, &full, ra);
-
-    if (unlikely(flags & TLB_INVALID_MASK)) {
-        TranslateFault *err = inout->err;
-
-        assert(inout->ptw_idx == MMU_NESTED_IDX);
-        *err = (TranslateFault){
-            .error_code = inout->env->error_code,
-            .cr2 = addr,
-            .stage2 = S2_GPT,
-        };
-        return false;
-    }
-    return true;
-}
-
-static inline uint32_t ptw_ldl(const PTETranslate *in, uint64_t ra)
-{
-    if (likely(in->haddr)) {
-        return ldl_p(in->haddr);
-    }
-    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
-}
-
-static inline uint64_t ptw_ldq(const PTETranslate *in, uint64_t ra)
-{
-    if (likely(in->haddr)) {
-        return ldq_p(in->haddr);
-    }
-    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
-}
-
-/*
- * Note that we can use a 32-bit cmpxchg for all page table entries,
- * even 64-bit ones, because PG_PRESENT_MASK, PG_ACCESSED_MASK and
- * PG_DIRTY_MASK are all in the low 32 bits.
- */
-static bool ptw_setl_slow(const PTETranslate *in, uint32_t old, uint32_t new)
-{
-    uint32_t cmp;
-
-    /* Does x86 really perform a rmw cycle on mmio for ptw? */
-    start_exclusive();
-    cmp = cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
-    if (cmp == old) {
-        cpu_stl_mmuidx_ra(in->env, in->gaddr, new, in->ptw_idx, 0);
-    }
-    end_exclusive();
-    return cmp == old;
-}
-
-static inline bool ptw_setl(const PTETranslate *in, uint32_t old, uint32_t set)
-{
-    if (set & ~old) {
-        uint32_t new = old | set;
-        if (likely(in->haddr)) {
-            old = cpu_to_le32(old);
-            new = cpu_to_le32(new);
-            return qatomic_cmpxchg((uint32_t *)in->haddr, old, new) == old;
-        }
-        return ptw_setl_slow(in, old, new);
-    }
-    return true;
-}
-
-static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
-                          TranslateResult *out, TranslateFault *err,
-                          uint64_t ra)
-{
-    const target_ulong addr = in->addr;
-    const int pg_mode = in->pg_mode;
-    const bool is_user = is_mmu_index_user(in->mmu_idx);
-    const MMUAccessType access_type = in->access_type;
-    uint64_t ptep, pte, rsvd_mask;
-    PTETranslate pte_trans = {
-        .env = env,
-        .err = err,
-        .ptw_idx = in->ptw_idx,
-    };
-    hwaddr pte_addr, paddr;
-    uint32_t pkr;
-    int page_size;
-    int error_code;
-
- restart_all:
-    rsvd_mask = ~MAKE_64BIT_MASK(0, env_archcpu(env)->phys_bits);
-    rsvd_mask &= PG_ADDRESS_MASK;
-    if (!(pg_mode & PG_MODE_NXE)) {
-        rsvd_mask |= PG_NX_MASK;
-    }
-
-    if (pg_mode & PG_MODE_PAE) {
-#ifdef TARGET_X86_64
-        if (pg_mode & PG_MODE_LMA) {
-            if (pg_mode & PG_MODE_LA57) {
-                /*
-                 * Page table level 5
-                 */
-                pte_addr = (in->cr3 & ~0xfff) + (((addr >> 48) & 0x1ff) << 3);
-                if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                    return false;
-                }
-            restart_5:
-                pte = ptw_ldq(&pte_trans, ra);
-                if (!(pte & PG_PRESENT_MASK)) {
-                    goto do_fault;
-                }
-                if (pte & (rsvd_mask | PG_PSE_MASK)) {
-                    goto do_fault_rsvd;
-                }
-                if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                    goto restart_5;
-                }
-                ptep = pte ^ PG_NX_MASK;
-            } else {
-                pte = in->cr3;
-                ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
-            }
-
-            /*
-             * Page table level 4
-             */
-            pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 39) & 0x1ff) << 3);
-            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                return false;
-            }
-        restart_4:
-            pte = ptw_ldq(&pte_trans, ra);
-            if (!(pte & PG_PRESENT_MASK)) {
-                goto do_fault;
-            }
-            if (pte & (rsvd_mask | PG_PSE_MASK)) {
-                goto do_fault_rsvd;
-            }
-            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                goto restart_4;
-            }
-            ptep &= pte ^ PG_NX_MASK;
-
-            /*
-             * Page table level 3
-             */
-            pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 30) & 0x1ff) << 3);
-            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                return false;
-            }
-        restart_3_lma:
-            pte = ptw_ldq(&pte_trans, ra);
-            if (!(pte & PG_PRESENT_MASK)) {
-                goto do_fault;
-            }
-            if (pte & rsvd_mask) {
-                goto do_fault_rsvd;
-            }
-            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                goto restart_3_lma;
-            }
-            ptep &= pte ^ PG_NX_MASK;
-            if (pte & PG_PSE_MASK) {
-                /* 1 GB page */
-                page_size = 1024 * 1024 * 1024;
-                goto do_check_protect;
-            }
-        } else
-#endif
-        {
-            /*
-             * Page table level 3
-             */
-            pte_addr = (in->cr3 & 0xffffffe0ULL) + ((addr >> 27) & 0x18);
-            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                return false;
-            }
-            rsvd_mask |= PG_HI_USER_MASK;
-        restart_3_nolma:
-            pte = ptw_ldq(&pte_trans, ra);
-            if (!(pte & PG_PRESENT_MASK)) {
-                goto do_fault;
-            }
-            if (pte & (rsvd_mask | PG_NX_MASK)) {
-                goto do_fault_rsvd;
-            }
-            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                goto restart_3_nolma;
-            }
-            ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
-        }
-
-        /*
-         * Page table level 2
-         */
-        pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 21) & 0x1ff) << 3);
-        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-            return false;
-        }
-    restart_2_pae:
-        pte = ptw_ldq(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
-            goto do_fault;
-        }
-        if (pte & rsvd_mask) {
-            goto do_fault_rsvd;
-        }
-        if (pte & PG_PSE_MASK) {
-            /* 2 MB page */
-            page_size = 2048 * 1024;
-            ptep &= pte ^ PG_NX_MASK;
-            goto do_check_protect;
-        }
-        if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-            goto restart_2_pae;
-        }
-        ptep &= pte ^ PG_NX_MASK;
-
-        /*
-         * Page table level 1
-         */
-        pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 12) & 0x1ff) << 3);
-        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-            return false;
-        }
-        pte = ptw_ldq(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
-            goto do_fault;
-        }
-        if (pte & rsvd_mask) {
-            goto do_fault_rsvd;
-        }
-        /* combine pde and pte nx, user and rw protections */
-        ptep &= pte ^ PG_NX_MASK;
-        page_size = 4096;
-    } else {
-        /*
-         * Page table level 2
-         */
-        pte_addr = (in->cr3 & 0xfffff000ULL) + ((addr >> 20) & 0xffc);
-        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-            return false;
-        }
-    restart_2_nopae:
-        pte = ptw_ldl(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
-            goto do_fault;
-        }
-        ptep = pte | PG_NX_MASK;
-
-        /* if PSE bit is set, then we use a 4MB page */
-        if ((pte & PG_PSE_MASK) && (pg_mode & PG_MODE_PSE)) {
-            page_size = 4096 * 1024;
-            /*
-             * Bits 20-13 provide bits 39-32 of the address, bit 21 is reserved.
-             * Leave bits 20-13 in place for setting accessed/dirty bits below.
-             */
-            pte = (uint32_t)pte | ((pte & 0x1fe000LL) << (32 - 13));
-            rsvd_mask = 0x200000;
-            goto do_check_protect_pse36;
-        }
-        if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-            goto restart_2_nopae;
-        }
-
-        /*
-         * Page table level 1
-         */
-        pte_addr = (pte & ~0xfffu) + ((addr >> 10) & 0xffc);
-        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-            return false;
-        }
-        pte = ptw_ldl(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
-            goto do_fault;
-        }
-        /* combine pde and pte user and rw protections */
-        ptep &= pte | PG_NX_MASK;
-        page_size = 4096;
-        rsvd_mask = 0;
-    }
-
-do_check_protect:
-    rsvd_mask |= (page_size - 1) & PG_ADDRESS_MASK & ~PG_PSE_PAT_MASK;
-do_check_protect_pse36:
-    if (pte & rsvd_mask) {
-        goto do_fault_rsvd;
-    }
-    ptep ^= PG_NX_MASK;
-
-    /* can the page can be put in the TLB?  prot will tell us */
-    if (is_user && !(ptep & PG_USER_MASK)) {
-        goto do_fault_protect;
-    }
-
-    int prot = 0;
-    if (!is_mmu_index_smap(in->mmu_idx) || !(ptep & PG_USER_MASK)) {
-        prot |= PAGE_READ;
-        if ((ptep & PG_RW_MASK) || !(is_user || (pg_mode & PG_MODE_WP))) {
-            prot |= PAGE_WRITE;
-        }
-    }
-    if (!(ptep & PG_NX_MASK) &&
-        (is_user ||
-         !((pg_mode & PG_MODE_SMEP) && (ptep & PG_USER_MASK)))) {
-        prot |= PAGE_EXEC;
-    }
-
-    if (ptep & PG_USER_MASK) {
-        pkr = pg_mode & PG_MODE_PKE ? env->pkru : 0;
-    } else {
-        pkr = pg_mode & PG_MODE_PKS ? env->pkrs : 0;
-    }
-    if (pkr) {
-        uint32_t pk = (pte & PG_PKRU_MASK) >> PG_PKRU_BIT;
-        uint32_t pkr_ad = (pkr >> pk * 2) & 1;
-        uint32_t pkr_wd = (pkr >> pk * 2) & 2;
-        uint32_t pkr_prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
-
-        if (pkr_ad) {
-            pkr_prot &= ~(PAGE_READ | PAGE_WRITE);
-        } else if (pkr_wd && (is_user || (pg_mode & PG_MODE_WP))) {
-            pkr_prot &= ~PAGE_WRITE;
-        }
-        if ((pkr_prot & (1 << access_type)) == 0) {
-            goto do_fault_pk_protect;
-        }
-        prot &= pkr_prot;
-    }
-
-    if ((prot & (1 << access_type)) == 0) {
-        goto do_fault_protect;
-    }
-
-    /* yes, it can! */
-    {
-        uint32_t set = PG_ACCESSED_MASK;
-        if (access_type == MMU_DATA_STORE) {
-            set |= PG_DIRTY_MASK;
-        } else if (!(pte & PG_DIRTY_MASK)) {
-            /*
-             * Only set write access if already dirty...
-             * otherwise wait for dirty access.
-             */
-            prot &= ~PAGE_WRITE;
-        }
-        if (!ptw_setl(&pte_trans, pte, set)) {
-            /*
-             * We can arrive here from any of 3 levels and 2 formats.
-             * The only safe thing is to restart the entire lookup.
-             */
-            goto restart_all;
-        }
-    }
-
-    /* merge offset within page */
-    paddr = (pte & PG_ADDRESS_MASK & ~(page_size - 1)) | (addr & (page_size - 1));
-
-    /*
-     * Note that NPT is walked (for both paging structures and final guest
-     * addresses) using the address with the A20 bit set.
-     */
-    if (in->ptw_idx == MMU_NESTED_IDX) {
-        CPUTLBEntryFull *full;
-        int flags, nested_page_size;
-
-        flags = probe_access_full(env, paddr, 0, access_type,
-                                  MMU_NESTED_IDX, true,
-                                  &pte_trans.haddr, &full, 0);
-        if (unlikely(flags & TLB_INVALID_MASK)) {
-            *err = (TranslateFault){
-                .error_code = env->error_code,
-                .cr2 = paddr,
-                .stage2 = S2_GPA,
-            };
-            return false;
-        }
-
-        /* Merge stage1 & stage2 protection bits. */
-        prot &= full->prot;
-
-        /* Re-verify resulting protection. */
-        if ((prot & (1 << access_type)) == 0) {
-            goto do_fault_protect;
-        }
-
-        /* Merge stage1 & stage2 addresses to final physical address. */
-        nested_page_size = 1 << full->lg_page_size;
-        paddr = (full->phys_addr & ~(nested_page_size - 1))
-              | (paddr & (nested_page_size - 1));
-
-        /*
-         * Use the larger of stage1 & stage2 page sizes, so that
-         * invalidation works.
-         */
-        if (nested_page_size > page_size) {
-            page_size = nested_page_size;
-        }
-    }
-
-    out->paddr = paddr & x86_get_a20_mask(env);
-    out->prot = prot;
-    out->page_size = page_size;
-    return true;
-
- do_fault_rsvd:
-    error_code = PG_ERROR_RSVD_MASK;
-    goto do_fault_cont;
- do_fault_protect:
-    error_code = PG_ERROR_P_MASK;
-    goto do_fault_cont;
- do_fault_pk_protect:
-    assert(access_type != MMU_INST_FETCH);
-    error_code = PG_ERROR_PK_MASK | PG_ERROR_P_MASK;
-    goto do_fault_cont;
- do_fault:
-    error_code = 0;
- do_fault_cont:
-    if (is_user) {
-        error_code |= PG_ERROR_U_MASK;
-    }
-    switch (access_type) {
-    case MMU_DATA_LOAD:
-        break;
-    case MMU_DATA_STORE:
-        error_code |= PG_ERROR_W_MASK;
-        break;
-    case MMU_INST_FETCH:
-        if (pg_mode & (PG_MODE_NXE | PG_MODE_SMEP)) {
-            error_code |= PG_ERROR_I_D_MASK;
-        }
-        break;
-    }
-    *err = (TranslateFault){
-        .exception_index = EXCP0E_PAGE,
-        .error_code = error_code,
-        .cr2 = addr,
-    };
-    return false;
-}
-
-static G_NORETURN void raise_stage2(CPUX86State *env, TranslateFault *err,
+static G_NORETURN void raise_stage2(CPUX86State *env, X86TranslateFault *err,
                                     uintptr_t retaddr)
 {
     uint64_t exit_info_1 = err->error_code;
@@ -526,82 +46,17 @@ static G_NORETURN void raise_stage2(CPUX86State *env, TranslateFault *err,
     cpu_vmexit(env, SVM_EXIT_NPF, exit_info_1, retaddr);
 }
 
-static bool get_physical_address(CPUX86State *env, vaddr addr,
-                                 MMUAccessType access_type, int mmu_idx,
-                                 TranslateResult *out, TranslateFault *err,
-                                 uint64_t ra)
-{
-    TranslateParams in;
-    bool use_stage2 = env->hflags2 & HF2_NPT_MASK;
-
-    in.addr = addr;
-    in.access_type = access_type;
-
-    switch (mmu_idx) {
-    case MMU_PHYS_IDX:
-        break;
-
-    case MMU_NESTED_IDX:
-        if (likely(use_stage2)) {
-            in.cr3 = env->nested_cr3;
-            in.pg_mode = env->nested_pg_mode;
-            in.mmu_idx =
-                env->nested_pg_mode & PG_MODE_LMA ? MMU_USER64_IDX : MMU_USER32_IDX;
-            in.ptw_idx = MMU_PHYS_IDX;
-
-            if (!mmu_translate(env, &in, out, err, ra)) {
-                err->stage2 = S2_GPA;
-                return false;
-            }
-            return true;
-        }
-        break;
-
-    default:
-        if (is_mmu_index_32(mmu_idx)) {
-            addr = (uint32_t)addr;
-        }
-
-        if (likely(env->cr[0] & CR0_PG_MASK)) {
-            in.cr3 = env->cr[3];
-            in.mmu_idx = mmu_idx;
-            in.ptw_idx = use_stage2 ? MMU_NESTED_IDX : MMU_PHYS_IDX;
-            in.pg_mode = get_pg_mode(env);
-
-            if (in.pg_mode & PG_MODE_LMA) {
-                /* test virtual address sign extension */
-                int shift = in.pg_mode & PG_MODE_LA57 ? 56 : 47;
-                int64_t sext = (int64_t)addr >> shift;
-                if (sext != 0 && sext != -1) {
-                    *err = (TranslateFault){
-                        .exception_index = EXCP0D_GPF,
-                        .cr2 = addr,
-                    };
-                    return false;
-                }
-            }
-            return mmu_translate(env, &in, out, err, ra);
-        }
-        break;
-    }
-
-    /* No translation needed. */
-    out->paddr = addr & x86_get_a20_mask(env);
-    out->prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
-    out->page_size = TARGET_PAGE_SIZE;
-    return true;
-}
 
 bool x86_cpu_tlb_fill(CPUState *cs, vaddr addr, int size,
                       MMUAccessType access_type, int mmu_idx,
                       bool probe, uintptr_t retaddr)
 {
     CPUX86State *env = cpu_env(cs);
-    TranslateResult out;
-    TranslateFault err;
+    X86TranslateResult out;
+    X86TranslateFault err;
 
-    if (get_physical_address(env, addr, access_type, mmu_idx, &out, &err,
-                             retaddr)) {
+    if (x86_cpu_get_physical_address(env, addr, access_type, mmu_idx, &out,
+                                     &err, retaddr)) {
         /*
          * Even if 4MB pages, we map only one 4KB page in the cache to
          * avoid filling it too fast.
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 6/6] Convert x86_mmu_translate() to use common code.
  2024-06-06 14:02 [PATCH v3 0/6] Rework x86 page table walks Don Porter
                   ` (4 preceding siblings ...)
  2024-06-06 14:02 ` [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code Don Porter
@ 2024-06-06 14:02 ` Don Porter
  2024-06-07 17:28   ` Richard Henderson
  5 siblings, 1 reply; 22+ messages in thread
From: Don Porter @ 2024-06-06 14:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson, philmd,
	Don Porter

Signed-off-by: Don Porter <porter@cs.unc.edu>
---
 target/i386/arch_memory_mapping.c    |  44 +++-
 target/i386/cpu.h                    |   5 +-
 target/i386/helper.c                 | 374 +++++++--------------------
 target/i386/tcg/sysemu/excp_helper.c |   2 +-
 4 files changed, 129 insertions(+), 296 deletions(-)

diff --git a/target/i386/arch_memory_mapping.c b/target/i386/arch_memory_mapping.c
index b52e98133c..bccd290b9f 100644
--- a/target/i386/arch_memory_mapping.c
+++ b/target/i386/arch_memory_mapping.c
@@ -228,9 +228,38 @@ static void _mmu_decode_va_parameters(CPUState *cs, int height,
 }
 
 /**
- * get_pte - Copy the contents of the page table entry at node[i] into pt_entry.
- *           Optionally, add the relevant bits to the virtual address in
- *           vaddr_pte.
+ * x86_virtual_to_pte_index - Given a virtual address and height in
+ *       the page table radix tree, return the index that should be
+ *       used to look up the next page table entry (pte) in
+ *       translating an address.
+ *
+ * @cs - CPU state
+ * @vaddr - The virtual address to translate
+ * @height - height of node within the tree (leaves are 1, not 0).
+ *
+ * Example: In 32-bit x86 page tables, the virtual address is split
+ * into 10 bits at height 2, 10 bits at height 1, and 12 offset bits.
+ * So a call with VA and height 2 would return the first 10 bits of va,
+ * right shifted by 22.
+ */
+
+int x86_virtual_to_pte_index(CPUState *cs, target_ulong vaddr, int height)
+{
+    int shift = 0;
+    int width = 0;
+    int mask = 0;
+
+    _mmu_decode_va_parameters(cs, height, &shift, &width);
+
+    mask = (1 << width) - 1;
+
+    return (vaddr >> shift) & mask;
+}
+
+/**
+ * x86_get_pte - Copy the contents of the page table entry at node[i]
+ *               into pt_entry.  Optionally, add the relevant bits to
+ *               the virtual address in vaddr_pte.
  *
  * @cs - CPU state
  * @node - physical address of the current page table node
@@ -249,7 +278,6 @@ void
 x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
             PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
             hwaddr *pte_paddr)
-
 {
     X86CPU *cpu = X86_CPU(cs);
     CPUX86State *env = &cpu->env;
@@ -282,8 +310,8 @@ x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
 }
 
 
-static bool
-mmu_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
+bool
+x86_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
 {
     X86CPU *cpu = X86_CPU(cs);
     CPUX86State *env = &cpu->env;
@@ -300,7 +328,7 @@ mmu_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
 bool
 x86_pte_present(CPUState *cs, PTE_t *pte)
 {
-    return mmu_pte_check_bits(cs, pte, PG_PRESENT_MASK);
+    return x86_pte_check_bits(cs, pte, PG_PRESENT_MASK);
 }
 
 /**
@@ -312,7 +340,7 @@ x86_pte_present(CPUState *cs, PTE_t *pte)
 bool
 x86_pte_leaf(CPUState *cs, int height, PTE_t *pte)
 {
-    return height == 1 || mmu_pte_check_bits(cs, pte, PG_PSE_MASK);
+    return height == 1 || x86_pte_check_bits(cs, pte, PG_PSE_MASK);
 }
 
 /**
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 2c7cfe7901..978841a624 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2198,6 +2198,8 @@ bool x86_pte_present(CPUState *cs, PTE_t *pte);
 bool x86_pte_leaf(CPUState *cs, int height, PTE_t *pte);
 hwaddr x86_pte_child(CPUState *cs, PTE_t *pte, int height);
 uint64_t x86_pte_flags(uint64_t pte);
+bool x86_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask);
+int x86_virtual_to_pte_index(CPUState *cs, target_ulong vaddr, int height);
 bool x86_cpu_get_memory_mapping(CPUState *cpu, MemoryMappingList *list,
                                 Error **errp);
 bool x86_mon_init_page_table_iterator(Monitor *mon,
@@ -2220,7 +2222,8 @@ int cpu_x86_support_mca_broadcast(CPUX86State *env);
 bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
                                   MMUAccessType access_type, int mmu_idx,
                                   X86TranslateResult *out,
-                                  X86TranslateFault *err, uint64_t ra);
+                                  X86TranslateFault *err, uint64_t ra,
+                                  bool read_only);
 
 hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cpu, vaddr addr,
                                          MemTxAttrs *attrs);
diff --git a/target/i386/helper.c b/target/i386/helper.c
index 746570a442..4e5467ee57 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -308,7 +308,8 @@ static bool ptw_translate(X86PTETranslate *inout, hwaddr addr, uint64_t ra)
 
 static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
                               X86TranslateResult *out,
-                              X86TranslateFault *err, uint64_t ra)
+                              X86TranslateFault *err, uint64_t ra,
+                              bool read_only)
 {
     const target_ulong addr = in->addr;
     const int pg_mode = in->pg_mode;
@@ -324,6 +325,10 @@ static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
     uint32_t pkr;
     int page_size;
     int error_code;
+    CPUState *cs = env_cpu(env);
+    int height;
+    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
+    bool long_mode_enabled = env->hflags & HF_LMA_MASK;
 
  restart_all:
     rsvd_mask = ~MAKE_64BIT_MASK(0, env_archcpu(env)->phys_bits);
@@ -332,194 +337,89 @@ static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
         rsvd_mask |= PG_NX_MASK;
     }
 
-    if (pg_mode & PG_MODE_PAE) {
-#ifdef TARGET_X86_64
-        if (pg_mode & PG_MODE_LMA) {
-            if (pg_mode & PG_MODE_LA57) {
-                /*
-                 * Page table level 5
-                 */
-                pte_addr = (in->cr3 & ~0xfff) + (((addr >> 48) & 0x1ff) << 3);
-                if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                    return false;
-                }
-            restart_5:
-                pte = ptw_ldq(&pte_trans, ra);
-                if (!(pte & PG_PRESENT_MASK)) {
-                    goto do_fault;
-                }
-                if (pte & (rsvd_mask | PG_PSE_MASK)) {
-                    goto do_fault_rsvd;
-                }
-                if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                    goto restart_5;
-                }
-                ptep = pte ^ PG_NX_MASK;
-            } else {
-                pte = in->cr3;
-                ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
-            }
+    /* Get the root of the page table */
 
-            /*
-             * Page table level 4
-             */
-            pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 39) & 0x1ff) << 3);
-            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                return false;
-            }
-        restart_4:
-            pte = ptw_ldq(&pte_trans, ra);
-            if (!(pte & PG_PRESENT_MASK)) {
-                goto do_fault;
-            }
-            if (pte & (rsvd_mask | PG_PSE_MASK)) {
-                goto do_fault_rsvd;
-            }
-            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                goto restart_4;
-            }
-            ptep &= pte ^ PG_NX_MASK;
+    /*
+     * ptep is really an accumulator for the permission bits.
+     * Thus, the xor-ing totally trashes the high bits, and that is
+     * ok - we only care about the low ones.
+     */
+    ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
+    hwaddr pt_node = x86_page_table_root(cs, &height);
 
-            /*
-             * Page table level 3
-             */
-            pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 30) & 0x1ff) << 3);
-            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                return false;
-            }
-        restart_3_lma:
-            pte = ptw_ldq(&pte_trans, ra);
-            if (!(pte & PG_PRESENT_MASK)) {
-                goto do_fault;
-            }
-            if (pte & rsvd_mask) {
-                goto do_fault_rsvd;
-            }
-            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                goto restart_3_lma;
-            }
-            ptep &= pte ^ PG_NX_MASK;
-            if (pte & PG_PSE_MASK) {
-                /* 1 GB page */
-                page_size = 1024 * 1024 * 1024;
-                goto do_check_protect;
-            }
-        } else
-#endif
-        {
-            /*
-             * Page table level 3
-             */
-            pte_addr = (in->cr3 & 0xffffffe0ULL) + ((addr >> 27) & 0x18);
-            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-                return false;
-            }
-            rsvd_mask |= PG_HI_USER_MASK;
-        restart_3_nolma:
-            pte = ptw_ldq(&pte_trans, ra);
-            if (!(pte & PG_PRESENT_MASK)) {
-                goto do_fault;
-            }
-            if (pte & (rsvd_mask | PG_NX_MASK)) {
-                goto do_fault_rsvd;
-            }
-            if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-                goto restart_3_nolma;
-            }
-            ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
-        }
+    /* Special case for PAE paging */
+    if (height == 3 && pg_mode & PG_MODE_PAE) {
+        rsvd_mask |= PG_HI_USER_MASK;
+    }
 
-        /*
-         * Page table level 2
-         */
-        pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 21) & 0x1ff) << 3);
+    int i = height;
+    do {
+        int index = x86_virtual_to_pte_index(cs, addr, i);
+        PTE_t pt_entry;
+        uint64_t my_rsvd_mask = rsvd_mask;
+
+        x86_get_pte(cs, pt_node, index, i, &pt_entry, 0, NULL, &pte_addr);
+        /* Check that we can access the page table entry */
         if (!ptw_translate(&pte_trans, pte_addr, ra)) {
             return false;
         }
-    restart_2_pae:
-        pte = ptw_ldq(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
+
+    restart:
+        if (!x86_pte_present(cs, &pt_entry)) {
             goto do_fault;
         }
-        if (pte & rsvd_mask) {
-            goto do_fault_rsvd;
-        }
-        if (pte & PG_PSE_MASK) {
-            /* 2 MB page */
-            page_size = 2048 * 1024;
-            ptep &= pte ^ PG_NX_MASK;
-            goto do_check_protect;
-        }
-        if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-            goto restart_2_pae;
-        }
-        ptep &= pte ^ PG_NX_MASK;
 
-        /*
-         * Page table level 1
-         */
-        pte_addr = (pte & PG_ADDRESS_MASK) + (((addr >> 12) & 0x1ff) << 3);
-        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-            return false;
+        /* For height > 3, check and reject PSE mask */
+        if (i > 3) {
+            my_rsvd_mask |= PG_PSE_MASK;
         }
-        pte = ptw_ldq(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
-            goto do_fault;
-        }
-        if (pte & rsvd_mask) {
+
+        if (x86_pte_check_bits(cs, &pt_entry, my_rsvd_mask)) {
             goto do_fault_rsvd;
         }
-        /* combine pde and pte nx, user and rw protections */
-        ptep &= pte ^ PG_NX_MASK;
-        page_size = 4096;
-    } else {
-        /*
-         * Page table level 2
-         */
-        pte_addr = (in->cr3 & 0xfffff000ULL) + ((addr >> 20) & 0xffc);
-        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-            return false;
-        }
-    restart_2_nopae:
-        pte = ptw_ldl(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
-            goto do_fault;
-        }
-        ptep = pte | PG_NX_MASK;
 
-        /* if PSE bit is set, then we use a 4MB page */
-        if ((pte & PG_PSE_MASK) && (pg_mode & PG_MODE_PSE)) {
-            page_size = 4096 * 1024;
-            /*
-             * Bits 20-13 provide bits 39-32 of the address, bit 21 is reserved.
-             * Leave bits 20-13 in place for setting accessed/dirty bits below.
-             */
-            pte = (uint32_t)pte | ((pte & 0x1fe000LL) << (32 - 13));
-            rsvd_mask = 0x200000;
-            goto do_check_protect_pse36;
-        }
-        if (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK)) {
-            goto restart_2_nopae;
+        if (long_mode_enabled) {
+            pte = pt_entry.pte64_t;
+        } else {
+            pte = pt_entry.pte32_t;
         }
 
-        /*
-         * Page table level 1
-         */
-        pte_addr = (pte & ~0xfffu) + ((addr >> 10) & 0xffc);
-        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
-            return false;
+        /* Check if we have hit a leaf.  Won't happen (yet) at heights > 3. */
+        if (x86_pte_leaf(cs, i, &pt_entry)) {
+            assert(i < 4);
+            page_size = x86_pte_leaf_page_size(cs, i);
+            ptep &= pte ^ PG_NX_MASK;
+
+            if (!pae_enabled) {
+                if (i == 2) {
+                    /*
+                     * Bits 20-13 provide bits 39-32 of the address,
+                     * bit 21 is reserved.  Leave bits 20-13 in place
+                     * for setting accessed/dirty bits below.
+                     */
+                    pte = (uint32_t)pte | ((pte & 0x1fe000LL) << (32 - 13));
+                    rsvd_mask = 0x200000;
+                    goto do_check_protect_pse36;
+                } else if (i == 1) {
+                    rsvd_mask = 0;
+                }
+            }
+            break; /* goto do_check_protect; */
         }
-        pte = ptw_ldl(&pte_trans, ra);
-        if (!(pte & PG_PRESENT_MASK)) {
-            goto do_fault;
+
+        if ((!read_only) &&
+            (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK))) {
+            goto restart;
         }
-        /* combine pde and pte user and rw protections */
-        ptep &= pte | PG_NX_MASK;
-        page_size = 4096;
-        rsvd_mask = 0;
-    }
 
-do_check_protect:
+        ptep &= pte ^ PG_NX_MASK;
+
+        /* Move to the child node */
+        assert(i > 1);
+        pt_node = x86_pte_child(cs, &pt_entry, i - 1);
+        i--;
+    } while (i > 0);
+
     rsvd_mask |= (page_size - 1) & PG_ADDRESS_MASK & ~PG_PSE_PAT_MASK;
 do_check_protect_pse36:
     if (pte & rsvd_mask) {
@@ -679,10 +579,16 @@ do_check_protect_pse36:
     return false;
 }
 
+/**
+ * The read-only argument indicates whether this access should
+ * trigger exceptions or otherwise disrupt TLB/MMU state.
+ * It should be true for monitor access, and false for tcg access.
+ */
 bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
                                   MMUAccessType access_type, int mmu_idx,
                                   X86TranslateResult *out,
-                                  X86TranslateFault *err, uint64_t ra)
+                                  X86TranslateFault *err, uint64_t ra,
+                                  bool read_only)
 {
     X86TranslateParams in;
     bool use_stage2 = env->hflags2 & HF2_NPT_MASK;
@@ -703,7 +609,7 @@ bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
                 MMU_USER64_IDX : MMU_USER32_IDX;
             in.ptw_idx = MMU_PHYS_IDX;
 
-            if (!x86_mmu_translate(env, &in, out, err, ra)) {
+            if (!x86_mmu_translate(env, &in, out, err, ra, read_only)) {
                 err->stage2 = S2_GPA;
                 return false;
             }
@@ -734,7 +640,7 @@ bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
                     return false;
                 }
             }
-            return x86_mmu_translate(env, &in, out, err, ra);
+            return x86_mmu_translate(env, &in, out, err, ra, read_only);
         }
         break;
     }
@@ -751,123 +657,19 @@ hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cs, vaddr addr,
 {
     X86CPU *cpu = X86_CPU(cs);
     CPUX86State *env = &cpu->env;
-    target_ulong pde_addr, pte_addr;
-    uint64_t pte;
-    int32_t a20_mask;
-    uint32_t page_offset;
-    int page_size;
+    X86TranslateResult out;
+    X86TranslateFault err;
 
     *attrs = cpu_get_mem_attrs(env);
 
-    a20_mask = x86_get_a20_mask(env);
-    if (!(env->cr[0] & CR0_PG_MASK)) {
-        pte = addr & a20_mask;
-        page_size = 4096;
-    } else if (env->cr[4] & CR4_PAE_MASK) {
-        target_ulong pdpe_addr;
-        uint64_t pde, pdpe;
-
-#ifdef TARGET_X86_64
-        if (env->hflags & HF_LMA_MASK) {
-            bool la57 = env->cr[4] & CR4_LA57_MASK;
-            uint64_t pml5e_addr, pml5e;
-            uint64_t pml4e_addr, pml4e;
-            int32_t sext;
-
-            /* test virtual address sign extension */
-            sext = la57 ? (int64_t)addr >> 56 : (int64_t)addr >> 47;
-            if (sext != 0 && sext != -1) {
-                return -1;
-            }
-
-            if (la57) {
-                pml5e_addr = ((env->cr[3] & ~0xfff) +
-                        (((addr >> 48) & 0x1ff) << 3)) & a20_mask;
-                pml5e = x86_ldq_phys(cs, pml5e_addr);
-                if (!(pml5e & PG_PRESENT_MASK)) {
-                    return -1;
-                }
-            } else {
-                pml5e = env->cr[3];
-            }
-
-            pml4e_addr = ((pml5e & PG_ADDRESS_MASK) +
-                    (((addr >> 39) & 0x1ff) << 3)) & a20_mask;
-            pml4e = x86_ldq_phys(cs, pml4e_addr);
-            if (!(pml4e & PG_PRESENT_MASK)) {
-                return -1;
-            }
-            pdpe_addr = ((pml4e & PG_ADDRESS_MASK) +
-                         (((addr >> 30) & 0x1ff) << 3)) & a20_mask;
-            pdpe = x86_ldq_phys(cs, pdpe_addr);
-            if (!(pdpe & PG_PRESENT_MASK)) {
-                return -1;
-            }
-            if (pdpe & PG_PSE_MASK) {
-                page_size = 1024 * 1024 * 1024;
-                pte = pdpe;
-                goto out;
-            }
-
-        } else
-#endif
-        {
-            pdpe_addr = ((env->cr[3] & ~0x1f) + ((addr >> 27) & 0x18)) &
-                a20_mask;
-            pdpe = x86_ldq_phys(cs, pdpe_addr);
-            if (!(pdpe & PG_PRESENT_MASK))
-                return -1;
-        }
-
-        pde_addr = ((pdpe & PG_ADDRESS_MASK) +
-                    (((addr >> 21) & 0x1ff) << 3)) & a20_mask;
-        pde = x86_ldq_phys(cs, pde_addr);
-        if (!(pde & PG_PRESENT_MASK)) {
-            return -1;
-        }
-        if (pde & PG_PSE_MASK) {
-            /* 2 MB page */
-            page_size = 2048 * 1024;
-            pte = pde;
-        } else {
-            /* 4 KB page */
-            pte_addr = ((pde & PG_ADDRESS_MASK) +
-                        (((addr >> 12) & 0x1ff) << 3)) & a20_mask;
-            page_size = 4096;
-            pte = x86_ldq_phys(cs, pte_addr);
-        }
-        if (!(pte & PG_PRESENT_MASK)) {
-            return -1;
-        }
-    } else {
-        uint32_t pde;
-
-        /* page directory entry */
-        pde_addr = ((env->cr[3] & ~0xfff) + ((addr >> 20) & 0xffc)) & a20_mask;
-        pde = x86_ldl_phys(cs, pde_addr);
-        if (!(pde & PG_PRESENT_MASK))
-            return -1;
-        if ((pde & PG_PSE_MASK) && (env->cr[4] & CR4_PSE_MASK)) {
-            pte = pde | ((pde & 0x1fe000LL) << (32 - 13));
-            page_size = 4096 * 1024;
-        } else {
-            /* page directory entry */
-            pte_addr = ((pde & ~0xfff) + ((addr >> 10) & 0xffc)) & a20_mask;
-            pte = x86_ldl_phys(cs, pte_addr);
-            if (!(pte & PG_PRESENT_MASK)) {
-                return -1;
-            }
-            page_size = 4096;
-        }
-        pte = pte & a20_mask;
+    /* This function merges the offset bits for us */
+    if (!x86_cpu_get_physical_address(env, addr, MMU_DATA_LOAD,
+                                      cpu_mmu_index(cs, false),
+                                      &out, &err, 0, true)) {
+        return -1;
     }
 
-#ifdef TARGET_X86_64
-out:
-#endif
-    pte &= PG_ADDRESS_MASK & ~(page_size - 1);
-    page_offset = (addr & TARGET_PAGE_MASK) & (page_size - 1);
-    return pte | page_offset;
+    return out.paddr;
 }
 
 typedef struct MCEInjectionParams {
diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index 4c48e5a68b..c85db11f05 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -56,7 +56,7 @@ bool x86_cpu_tlb_fill(CPUState *cs, vaddr addr, int size,
     X86TranslateFault err;
 
     if (x86_cpu_get_physical_address(env, addr, access_type, mmu_idx, &out,
-                                     &err, retaddr)) {
+                                     &err, retaddr, false)) {
         /*
          * Even if 4MB pages, we map only one 4KB page in the cache to
          * avoid filling it too fast.
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 2/6] Convert 'info tlb' to use generic iterator
  2024-06-06 14:02 ` [PATCH v3 2/6] Convert 'info tlb' to use generic iterator Don Porter
@ 2024-06-07  6:02   ` Philippe Mathieu-Daudé
  2024-06-10  9:13     ` Daniel P. Berrangé
  0 siblings, 1 reply; 22+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-06-07  6:02 UTC (permalink / raw)
  To: Don Porter, qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson,
	Daniel P. Berrangé

Hi Don,

(Cc'ing Daniel for HumanReadableText)

On 6/6/24 16:02, Don Porter wrote:
> Signed-off-by: Don Porter <porter@cs.unc.edu>
> ---
>   include/hw/core/sysemu-cpu-ops.h |   7 +
>   monitor/hmp-cmds-target.c        |   1 +
>   target/i386/cpu.h                |   2 +
>   target/i386/monitor.c            | 217 ++++++-------------------------
>   4 files changed, 53 insertions(+), 174 deletions(-)
> 
> diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
> index eb16a1c3e2..bf3de3e004 100644
> --- a/include/hw/core/sysemu-cpu-ops.h
> +++ b/include/hw/core/sysemu-cpu-ops.h
> @@ -243,6 +243,13 @@ typedef struct SysemuCPUOps {
>       bool (*mon_flush_page_print_state)(CPUState *cs,
>                                          struct mem_print_state *state);
>   
> +    /**
> +     * @mon_print_pte: Hook called by the monitor to print a page
> +     * table entry at address addr, with contents pte.
> +     */
> +    void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
> +                           hwaddr pte);

IMO the SysemuCPUOps prototype should not use the monitor and return
a HumanReadableText:

       HumanReadableText *(*mon_print_pte)(CPUArchState *env,
                                           hwaddr addr, hwaddr pte);

Then define a QMP handler, itself registered to the monitor using
monitor_register_hmp_info_hrt().

Otherwise the cleanup is nice!

Regards,

Phil.

> +
>   } SysemuCPUOps;
>   
>   #endif /* SYSEMU_CPU_OPS_H */
> diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
> index 60a8bd0c37..3393e5ad0b 100644
> --- a/monitor/hmp-cmds-target.c
> +++ b/monitor/hmp-cmds-target.c
> @@ -318,6 +318,7 @@ void hmp_info_pg(Monitor *mon, const QDict *qdict)
>       /* Print last entry, if one present */
>       cc->sysemu_ops->mon_flush_page_print_state(cs, &state);
>   }
> +
>   static void memory_dump(Monitor *mon, int count, int format, int wsize,
>                           hwaddr addr, int is_physical)
>   {
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index cbb6f6fc4d..1346ec0033 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -2167,6 +2167,8 @@ bool x86_mon_init_page_table_iterator(Monitor *mon,
>                                         struct mem_print_state *state);
>   void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state);
>   bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
> +void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
> +                       hwaddr pte);
>   
>   void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
>   
> diff --git a/target/i386/monitor.c b/target/i386/monitor.c
> index 65e82e73e8..ecde164857 100644
> --- a/target/i386/monitor.c
> +++ b/target/i386/monitor.c
> @@ -214,202 +214,71 @@ static hwaddr addr_canonical(CPUArchState *env, hwaddr addr)
>       return addr;
>   }
>   
> -static void print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
> -                      hwaddr pte, hwaddr mask)
> +void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
> +                       hwaddr pte)
>   {
> +    char buf[128];
> +    char *pos = buf, *end = buf + sizeof(buf);
> +
>       addr = addr_canonical(env, addr);
>   
> -    monitor_printf(mon, HWADDR_FMT_plx ": " HWADDR_FMT_plx
> -                   " %c%c%c%c%c%c%c%c%c\n",
> -                   addr,
> -                   pte & mask,
> -                   pte & PG_NX_MASK ? 'X' : '-',
> -                   pte & PG_GLOBAL_MASK ? 'G' : '-',
> -                   pte & PG_PSE_MASK ? 'P' : '-',
> -                   pte & PG_DIRTY_MASK ? 'D' : '-',
> -                   pte & PG_ACCESSED_MASK ? 'A' : '-',
> -                   pte & PG_PCD_MASK ? 'C' : '-',
> -                   pte & PG_PWT_MASK ? 'T' : '-',
> -                   pte & PG_USER_MASK ? 'U' : '-',
> -                   pte & PG_RW_MASK ? 'W' : '-');
> -}
> +    pos += snprintf(pos, end - pos, HWADDR_FMT_plx ": " HWADDR_FMT_plx " ",
> +                    addr, (hwaddr) (pte & PG_ADDRESS_MASK));
>   
> -static void tlb_info_32(Monitor *mon, CPUArchState *env)
> -{
> -    unsigned int l1, l2;
> -    uint32_t pgd, pde, pte;
> +    pos += snprintf(pos, end - pos, " %s", pg_bits(pte));
>   
> -    pgd = env->cr[3] & ~0xfff;
> -    for(l1 = 0; l1 < 1024; l1++) {
> -        cpu_physical_memory_read(pgd + l1 * 4, &pde, 4);
> -        pde = le32_to_cpu(pde);
> -        if (pde & PG_PRESENT_MASK) {
> -            if ((pde & PG_PSE_MASK) && (env->cr[4] & CR4_PSE_MASK)) {
> -                /* 4M pages */
> -                print_pte(mon, env, (l1 << 22), pde, ~((1 << 21) - 1));
> -            } else {
> -                for(l2 = 0; l2 < 1024; l2++) {
> -                    cpu_physical_memory_read((pde & ~0xfff) + l2 * 4, &pte, 4);
> -                    pte = le32_to_cpu(pte);
> -                    if (pte & PG_PRESENT_MASK) {
> -                        print_pte(mon, env, (l1 << 22) + (l2 << 12),
> -                                  pte & ~PG_PSE_MASK,
> -                                  ~0xfff);
> -                    }
> -                }
> -            }
> -        }
> +    /* Trim line to fit screen */
> +    if (pos - buf > 79) {
> +        strcpy(buf + 77, "..");
>       }
> -}
>   
> -static void tlb_info_pae32(Monitor *mon, CPUArchState *env)
> -{
> -    unsigned int l1, l2, l3;
> -    uint64_t pdpe, pde, pte;
> -    uint64_t pdp_addr, pd_addr, pt_addr;
> -
> -    pdp_addr = env->cr[3] & ~0x1f;
> -    for (l1 = 0; l1 < 4; l1++) {
> -        cpu_physical_memory_read(pdp_addr + l1 * 8, &pdpe, 8);
> -        pdpe = le64_to_cpu(pdpe);
> -        if (pdpe & PG_PRESENT_MASK) {
> -            pd_addr = pdpe & 0x3fffffffff000ULL;
> -            for (l2 = 0; l2 < 512; l2++) {
> -                cpu_physical_memory_read(pd_addr + l2 * 8, &pde, 8);
> -                pde = le64_to_cpu(pde);
> -                if (pde & PG_PRESENT_MASK) {
> -                    if (pde & PG_PSE_MASK) {
> -                        /* 2M pages with PAE, CR4.PSE is ignored */
> -                        print_pte(mon, env, (l1 << 30) + (l2 << 21), pde,
> -                                  ~((hwaddr)(1 << 20) - 1));
> -                    } else {
> -                        pt_addr = pde & 0x3fffffffff000ULL;
> -                        for (l3 = 0; l3 < 512; l3++) {
> -                            cpu_physical_memory_read(pt_addr + l3 * 8, &pte, 8);
> -                            pte = le64_to_cpu(pte);
> -                            if (pte & PG_PRESENT_MASK) {
> -                                print_pte(mon, env, (l1 << 30) + (l2 << 21)
> -                                          + (l3 << 12),
> -                                          pte & ~PG_PSE_MASK,
> -                                          ~(hwaddr)0xfff);
> -                            }
> -                        }
> -                    }
> -                }
> -            }
> -        }
> -    }
> +    monitor_printf(mon, "%s\n", buf);
>   }
>   
> -#ifdef TARGET_X86_64
> -static void tlb_info_la48(Monitor *mon, CPUArchState *env,
> -        uint64_t l0, uint64_t pml4_addr)
> +static
> +int mem_print_tlb(CPUState *cs, void *data, PTE_t *pte,
> +                  vaddr vaddr_in, int height, int offset)
>   {
> -    uint64_t l1, l2, l3, l4;
> -    uint64_t pml4e, pdpe, pde, pte;
> -    uint64_t pdp_addr, pd_addr, pt_addr;
> -
> -    for (l1 = 0; l1 < 512; l1++) {
> -        cpu_physical_memory_read(pml4_addr + l1 * 8, &pml4e, 8);
> -        pml4e = le64_to_cpu(pml4e);
> -        if (!(pml4e & PG_PRESENT_MASK)) {
> -            continue;
> -        }
> +    struct mem_print_state *state = (struct mem_print_state *) data;
> +    CPUClass *cc = CPU_GET_CLASS(cs);
>   
> -        pdp_addr = pml4e & 0x3fffffffff000ULL;
> -        for (l2 = 0; l2 < 512; l2++) {
> -            cpu_physical_memory_read(pdp_addr + l2 * 8, &pdpe, 8);
> -            pdpe = le64_to_cpu(pdpe);
> -            if (!(pdpe & PG_PRESENT_MASK)) {
> -                continue;
> -            }
> -
> -            if (pdpe & PG_PSE_MASK) {
> -                /* 1G pages, CR4.PSE is ignored */
> -                print_pte(mon, env, (l0 << 48) + (l1 << 39) + (l2 << 30),
> -                        pdpe, 0x3ffffc0000000ULL);
> -                continue;
> -            }
> -
> -            pd_addr = pdpe & 0x3fffffffff000ULL;
> -            for (l3 = 0; l3 < 512; l3++) {
> -                cpu_physical_memory_read(pd_addr + l3 * 8, &pde, 8);
> -                pde = le64_to_cpu(pde);
> -                if (!(pde & PG_PRESENT_MASK)) {
> -                    continue;
> -                }
> -
> -                if (pde & PG_PSE_MASK) {
> -                    /* 2M pages, CR4.PSE is ignored */
> -                    print_pte(mon, env, (l0 << 48) + (l1 << 39) + (l2 << 30) +
> -                            (l3 << 21), pde, 0x3ffffffe00000ULL);
> -                    continue;
> -                }
> -
> -                pt_addr = pde & 0x3fffffffff000ULL;
> -                for (l4 = 0; l4 < 512; l4++) {
> -                    cpu_physical_memory_read(pt_addr
> -                            + l4 * 8,
> -                            &pte, 8);
> -                    pte = le64_to_cpu(pte);
> -                    if (pte & PG_PRESENT_MASK) {
> -                        print_pte(mon, env, (l0 << 48) + (l1 << 39) +
> -                                (l2 << 30) + (l3 << 21) + (l4 << 12),
> -                                pte & ~PG_PSE_MASK, 0x3fffffffff000ULL);
> -                    }
> -                }
> -            }
> -        }
> -    }
> +    cc->sysemu_ops->mon_print_pte(state->mon, state->env, vaddr_in,
> +                                  pte->pte64_t);
> +    return 0;
>   }
>   
> -static void tlb_info_la57(Monitor *mon, CPUArchState *env)
> +void hmp_info_tlb(Monitor *mon, const QDict *qdict)
>   {
> -    uint64_t l0;
> -    uint64_t pml5e;
> -    uint64_t pml5_addr;
> +    struct mem_print_state state;
>   
> -    pml5_addr = env->cr[3] & 0x3fffffffff000ULL;
> -    for (l0 = 0; l0 < 512; l0++) {
> -        cpu_physical_memory_read(pml5_addr + l0 * 8, &pml5e, 8);
> -        pml5e = le64_to_cpu(pml5e);
> -        if (pml5e & PG_PRESENT_MASK) {
> -            tlb_info_la48(mon, env, l0, pml5e & 0x3fffffffff000ULL);
> -        }
> +    CPUState *cs = mon_get_cpu(mon);
> +    if (!cs) {
> +        monitor_printf(mon, "Unable to get CPUState.  Internal error\n");
> +        return;
>       }
> -}
> -#endif /* TARGET_X86_64 */
>   
> -void hmp_info_tlb(Monitor *mon, const QDict *qdict)
> -{
> -    CPUArchState *env;
> +    CPUClass *cc = CPU_GET_CLASS(cs);
>   
> -    env = mon_get_cpu_env(mon);
> -    if (!env) {
> -        monitor_printf(mon, "No CPU available\n");
> -        return;
> +    if ((!cc->sysemu_ops->pte_child)
> +        || (!cc->sysemu_ops->pte_leaf)
> +        || (!cc->sysemu_ops->pte_leaf_page_size)
> +        || (!cc->sysemu_ops->page_table_entries_per_node)
> +        || (!cc->sysemu_ops->pte_flags)
> +        || (!cc->sysemu_ops->mon_print_pte)
> +        || (!cc->sysemu_ops->mon_init_page_table_iterator)) {
> +        monitor_printf(mon, "Info tlb unsupported on this ISA\n");
>       }
>   
> -    if (!(env->cr[0] & CR0_PG_MASK)) {
> -        monitor_printf(mon, "PG disabled\n");
> +    if (!cc->sysemu_ops->mon_init_page_table_iterator(mon, &state)) {
> +        monitor_printf(mon, "Unable to initialize page table iterator\n");
>           return;
>       }
> -    if (env->cr[4] & CR4_PAE_MASK) {
> -#ifdef TARGET_X86_64
> -        if (env->hflags & HF_LMA_MASK) {
> -            if (env->cr[4] & CR4_LA57_MASK) {
> -                tlb_info_la57(mon, env);
> -            } else {
> -                tlb_info_la48(mon, env, 0, env->cr[3] & 0x3fffffffff000ULL);
> -            }
> -        } else
> -#endif
> -        {
> -            tlb_info_pae32(mon, env);
> -        }
> -    } else {
> -        tlb_info_32(mon, env);
> -    }
> +
> +    /**
> +     * 'info tlb' visits only leaf PTEs marked present.
> +     * It does not check other protection bits.
> +     */
> +    for_each_pte(cs, &mem_print_tlb, &state, false, false);
>   }
>   
>   static void mem_print(Monitor *mon, CPUArchState *env,



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
@ 2024-06-07  6:09   ` Philippe Mathieu-Daudé
  2024-06-07  7:16   ` Daniel P. Berrangé
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-06-07  6:09 UTC (permalink / raw)
  To: Don Porter, qemu-devel
  Cc: dave, peter.maydell, nadav.amit, richard.henderson,
	Daniel P. Berrangé

On 6/6/24 16:02, Don Porter wrote:
> The new "info pg" monitor command prints the current page table,
> including virtual address ranges, flag bits, and snippets of physical
> page numbers.  Completely filled regions of the page table with
> compatible flags are "folded", with the result that the complete
> output for a freshly booted x86-64 Linux VM can fit in a single
> terminal window.  The output looks like this:
> 
> VPN range             Entry         Flags            Physical page
> [7f0000000-7f0000000] PML4[0fe]     ---DA--UWP
>    [7f28c0000-7f28fffff]  PDP[0a3]     ---DA--UWP
>      [7f28c4600-7f28c47ff]  PDE[023]     ---DA--UWP
>        [7f28c4655-7f28c4656]  PTE[055-056] X--D---U-P 0000007f14-0000007f15
>        [7f28c465b-7f28c465b]  PTE[05b]     ----A--U-P 0000001cfc
> ...
> [ff8000000-ff8000000] PML4[1ff]     ---DA--UWP
>    [ffff80000-ffffbffff]  PDP[1fe]     ---DA---WP
>      [ffff81000-ffff81dff]  PDE[008-00e] -GSDA---WP 0000001000-0000001dff
>    [ffffc0000-fffffffff]  PDP[1ff]     ---DA--UWP
>      [ffffff400-ffffff5ff]  PDE[1fa]     ---DA--UWP
>        [ffffff5fb-ffffff5fc]  PTE[1fb-1fc] XG-DACT-WP 00000fec00 00000fee00
>      [ffffff600-ffffff7ff]  PDE[1fb]     ---DA--UWP
>        [ffffff600-ffffff600]  PTE[000]     -G-DA--U-P 0000001467
> 
> This draws heavy inspiration from Austin Clements' original patch.
> 
> This also adds a generic page table walker, which other monitor
> and execution commands will be migrated to in subsequent patches.
> 
> Signed-off-by: Don Porter <porter@cs.unc.edu>
> ---
>   hmp-commands-info.hx              |  13 ++
>   hw/core/cpu-sysemu.c              | 140 ++++++++++++
>   include/hw/core/cpu.h             |  34 ++-
>   include/hw/core/sysemu-cpu-ops.h  | 156 +++++++++++++
>   include/monitor/hmp-target.h      |   1 +
>   monitor/hmp-cmds-target.c         | 198 +++++++++++++++++
>   target/i386/arch_memory_mapping.c | 351 +++++++++++++++++++++++++++++-
>   target/i386/cpu.c                 |  11 +
>   target/i386/cpu.h                 |  15 ++
>   target/i386/monitor.c             | 165 ++++++++++++++
>   10 files changed, 1082 insertions(+), 2 deletions(-)
> 
> diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
> index 20a9835ea8..a873841920 100644
> --- a/hmp-commands-info.hx
> +++ b/hmp-commands-info.hx
> @@ -242,6 +242,19 @@ SRST
>       Show memory tree.
>   ERST
>   
> +    {
> +        .name       = "pg",
> +        .args_type  = "",
> +        .params     = "",
> +        .help       = "show the page table",
> +        .cmd        = hmp_info_pg,
> +    },
> +
> +SRST
> +  ``info pg``
> +    Show the active page table.
> +ERST
> +
>   #if defined(CONFIG_TCG)
>       {
>           .name       = "jit",
> diff --git a/hw/core/cpu-sysemu.c b/hw/core/cpu-sysemu.c
> index 2a9a2a4eb5..fd936fa90c 100644
> --- a/hw/core/cpu-sysemu.c
> +++ b/hw/core/cpu-sysemu.c
> @@ -142,3 +142,143 @@ GuestPanicInformation *cpu_get_crash_info(CPUState *cpu)
>       }
>       return res;
>   }
> +
> +/**
> + * _for_each_pte - recursive helper function
> + *
> + * @cs - CPU state
> + * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
> + *                                     pte.
> + *   * @cs - pass through cs
> + *   * @data - user-provided, opaque pointer
> + *   * @pte - current pte
> + *   * @vaddr_in - virtual address translated by pte
> + *   * @height - height in the tree of pte
> + * @data - user-provided, opaque pointer, passed to fn()
> + * @visit_interior_nodes - if true, call fn() on page table entries in
> + *                         interior nodes.  If false, only call fn() on page
> + *                         table entries in leaves.
> + * @visit_not_present - if true, call fn() on entries that are not present.
> + *                         if false, visit only present entries.
> + * @node - The physical address of the current page table radix tree node
> + * @vaddr_in - The virtual address bits translated in walking the page
> + *          table to node
> + * @height - The height of node in the radix tree
> + *
> + * height starts at the max and counts down.
> + * In a 4 level x86 page table, pml4e is level 4, pdpe is level 3,
> + *  pde is level 2, and pte is level 1
> + *
> + * Returns true on success, false on error.
> + */
> +static bool
> +_for_each_pte(CPUState *cs,

Please avoid '_' prefix.

> +              int (*fn)(CPUState *cs, void *data, PTE_t *pte,
> +                        vaddr vaddr_in, int height, int offset),
> +              void *data, bool visit_interior_nodes,
> +              bool visit_not_present, hwaddr node,
> +              vaddr vaddr_in, int height)
> +{
> +    int ptes_per_node;
> +    int i;
> +
> +    assert(height > 0);
> +
> +    CPUClass *cc = CPU_GET_CLASS(cs);
> +
> +    if ((!cc->sysemu_ops->page_table_entries_per_node)
> +        || (!cc->sysemu_ops->get_pte)
> +        || (!cc->sysemu_ops->pte_present)
> +        || (!cc->sysemu_ops->pte_leaf)
> +        || (!cc->sysemu_ops->pte_child)) {
> +        return false;

Since this function has local scope, it shouldn't be called with
any of these unset. If you are unsure, we can assert() on them.

> +    }
> +
> +    ptes_per_node = cc->sysemu_ops->page_table_entries_per_node(cs, height);
> +
> +    for (i = 0; i < ptes_per_node; i++) {
> +        PTE_t pt_entry;
> +        vaddr vaddr_i;
> +        bool pte_present;
> +
> +        cc->sysemu_ops->get_pte(cs, node, i, height, &pt_entry, vaddr_in,
> +                                &vaddr_i, NULL);
> +        pte_present = cc->sysemu_ops->pte_present(cs, &pt_entry);
> +
> +        if (pte_present || visit_not_present) {
> +            if ((!pte_present) || cc->sysemu_ops->pte_leaf(cs, height,
> +                                                           &pt_entry)) {
> +                if (fn(cs, data, &pt_entry, vaddr_i, height, i)) {
> +                    /* Error */
> +                    return false;
> +                }
> +            } else { /* Non-leaf */
> +                if (visit_interior_nodes) {
> +                    if (fn(cs, data, &pt_entry, vaddr_i, height, i)) {
> +                        /* Error */
> +                        return false;
> +                    }
> +                }
> +                hwaddr child = cc->sysemu_ops->pte_child(cs, &pt_entry, height);
> +                assert(height > 1);
> +                if (!_for_each_pte(cs, fn, data, visit_interior_nodes,
> +                                   visit_not_present, child, vaddr_i,
> +                                   height - 1)) {
> +                    return false;
> +                }
> +            }
> +        }
> +    }
> +
> +    return true;
> +}
> +
> +/**
> + * for_each_pte - iterate over a page table, and
> + *                call fn on each entry
> + *
> + * @cs - CPU state
> + * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
> + *                                     pte.
> + *   * @cs - pass through cs
> + *   * @data - user-provided, opaque pointer
> + *   * @pte - current pte
> + *   * @vaddr - virtual address translated by pte
> + *   * @height - height in the tree of pte
> + * @data - opaque pointer; passed through to fn
> + * @visit_interior_nodes - if true, call fn() on interior entries in
> + *                         page table; if false, visit only leaf entries.
> + * @visit_not_present - if true, call fn() on entries that are not present.
> + *                         if false, visit only present entries.
> + *
> + * Returns true on success, false on error.
> + *
> + */
> +bool for_each_pte(CPUState *cs,
> +                  int (*fn)(CPUState *cs, void *data, PTE_t *pte,
> +                            vaddr vaddr, int height, int offset),
> +                  void *data, bool visit_interior_nodes,
> +                  bool visit_not_present)
> +{
> +    int height;
> +    vaddr vaddr = 0;
> +    hwaddr root;
> +    CPUClass *cc = CPU_GET_CLASS(cs);
> +
> +    if (!cpu_paging_enabled(cs)) {
> +        /* paging is disabled */
> +        return true;
> +    }
> +
> +    if (!cc->sysemu_ops->page_table_root) {
> +        return false;
> +    }
> +
> +    root = cc->sysemu_ops->page_table_root(cs, &height);
> +
> +    assert(height > 1);
> +
> +    /* Recursively call a helper to walk the page table */
> +    return _for_each_pte(cs, fn, data, visit_interior_nodes, visit_not_present,
> +                         root, vaddr, height);
> +}
> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
> index a2c8536943..00d7162795 100644
> --- a/include/hw/core/cpu.h
> +++ b/include/hw/core/cpu.h
> @@ -671,9 +671,41 @@ int cpu_write_elf32_qemunote(WriteCoreDumpFunction f, CPUState *cpu,
>    * Caller is responsible for freeing the data.
>    */
>   GuestPanicInformation *cpu_get_crash_info(CPUState *cpu);
> -
>   #endif /* !CONFIG_USER_ONLY */
>   
> +/* Intended to become a generic PTE type */
> +typedef union PTE {
> +    uint64_t pte64_t;
> +    uint32_t pte32_t;
> +} PTE_t;
> +
> +/**
> + * for_each_pte - iterate over a page table, and
> + *                call fn on each entry
> + *
> + * @cs - CPU state
> + * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
> + *                                     pte.
> + *   * @cs - pass through cs
> + *   * @data - user-provided, opaque pointer
> + *   * @pte - current pte
> + *   * @vaddr - virtual address translated by pte
> + *   * @height - height in the tree of pte
> + * @data - opaque pointer; passed through to fn
> + * @visit_interior_nodes - if true, call fn() on interior entries in
> + *                         page table; if false, visit only leaf entries.
> + * @visit_not_present - if true, call fn() on entries that are not present.
> + *                         if false, visit only present entries.
> + *
> + * Returns true on success, false on error.
> + *
> + */
> +bool for_each_pte(CPUState *cs,
> +                  int (*fn)(CPUState *cs, void *data, PTE_t *pte, vaddr vaddr,
> +                            int height, int offset), void *data,
> +                  bool visit_interior_nodes, bool visit_not_present);
> +
> +
>   /**
>    * CPUDumpFlags:
>    * @CPU_DUMP_CODE:
> diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
> index 24d003fe04..eb16a1c3e2 100644
> --- a/include/hw/core/sysemu-cpu-ops.h
> +++ b/include/hw/core/sysemu-cpu-ops.h
> @@ -12,6 +12,39 @@
>   
>   #include "hw/core/cpu.h"
>   
> +/* Maximum supported page table height - currently x86 at 5 */
> +#define MAX_HEIGHT 5
> +
> +/*
> + * struct mem_print_state: Used by monitor in walking page tables.
> + */
> +struct mem_print_state {
> +    Monitor *mon;

I'd rather keep the monitor out of this state.

> +    CPUArchState *env;
> +    int vaw, paw; /* VA and PA width in characters */
> +    int max_height;
> +    bool (*flusher)(CPUState *cs, struct mem_print_state *state);
> +    bool flush_interior; /* If false, only call flusher() on leaves */
> +    bool require_physical_contiguity;
> +    /*
> +     * The height at which we started accumulating ranges, i.e., the
> +     * next height we need to print once we hit the end of a
> +     * contiguous range.
> +     */
> +    int start_height;
> +    /*
> +     * For compressing contiguous ranges, track the
> +     * start and end of the range
> +     */
> +    hwaddr vstart[MAX_HEIGHT + 1]; /* Starting virt. addr. of open pte range */
> +    hwaddr vend[MAX_HEIGHT + 1]; /* Ending virtual address of open pte range */
> +    hwaddr pstart; /* Starting physical address of open pte range */
> +    hwaddr pend; /* Ending physical address of open pte range */
> +    int64_t ent[MAX_HEIGHT + 1]; /* PTE contents on current root->leaf path */
> +    int offset[MAX_HEIGHT + 1]; /* PTE range starting offsets */
> +    int last_offset[MAX_HEIGHT + 1]; /* PTE range ending offsets */
> +};
> +
>   /*
>    * struct SysemuCPUOps: System operations specific to a CPU class
>    */
> @@ -87,6 +120,129 @@ typedef struct SysemuCPUOps {
>        */
>       const VMStateDescription *legacy_vmsd;
>   
> +    /**
> +     * page_table_root - Given a CPUState, return the physical address
> +     *                    of the current page table root, as well as
> +     *                    write the height of the tree into *height.
> +     *
> +     * @cs - CPU state
> +
> +     * @height - a pointer to an integer, to store the page table tree
> +     *           height
> +     *
> +     * Returns a hardware address on success.  Should not fail (i.e.,
> +     * caller is responsible to ensure that a page table is actually
> +     * present).
> +     */
> +    hwaddr (*page_table_root)(CPUState *cs, int *height);
> +
> +    /**
> +     * page_table_entries_per_node - Return the number of entries in a
> +     *                                   page table node for the CPU
> +     *                                   at a given height.
> +     *
> +     * @cs - CPU state
> +     * @height - height of the page table tree to query, where the leaves
> +     *          are 1.
> +     *
> +     * Returns a value greater than zero on success, -1 on error.
> +     */
> +    int (*page_table_entries_per_node)(CPUState *cs, int height);
> +
> +    /**
> +     * get_pte - Copy the contents of the page table entry at node[i]
> +     *           into pt_entry.  Optionally, add the relevant bits to
> +     *           the virtual address in vaddr_pte.
> +     *
> +     * @cs - CPU state
> +     * @node - physical address of the current page table node
> +     * @i - index (in page table entries, not bytes) of the page table
> +     *      entry, within node
> +     * @height - height of node within the tree (leaves are 1, not 0)
> +     * @pt_entry - Pointer to a PTE_t, stores the contents of the page
> +     *             table entry
> +     * @vaddr_parent - The virtual address bits already translated in
> +     *                 walking the page table to node.  Optional: only
> +     *                 used if vaddr_pte is set.
> +     * @vaddr_pte - Optional pointer to a variable storing the virtual
> +     *              address bits translated by node[i].
> +     * @pte_paddr - Pointer to the physical address of the PTE within node.
> +     *              Optional parameter.
> +     */
> +
> +    void (*get_pte)(CPUState *cs, hwaddr node, int i, int height,
> +                    PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
> +                    hwaddr *pte_paddr);
> +
> +    /**
> +     * pte_present - Return true if the pte is marked 'present'
> +     */
> +    bool (*pte_present)(CPUState *cs, PTE_t *pte);
> +
> +    /**
> +     * pte_leaf - Return true if the pte is a page table leaf, false
> +     *                if the pte points to another node in the radix
> +     *                tree.
> +     */
> +    bool (*pte_leaf)(CPUState *cs, int height, PTE_t *pte);
> +
> +    /**
> +     * pte_child - Returns the physical address of a radix tree node
> +     *                 pointed to by pte.
> +     *
> +     * @cs - CPU state
> +     * @pte - The page table entry
> +     * @height - The height in the tree of pte
> +     *
> +     * Returns the physical address stored in pte on success, -1 on
> +     * error.
> +     */
> +    hwaddr (*pte_child)(CPUState *cs, PTE_t *pte, int height);
> +
> +    /**
> +     * pte_leaf_page_size - Return the page size of a leaf entry,
> +     *                          given the height and CPU state
> +     *
> +     * @cs - CPU state
> +     * @height - height of the page table tree to query, where the leaves
> +     *          are 1.
> +     *
> +     * Returns a value greater than zero on success, -1 on error.
> +     */
> +    uint64_t (*pte_leaf_page_size)(CPUState *cs, int height);
> +
> +    /**
> +     * pte_flags - Return the flag bits of the page table entry.
> +     *
> +     * @pte - the contents of the PTE, not the address.
> +     *
> +     * Returns pte with the non-flag bits masked out.
> +     */
> +    uint64_t (*pte_flags)(uint64_t pte);
> +
> +    /**
> +     * @mon_init_page_table_iterator: Callback to configure a page table
> +     * iterator for use by a monitor function.
> +     * Returns true on success, false if not supported (e.g., no paging disabled
> +     * or not implemented on this CPU).
> +     */
> +    bool (*mon_init_page_table_iterator)(Monitor *mon,
> +                                         struct mem_print_state *state);

Here too, I'm reluctant to wire the monitor to this API.

If we want text, we should use HumanReadableText.

> +    /**
> +     * @mon_info_pg_print_header: Prints the header line for 'info pg'.
> +     */
> +    void (*mon_info_pg_print_header)(Monitor *mon,
> +                                     struct mem_print_state *state);
> +
> +    /**
> +     * @flush_page_table_iterator_state: Prints the last entry,
> +     * if one is present.  Useful for iterators that aggregate information
> +     * across page table entries.
> +     */
> +    bool (*mon_flush_page_print_state)(CPUState *cs,
> +                                       struct mem_print_state *state);
> +
>   } SysemuCPUOps;
>   
>   #endif /* SYSEMU_CPU_OPS_H */
> diff --git a/include/monitor/hmp-target.h b/include/monitor/hmp-target.h
> index b679aaebbf..9af72ea58d 100644
> --- a/include/monitor/hmp-target.h
> +++ b/include/monitor/hmp-target.h
> @@ -50,6 +50,7 @@ CPUState *mon_get_cpu(Monitor *mon);
>   void hmp_info_mem(Monitor *mon, const QDict *qdict);
>   void hmp_info_tlb(Monitor *mon, const QDict *qdict);
>   void hmp_mce(Monitor *mon, const QDict *qdict);
> +void hmp_info_pg(Monitor *mon, const QDict *qdict);
>   void hmp_info_local_apic(Monitor *mon, const QDict *qdict);
>   void hmp_info_sev(Monitor *mon, const QDict *qdict);
>   void hmp_info_sgx(Monitor *mon, const QDict *qdict);
> diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
> index ff01cf9d8d..60a8bd0c37 100644
> --- a/monitor/hmp-cmds-target.c
> +++ b/monitor/hmp-cmds-target.c
> @@ -31,6 +31,7 @@
>   #include "qapi/error.h"
>   #include "qapi/qmp/qdict.h"
>   #include "sysemu/hw_accel.h"
> +#include "hw/core/sysemu-cpu-ops.h"
>   
>   /* Set the current CPU defined by the user. Callers must hold BQL. */
>   int monitor_set_cpu(Monitor *mon, int cpu_index)
> @@ -120,6 +121,203 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
>       }
>   }
>   
> +/* Assume only called on present entries */
> +static
> +int compressing_iterator(CPUState *cs, void *data, PTE_t *pte,
> +                         vaddr vaddr_in, int height, int offset)
> +{
> +    CPUClass *cc = CPU_GET_CLASS(cs);
> +    struct mem_print_state *state = (struct mem_print_state *) data;
> +    hwaddr paddr = cc->sysemu_ops->pte_child(cs, pte, height);
> +    target_ulong size = cc->sysemu_ops->pte_leaf_page_size(cs, height);
> +    bool start_new_run = false, flush = false;
> +    bool is_leaf = cc->sysemu_ops->pte_leaf(cs, height, pte);
> +
> +    int entries_per_node = cc->sysemu_ops->page_table_entries_per_node(cs,
> +                                                                       height);
> +
> +    /* Prot of current pte */
> +    int prot = cc->sysemu_ops->pte_flags(pte->pte64_t);
> +
> +    /* If there is a prior run, first try to extend it. */
> +    if (state->start_height != 0) {
> +
> +        /*
> +         * If we aren't flushing interior nodes, raise the start height.
> +         * We don't need to detect non-compressible interior nodes.
> +         */
> +        if ((!state->flush_interior) && state->start_height < height) {
> +            state->start_height = height;
> +            state->vstart[height] = vaddr_in;
> +            state->vend[height] = vaddr_in;
> +            state->ent[height] = pte->pte64_t;
> +            if (offset == 0) {
> +                state->last_offset[height] = entries_per_node - 1;
> +            } else {
> +                state->last_offset[height] = offset - 1;
> +            }
> +        }
> +
> +        /* Detect when we are walking down the "left edge" of a range */
> +        if (state->vstart[height] == -1
> +            && (height + 1) <= state->start_height
> +            && state->vstart[height + 1] == vaddr_in) {
> +
> +            state->vstart[height] = vaddr_in;
> +            state->vend[height] = vaddr_in;
> +            state->ent[height] = pte->pte64_t;
> +            state->offset[height] = offset;
> +            state->last_offset[height] = offset;
> +
> +            if (is_leaf) {
> +                state->pstart = paddr;
> +                state->pend = paddr;
> +            }
> +
> +            /* Detect contiguous entries at same level */
> +        } else if ((state->vstart[height] != -1)
> +                   && (state->start_height >= height)
> +                   && cc->sysemu_ops->pte_flags(state->ent[height]) == prot
> +                   && (((state->last_offset[height] + 1) % entries_per_node)
> +                       == offset)
> +                   && ((!is_leaf)
> +                       || (!state->require_physical_contiguity)
> +                       || (state->pend + size == paddr))) {
> +
> +
> +            /*
> +             * If there are entries at the levels below, make sure we
> +             * completed them.  We only compress interior nodes
> +             * without holes in the mappings.
> +             */
> +            if (height != 1) {
> +                for (int i = height - 1; i >= 1; i--) {
> +                    int entries = cc->sysemu_ops->page_table_entries_per_node(
> +                        cs, i);
> +
> +                    /* Stop if we hit large pages before level 1 */
> +                    if (state->vstart[i] == -1) {
> +                        break;
> +                    }
> +
> +                    if ((state->last_offset[i] + 1) != entries) {
> +                        flush = true;
> +                        start_new_run = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +
> +            if (!flush) {
> +
> +                /* We can compress these entries */
> +                state->ent[height] = pte->pte64_t;
> +                state->vend[height] = vaddr_in;
> +                state->last_offset[height] = offset;
> +
> +                /* Only update the physical range on leaves */
> +                if (is_leaf) {
> +                    state->pend = paddr;
> +                }
> +            }
> +            /* Let PTEs accumulate... */
> +        } else {
> +            flush = true;
> +        }
> +
> +        if (flush) {
> +            /*
> +             * We hit dicontiguous permissions or pages.
> +             * Print the old entries, then start accumulating again
> +             *
> +             * Some clients only want the flusher called on a leaf.
> +             * Check that too.
> +             *
> +             * We can infer whether the accumulated range includes a
> +             * leaf based on whether pstart is -1.
> +             */
> +            if (state->flush_interior || (state->pstart != -1)) {
> +                if (state->flusher(cs, state)) {
> +                    start_new_run = true;
> +                }
> +            } else {
> +                start_new_run = true;
> +            }
> +        }
> +    } else {
> +        start_new_run = true;
> +    }
> +
> +    if (start_new_run) {
> +        /* start a new run with this PTE */
> +        for (int i = state->start_height; i > 0; i--) {
> +            if (state->vstart[i] != -1) {
> +                state->ent[i] = 0;
> +                state->last_offset[i] = 0;
> +                state->vstart[i] = -1;
> +            }
> +        }
> +        state->pstart = -1;
> +        state->vstart[height] = vaddr_in;
> +        state->vend[height] = vaddr_in;
> +        state->ent[height] = pte->pte64_t;
> +        state->offset[height] = offset;
> +        state->last_offset[height] = offset;
> +        if (is_leaf) {
> +            state->pstart = paddr;
> +            state->pend = paddr;
> +        }
> +        state->start_height = height;
> +    }
> +
> +    return 0;
> +}
> +
> +void hmp_info_pg(Monitor *mon, const QDict *qdict)
> +{
> +    struct mem_print_state state;
> +
> +    CPUState *cs = mon_get_cpu(mon);
> +    if (!cs) {
> +        monitor_printf(mon, "Unable to get CPUState.  Internal error\n");
> +        return;
> +    }
> +
> +    CPUClass *cc = CPU_GET_CLASS(cs);
> +
> +    if ((!cc->sysemu_ops->pte_child)
> +        || (!cc->sysemu_ops->pte_leaf)
> +        || (!cc->sysemu_ops->pte_leaf_page_size)
> +        || (!cc->sysemu_ops->page_table_entries_per_node)
> +        || (!cc->sysemu_ops->pte_flags)
> +        || (!cc->sysemu_ops->mon_init_page_table_iterator)
> +        || (!cc->sysemu_ops->mon_info_pg_print_header)
> +        || (!cc->sysemu_ops->mon_flush_page_print_state)) {
> +        monitor_printf(mon, "Info pg unsupported on this ISA\n");
> +        return;
> +    }
> +
> +    if (!cc->sysemu_ops->mon_init_page_table_iterator(mon, &state)) {
> +        monitor_printf(mon, "Unable to initialize page table iterator\n");
> +        return;
> +    }
> +
> +    state.flush_interior = true;
> +    state.require_physical_contiguity = true;
> +    state.flusher = cc->sysemu_ops->mon_flush_page_print_state;
> +
> +    cc->sysemu_ops->mon_info_pg_print_header(mon, &state);
> +
> +    /*
> +     * We must visit interior entries to get the hierarchy, but
> +     * can skip not present mappings
> +     */
> +    for_each_pte(cs, &compressing_iterator, &state, true, false);
> +
> +    /* Print last entry, if one present */
> +    cc->sysemu_ops->mon_flush_page_print_state(cs, &state);
> +}
>   static void memory_dump(Monitor *mon, int count, int format, int wsize,
>                           hwaddr addr, int is_physical)
>   {
> diff --git a/target/i386/arch_memory_mapping.c b/target/i386/arch_memory_mapping.c
> index d1ff659128..562a00b5a7 100644
> --- a/target/i386/arch_memory_mapping.c
> +++ b/target/i386/arch_memory_mapping.c
> @@ -15,6 +15,356 @@
>   #include "cpu.h"
>   #include "sysemu/memory_mapping.h"
>   
> +/**
> + ************** code hook implementations for x86 ***********
> + */
> +
> +#define PML4_ADDR_MASK 0xffffffffff000ULL /* selects bits 51:12 */
> +
> +/**
> + * x86_page_table_root - Given a CPUState, return the physical address
> + *                       of the current page table root, as well as
> + *                       write the height of the tree into *height.
> + *
> + * @cs - CPU state
> + * @height - a pointer to an integer, to store the page table tree height
> + *
> + * Returns a hardware address on success.  Should not fail (i.e., caller is
> + * responsible to ensure that a page table is actually present).
> + */
> +hwaddr x86_page_table_root(CPUState *cs, int *height)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    /*
> +     * DEP 5/15/24: Some original page table walking code sets the a20
> +     * mask as a 32 bit integer and checks it on each level of hte
> +     * page table walk; some only checks it against the final result.
> +     * For 64 bits, I think we need to sign extend in the common case
> +     * it is not set (and returns -1), or we will lose bits.
> +     */
> +    int64_t a20_mask;
> +
> +    assert(cpu_paging_enabled(cs));
> +    a20_mask = x86_get_a20_mask(env);
> +
> +    if (env->cr[4] & CR4_PAE_MASK) {
> +#ifdef TARGET_X86_64
> +        if (env->hflags & HF_LMA_MASK) {
> +            if (env->cr[4] & CR4_LA57_MASK) {
> +                *height = 5;
> +            } else {
> +                *height = 4;
> +            }
> +            return (env->cr[3] & PML4_ADDR_MASK) & a20_mask;
> +        } else
> +#endif
> +        {
> +            *height = 3;
> +            return (env->cr[3] & ~0x1f) & a20_mask;
> +        }
> +    } else {
> +        *height = 2;
> +        return (env->cr[3] & ~0xfff) & a20_mask;
> +    }
> +}
> +
> +
> +/**
> + * x86_page_table_entries_per_node - Return the number of entries in a
> + *                                   page table node for the CPU at a
> + *                                   given height.
> + *
> + * @cs - CPU state
> + * @height - height of the page table tree to query, where the leaves
> + *          are 1.
> + *
> + * Returns a value greater than zero on success, -1 on error.
> + */
> +int x86_page_table_entries_per_node(CPUState *cs, int height)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
> +
> +    assert(height < 6);
> +    assert(height > 0);
> +
> +    switch (height) {
> +#ifdef TARGET_X86_64
> +    case 5:
> +        assert(env->cr[4] & CR4_LA57_MASK);
> +    case 4:
> +        assert(env->hflags & HF_LMA_MASK);
> +        assert(pae_enabled);
> +        return 512;
> +#endif
> +    case 3:
> +        assert(pae_enabled);
> +#ifdef TARGET_X86_64
> +        if (env->hflags & HF_LMA_MASK) {
> +            return 512;
> +        } else
> +#endif
> +        {
> +            return 4;
> +        }
> +    case 2:
> +    case 1:
> +        return pae_enabled ? 512 : 1024;
> +    default:
> +        g_assert_not_reached();
> +    }
> +    return -1;
> +}
> +
> +/**
> + * x86_pte_leaf_page_size - Return the page size of a leaf entry,
> + *                          given the height and CPU state
> + *
> + * @cs - CPU state
> + * @height - height of the page table tree to query, where the leaves
> + *          are 1.
> + *
> + * Returns a value greater than zero on success, -1 on error.
> + */
> +uint64_t x86_pte_leaf_page_size(CPUState *cs, int height)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
> +
> +    assert(height < 6);
> +    assert(height > 0);
> +
> +    switch (height) {
> +#ifdef TARGET_X86_64
> +    case 5:
> +        assert(pae_enabled);
> +        assert(env->cr[4] & CR4_LA57_MASK);
> +        assert(env->hflags & HF_LMA_MASK);
> +        return 1ULL << 48;
> +    case 4:
> +        assert(pae_enabled);
> +        assert(env->hflags & HF_LMA_MASK);
> +        return 1ULL << 39;
> +#endif
> +    case 3:
> +        assert(pae_enabled);
> +        return 1 << 30;
> +    case 2:
> +        if (pae_enabled) {
> +            return 1 << 21;
> +        } else {
> +            return 1 << 22;
> +        }
> +    case 1:
> +        return 4096;
> +    default:
> +        g_assert_not_reached();
> +    }
> +    return -1;
> +}
> +
> +/*
> + * Given a CPU state and height, return the number of bits
> + * to shift right/left in going from virtual to PTE index
> + * and vice versa, the number of useful bits.
> + */
> +static void _mmu_decode_va_parameters(CPUState *cs, int height,
> +                                      int *shift, int *width)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    int _shift = 0;
> +    int _width = 0;
> +    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
> +
> +    switch (height) {
> +    case 5:
> +        _shift = 48;
> +        _width = 9;
> +        break;
> +    case 4:
> +        _shift = 39;
> +        _width = 9;
> +        break;
> +    case 3:
> +        _shift = 30;
> +        _width = 9;
> +        break;
> +    case 2:
> +        /* 64 bit page tables shift from 30->21 bits here */
> +        if (pae_enabled) {
> +            _shift = 21;
> +            _width = 9;
> +        } else {
> +            /* 32 bit page tables shift from 32->22 bits */
> +            _shift = 22;
> +            _width = 10;
> +        }
> +        break;
> +    case 1:
> +        _shift = 12;
> +        if (pae_enabled) {
> +            _width = 9;
> +        } else {
> +            _width = 10;
> +        }
> +
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    if (shift) {
> +        *shift = _shift;
> +    }
> +
> +    if (width) {
> +        *width = _width;
> +    }
> +}
> +
> +/**
> + * get_pte - Copy the contents of the page table entry at node[i] into pt_entry.
> + *           Optionally, add the relevant bits to the virtual address in
> + *           vaddr_pte.
> + *
> + * @cs - CPU state
> + * @node - physical address of the current page table node
> + * @i - index (in page table entries, not bytes) of the page table
> + *      entry, within node
> + * @height - height of node within the tree (leaves are 1, not 0)
> + * @pt_entry - Poiter to a PTE_t, stores the contents of the page table entry
> + * @vaddr_parent - The virtual address bits already translated in walking the
> + *                 page table to node.  Optional: only used if vaddr_pte is set.
> + * @vaddr_pte - Optional pointer to a variable storing the virtual address bits
> + *              translated by node[i].
> + * @pte_paddr - Pointer to the physical address of the PTE within node.
> + *              Optional parameter.
> + */
> +void
> +x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
> +            PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
> +            hwaddr *pte_paddr)
> +
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    int32_t a20_mask = x86_get_a20_mask(env);
> +    hwaddr pte;
> +
> +    if (env->hflags & HF_LMA_MASK) {
> +        /* 64 bit */
> +        int pte_width = 8;
> +        pte = (node + (i * pte_width)) & a20_mask;
> +        pt_entry->pte64_t = address_space_ldq(cs->as, pte,
> +                                              MEMTXATTRS_UNSPECIFIED, NULL);
> +    } else {
> +        /* 32 bit */
> +        int pte_width = 4;
> +        pte = (node + (i * pte_width)) & a20_mask;
> +        pt_entry->pte32_t = address_space_ldl(cs->as, pte,
> +                                              MEMTXATTRS_UNSPECIFIED, NULL);
> +    }
> +
> +    if (vaddr_pte) {
> +        int shift = 0;
> +        _mmu_decode_va_parameters(cs, height, &shift, NULL);
> +        *vaddr_pte = vaddr_parent | ((i & 0x1ffULL) << shift);
> +    }
> +
> +    if (pte_paddr) {
> +        *pte_paddr = pte;
> +    }
> +}
> +
> +
> +static bool
> +mmu_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    if (env->hflags & HF_LMA_MASK) {
> +        return pte->pte64_t & mask;
> +    } else {
> +        return pte->pte32_t & mask;
> +    }
> +}
> +
> +/**
> + * x86_pte_present - Return true if the pte is marked 'present'
> + */
> +bool
> +x86_pte_present(CPUState *cs, PTE_t *pte)
> +{
> +    return mmu_pte_check_bits(cs, pte, PG_PRESENT_MASK);
> +}
> +
> +/**
> + * x86_pte_leaf - Return true if the pte is
> + *                a page table leaf, false if
> + *                the pte points to another
> + *                node in the radix tree.
> + */
> +bool
> +x86_pte_leaf(CPUState *cs, int height, PTE_t *pte)
> +{
> +    return height == 1 || mmu_pte_check_bits(cs, pte, PG_PSE_MASK);
> +}
> +
> +/**
> + * x86_pte_child - Returns the physical address
> + *                 of a radix tree node pointed to by pte.
> + *
> + * @cs - CPU state
> + * @pte - The page table entry
> + * @height - The height in the tree of pte
> + *
> + * Returns the physical address stored in pte on success,
> + *     -1 on error.
> + */
> +hwaddr
> +x86_pte_child(CPUState *cs, PTE_t *pte, int height)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
> +    int32_t a20_mask = x86_get_a20_mask(env);
> +
> +    switch (height) {
> +#ifdef TARGET_X86_64
> +    case 5:
> +        assert(env->cr[4] & CR4_LA57_MASK);
> +    case 4:
> +        assert(env->hflags & HF_LMA_MASK);
> +        /* assert(pae_enabled); */
> +        /* Fall through */
> +#endif
> +    case 3:
> +        assert(pae_enabled);
> +#ifdef TARGET_X86_64
> +        if (env->hflags & HF_LMA_MASK) {
> +            return (pte->pte64_t & PG_ADDRESS_MASK) & a20_mask;
> +        } else
> +#endif
> +        {
> +            return (pte->pte64_t & ~0xfff) & a20_mask;
> +        }
> +    case 2:
> +    case 1:
> +        if (pae_enabled) {
> +            return (pte->pte64_t & PG_ADDRESS_MASK) & a20_mask;
> +        } else {
> +            return (pte->pte32_t & ~0xfff) & a20_mask;
> +        }
> +    default:
> +        g_assert_not_reached();
> +    }
> +    return -1;
> +}
> +
>   /* PAE Paging or IA-32e Paging */
>   static void walk_pte(MemoryMappingList *list, AddressSpace *as,
>                        hwaddr pte_start_addr,
> @@ -313,4 +663,3 @@ bool x86_cpu_get_memory_mapping(CPUState *cs, MemoryMappingList *list,
>   
>       return true;
>   }
> -
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 914bef442c..8bd6164b68 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -8305,6 +8305,17 @@ static const struct SysemuCPUOps i386_sysemu_ops = {
>       .write_elf32_qemunote = x86_cpu_write_elf32_qemunote,
>       .write_elf64_qemunote = x86_cpu_write_elf64_qemunote,
>       .legacy_vmsd = &vmstate_x86_cpu,
> +    .page_table_root = &x86_page_table_root,
> +    .page_table_entries_per_node = &x86_page_table_entries_per_node,
> +    .get_pte = &x86_get_pte,
> +    .pte_present = &x86_pte_present,
> +    .pte_leaf = &x86_pte_leaf,
> +    .pte_child = &x86_pte_child,
> +    .pte_leaf_page_size = &x86_pte_leaf_page_size,
> +    .pte_flags = &x86_pte_flags,
> +    .mon_init_page_table_iterator = &x86_mon_init_page_table_iterator,
> +    .mon_info_pg_print_header = &x86_mon_info_pg_print_header,
> +    .mon_flush_page_print_state = &x86_mon_flush_print_pg_state,
>   };
>   #endif
>   
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index c64ef0c1a2..cbb6f6fc4d 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -21,6 +21,7 @@
>   #define I386_CPU_H
>   
>   #include "sysemu/tcg.h"
> +#include "hw/core/sysemu-cpu-ops.h"
>   #include "cpu-qom.h"
>   #include "kvm/hyperv-proto.h"
>   #include "exec/cpu-defs.h"
> @@ -2150,8 +2151,22 @@ int x86_cpu_write_elf64_qemunote(WriteCoreDumpFunction f, CPUState *cpu,
>   int x86_cpu_write_elf32_qemunote(WriteCoreDumpFunction f, CPUState *cpu,
>                                    DumpState *s);
>   
> +hwaddr x86_page_table_root(CPUState *cs, int *height);
> +int x86_page_table_entries_per_node(CPUState *cs, int height);
> +uint64_t x86_pte_leaf_page_size(CPUState *cs, int height);
> +void x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
> +                 PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
> +                 hwaddr *pte_paddr);
> +bool x86_pte_present(CPUState *cs, PTE_t *pte);
> +bool x86_pte_leaf(CPUState *cs, int height, PTE_t *pte);
> +hwaddr x86_pte_child(CPUState *cs, PTE_t *pte, int height);
> +uint64_t x86_pte_flags(uint64_t pte);
>   bool x86_cpu_get_memory_mapping(CPUState *cpu, MemoryMappingList *list,
>                                   Error **errp);
> +bool x86_mon_init_page_table_iterator(Monitor *mon,
> +                                      struct mem_print_state *state);
> +void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state);
> +bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
>   
>   void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
>   
> diff --git a/target/i386/monitor.c b/target/i386/monitor.c
> index 2d766b2637..65e82e73e8 100644
> --- a/target/i386/monitor.c
> +++ b/target/i386/monitor.c
> @@ -32,6 +32,171 @@
>   #include "qapi/qapi-commands-misc-target.h"
>   #include "qapi/qapi-commands-misc.h"
>   
> +
> +/********************* x86 specific hooks for printing page table stuff ****/
> +
> +const char *names[7] = {(char *)NULL, "PTE", "PDE", "PDP", "PML4", "Pml5",
> +                        (char *)NULL};
> +static char *pg_bits(hwaddr ent)
> +{
> +    static char buf[32];
> +    snprintf(buf, 32, "%c%c%c%c%c%c%c%c%c%c",
> +            ent & PG_NX_MASK ? 'X' : '-',
> +            ent & PG_GLOBAL_MASK ? 'G' : '-',
> +            ent & PG_PSE_MASK ? 'S' : '-',
> +            ent & PG_DIRTY_MASK ? 'D' : '-',
> +            ent & PG_ACCESSED_MASK ? 'A' : '-',
> +            ent & PG_PCD_MASK ? 'C' : '-',
> +            ent & PG_PWT_MASK ? 'T' : '-',
> +            ent & PG_USER_MASK ? 'U' : '-',
> +            ent & PG_RW_MASK ? 'W' : '-',
> +            ent & PG_PRESENT_MASK ? 'P' : '-');
> +    return buf;
> +}
> +
> +bool x86_mon_init_page_table_iterator(Monitor *mon,
> +                                      struct mem_print_state *state)
> +{
> +    CPUArchState *env;
> +    state->mon = mon;
> +    state->flush_interior = false;
> +    state->require_physical_contiguity = false;
> +
> +    for (int i = 0; i < MAX_HEIGHT; i++) {
> +        state->vstart[i] = -1;
> +        state->last_offset[i] = 0;
> +    }
> +    state->start_height = 0;
> +
> +    env = mon_get_cpu_env(mon);
> +    if (!env) {
> +        monitor_printf(mon, "No CPU available\n");
> +        return false;
> +    }
> +    state->env = env;
> +
> +    if (!(env->cr[0] & CR0_PG_MASK)) {
> +        monitor_printf(mon, "PG disabled\n");
> +        return false;
> +    }
> +
> +    /* set va and pa width */
> +    if (env->cr[4] & CR4_PAE_MASK) {
> +        state->paw = 13;
> +#ifdef TARGET_X86_64
> +        if (env->hflags & HF_LMA_MASK) {
> +            if (env->cr[4] & CR4_LA57_MASK) {
> +                state->vaw = 15;
> +                state->max_height = 5;
> +            } else {
> +                state->vaw = 12;
> +                state->max_height = 4;
> +            }
> +        } else
> +#endif
> +        {
> +            state->vaw = 8;
> +            state->max_height = 3;
> +        }
> +    } else {
> +        state->max_height = 2;
> +        state->vaw = 8;
> +        state->paw = 8;
> +    }
> +
> +    return true;
> +}
> +
> +void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state)
> +{
> +    /* Header line */
> +    monitor_printf(mon, "%-*s %-13s %-10s %*s%s\n",
> +                   3 + 2 * (state->vaw - 3), "VPN range",
> +                   "Entry", "Flags",
> +                   2 * (state->max_height - 1), "", "Physical page(s)");
> +}
> +
> +
> +static void pg_print(CPUState *cs, Monitor *mon, uint64_t pt_ent,
> +                     target_ulong vaddr_s, target_ulong vaddr_l,
> +                     hwaddr paddr_s, hwaddr paddr_l,
> +                     int offset_s, int offset_l,
> +                     int height, int max_height, int vaw, int paw,
> +                     bool is_leaf)
> +
> +{
> +    char buf[128];
> +    char *pos = buf, *end = buf + sizeof(buf);
> +    target_ulong size = x86_pte_leaf_page_size(cs, height);
> +
> +    /* VFN range */
> +    pos += snprintf(pos, end - pos, "%*s[%0*"PRIx64"-%0*"PRIx64"] ",
> +                    (max_height - height) * 2, "",
> +                    vaw - 3, (uint64_t)vaddr_s >> 12,
> +                    vaw - 3, ((uint64_t)vaddr_l + size - 1) >> 12);
> +
> +    /* Slot */
> +    if (vaddr_s == vaddr_l) {
> +        pos += snprintf(pos, end - pos, "%4s[%03x]    ",
> +                       names[height], offset_s);
> +    } else {
> +        pos += snprintf(pos, end - pos, "%4s[%03x-%03x]",
> +                       names[height], offset_s, offset_l);
> +    }
> +
> +    /* Flags */
> +    pos += snprintf(pos, end - pos, " %s", pg_bits(pt_ent));
> +
> +
> +    /* Range-compressed PFN's */
> +    if (is_leaf) {
> +        if (vaddr_s == vaddr_l) {
> +            pos += snprintf(pos, end - pos, " %0*"PRIx64,
> +                            paw - 3, (uint64_t)paddr_s >> 12);
> +        } else {
> +            pos += snprintf(pos, end - pos, " %0*"PRIx64"-%0*"PRIx64,
> +                            paw - 3, (uint64_t)paddr_s >> 12,
> +                            paw - 3, (uint64_t)paddr_l >> 12);
> +        }
> +        pos = MIN(pos, end);
> +    }
> +
> +    /* Trim line to fit screen */
> +    if (pos - buf > 79) {
> +        strcpy(buf + 77, "..");
> +    }
> +
> +    monitor_printf(mon, "%s\n", buf);
> +}
> +
> +uint64_t x86_pte_flags(uint64_t pte)
> +{
> +    return pte & (PG_USER_MASK | PG_RW_MASK |
> +                  PG_PRESENT_MASK);
> +}
> +
> +/* Returns true if it emitted anything */
> +bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state)
> +{
> +    bool ret = false;
> +    for (int i = state->start_height; i > 0; i--) {
> +        if (state->vstart[i] == -1) {
> +            break;
> +        }
> +        PTE_t my_pte;
> +        my_pte.pte64_t = state->ent[i];
> +        ret = true;
> +        pg_print(cs, state->mon, state->ent[i],
> +                 state->vstart[i], state->vend[i],
> +                 state->pstart, state->pend,
> +                 state->offset[i], state->last_offset[i],
> +                 i, state->max_height, state->vaw, state->paw,
> +                 x86_pte_leaf(cs, i, &my_pte));
> +    }
> +
> +    return ret;
> +}
> +
>   /* Perform linear address sign extension */
>   static hwaddr addr_canonical(CPUArchState *env, hwaddr addr)
>   {



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code.
  2024-06-06 14:02 ` [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code Don Porter
@ 2024-06-07  6:12   ` Philippe Mathieu-Daudé
  2024-06-07 17:03   ` Richard Henderson
  1 sibling, 0 replies; 22+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-06-07  6:12 UTC (permalink / raw)
  To: Don Porter, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, richard.henderson

On 6/6/24 16:02, Don Porter wrote:
> Signed-off-by: Don Porter <porter@cs.unc.edu>
> ---
>   target/i386/cpu.h                    |  42 ++
>   target/i386/helper.c                 | 515 +++++++++++++++++++++++++
>   target/i386/tcg/sysemu/excp_helper.c | 555 +--------------------------
>   3 files changed, 562 insertions(+), 550 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
  2024-06-07  6:09   ` Philippe Mathieu-Daudé
@ 2024-06-07  7:16   ` Daniel P. Berrangé
  2024-06-11 18:49     ` Don Porter
  2024-06-07 16:57   ` Richard Henderson
  2024-06-07 17:43   ` Richard Henderson
  3 siblings, 1 reply; 22+ messages in thread
From: Daniel P. Berrangé @ 2024-06-07  7:16 UTC (permalink / raw)
  To: Don Porter
  Cc: qemu-devel, dave, peter.maydell, nadav.amit, richard.henderson,
	philmd

On Thu, Jun 06, 2024 at 10:02:48AM -0400, Don Porter wrote:
> The new "info pg" monitor command prints the current page table,
> including virtual address ranges, flag bits, and snippets of physical
> page numbers.  Completely filled regions of the page table with
> compatible flags are "folded", with the result that the complete
> output for a freshly booted x86-64 Linux VM can fit in a single
> terminal window.  The output looks like this:
> 
> VPN range             Entry         Flags            Physical page
> [7f0000000-7f0000000] PML4[0fe]     ---DA--UWP
>   [7f28c0000-7f28fffff]  PDP[0a3]     ---DA--UWP
>     [7f28c4600-7f28c47ff]  PDE[023]     ---DA--UWP
>       [7f28c4655-7f28c4656]  PTE[055-056] X--D---U-P 0000007f14-0000007f15
>       [7f28c465b-7f28c465b]  PTE[05b]     ----A--U-P 0000001cfc
> ...
> [ff8000000-ff8000000] PML4[1ff]     ---DA--UWP
>   [ffff80000-ffffbffff]  PDP[1fe]     ---DA---WP
>     [ffff81000-ffff81dff]  PDE[008-00e] -GSDA---WP 0000001000-0000001dff
>   [ffffc0000-fffffffff]  PDP[1ff]     ---DA--UWP
>     [ffffff400-ffffff5ff]  PDE[1fa]     ---DA--UWP
>       [ffffff5fb-ffffff5fc]  PTE[1fb-1fc] XG-DACT-WP 00000fec00 00000fee00
>     [ffffff600-ffffff7ff]  PDE[1fb]     ---DA--UWP
>       [ffffff600-ffffff600]  PTE[000]     -G-DA--U-P 0000001467
> 
> This draws heavy inspiration from Austin Clements' original patch.
> 
> This also adds a generic page table walker, which other monitor
> and execution commands will be migrated to in subsequent patches.
> 
> Signed-off-by: Don Porter <porter@cs.unc.edu>
> ---
>  hmp-commands-info.hx              |  13 ++
>  hw/core/cpu-sysemu.c              | 140 ++++++++++++
>  include/hw/core/cpu.h             |  34 ++-
>  include/hw/core/sysemu-cpu-ops.h  | 156 +++++++++++++
>  include/monitor/hmp-target.h      |   1 +
>  monitor/hmp-cmds-target.c         | 198 +++++++++++++++++
>  target/i386/arch_memory_mapping.c | 351 +++++++++++++++++++++++++++++-
>  target/i386/cpu.c                 |  11 +
>  target/i386/cpu.h                 |  15 ++
>  target/i386/monitor.c             | 165 ++++++++++++++
>  10 files changed, 1082 insertions(+), 2 deletions(-)
> +
> +void hmp_info_pg(Monitor *mon, const QDict *qdict)
> +{
> +    struct mem_print_state state;
> +
> +    CPUState *cs = mon_get_cpu(mon);
> +    if (!cs) {
> +        monitor_printf(mon, "Unable to get CPUState.  Internal error\n");
> +        return;
> +    }
> +
> +    CPUClass *cc = CPU_GET_CLASS(cs);
> +
> +    if ((!cc->sysemu_ops->pte_child)
> +        || (!cc->sysemu_ops->pte_leaf)
> +        || (!cc->sysemu_ops->pte_leaf_page_size)
> +        || (!cc->sysemu_ops->page_table_entries_per_node)
> +        || (!cc->sysemu_ops->pte_flags)
> +        || (!cc->sysemu_ops->mon_init_page_table_iterator)
> +        || (!cc->sysemu_ops->mon_info_pg_print_header)
> +        || (!cc->sysemu_ops->mon_flush_page_print_state)) {
> +        monitor_printf(mon, "Info pg unsupported on this ISA\n");
> +        return;
> +    }
> +
> +    if (!cc->sysemu_ops->mon_init_page_table_iterator(mon, &state)) {
> +        monitor_printf(mon, "Unable to initialize page table iterator\n");
> +        return;
> +    }
> +
> +    state.flush_interior = true;
> +    state.require_physical_contiguity = true;
> +    state.flusher = cc->sysemu_ops->mon_flush_page_print_state;
> +
> +    cc->sysemu_ops->mon_info_pg_print_header(mon, &state);
> +
> +    /*
> +     * We must visit interior entries to get the hierarchy, but
> +     * can skip not present mappings
> +     */
> +    for_each_pte(cs, &compressing_iterator, &state, true, false);
> +
> +    /* Print last entry, if one present */
> +    cc->sysemu_ops->mon_flush_page_print_state(cs, &state);
> +}

Please don't add new HMP commands that don't have a QMP
equivalent.

This should be adding an 'x-query-pg' QMP command, which
returns HumanReadableText, and then call that from the HMP

There is guidance on this here:

  https://www.qemu.org/docs/master/devel/writing-monitor-commands.html#writing-a-debugging-aid-returning-unstructured-text

If you need more real examples, look at the various
'x-query-XXXX' commands in qapi/machine.json  and
their impl.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
  2024-06-07  6:09   ` Philippe Mathieu-Daudé
  2024-06-07  7:16   ` Daniel P. Berrangé
@ 2024-06-07 16:57   ` Richard Henderson
  2024-06-14 18:16     ` Don Porter
  2024-06-07 17:43   ` Richard Henderson
  3 siblings, 1 reply; 22+ messages in thread
From: Richard Henderson @ 2024-06-07 16:57 UTC (permalink / raw)
  To: Don Porter, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

On 6/6/24 07:02, Don Porter wrote:
> +/**
> + * _for_each_pte - recursive helper function
> + *
> + * @cs - CPU state
> + * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
> + *                                     pte.
> + *   * @cs - pass through cs
> + *   * @data - user-provided, opaque pointer
> + *   * @pte - current pte
> + *   * @vaddr_in - virtual address translated by pte
> + *   * @height - height in the tree of pte
> + * @data - user-provided, opaque pointer, passed to fn()
> + * @visit_interior_nodes - if true, call fn() on page table entries in
> + *                         interior nodes.  If false, only call fn() on page
> + *                         table entries in leaves.
> + * @visit_not_present - if true, call fn() on entries that are not present.
> + *                         if false, visit only present entries.
> + * @node - The physical address of the current page table radix tree node
> + * @vaddr_in - The virtual address bits translated in walking the page
> + *          table to node
> + * @height - The height of node in the radix tree
> + *
> + * height starts at the max and counts down.
> + * In a 4 level x86 page table, pml4e is level 4, pdpe is level 3,
> + *  pde is level 2, and pte is level 1
> + *
> + * Returns true on success, false on error.
> + */
> +static bool
> +_for_each_pte(CPUState *cs,

External identifiers beginning with "_[a-z]" are reserved.  While this is not external, 
and so is technically ok, it is still not a great idea.  Use for_each_pte_$SUFFIX, with 
SUFFIX as int, internal, recurse or something.


> +              int (*fn)(CPUState *cs, void *data, PTE_t *pte,
> +                        vaddr vaddr_in, int height, int offset),
> +              void *data, bool visit_interior_nodes,
> +              bool visit_not_present, hwaddr node,
> +              vaddr vaddr_in, int height)
> +{
> +    int ptes_per_node;
> +    int i;
> +
> +    assert(height > 0);
> +
> +    CPUClass *cc = CPU_GET_CLASS(cs);
> +
> +    if ((!cc->sysemu_ops->page_table_entries_per_node)
> +        || (!cc->sysemu_ops->get_pte)
> +        || (!cc->sysemu_ops->pte_present)
> +        || (!cc->sysemu_ops->pte_leaf)
> +        || (!cc->sysemu_ops->pte_child)) {
> +        return false;
> +    }

Probably move this check to the non-recursive function, since it only needs to be done 
once.  Following a check for page_table_root, should the rest be asserts?  I.e. if a 
target implements one hook, it must implement them all?

CPU_GET_CLASS is non-trivial: it is cached in cs->cc.
It seems like you should cache cs->cc->sysemu_ops in a local to avoid constantly reloading 
the pointer across the loop.

> +    ptes_per_node = cc->sysemu_ops->page_table_entries_per_node(cs, height);

It would be better if the entire page table format were computed by page_table_root. 
Perhaps fill in a whole structure rather than just height.

> +    for (i = 0; i < ptes_per_node; i++) {
> +        PTE_t pt_entry;
> +        vaddr vaddr_i;
> +        bool pte_present;
> +
> +        cc->sysemu_ops->get_pte(cs, node, i, height, &pt_entry, vaddr_in,
> +                                &vaddr_i, NULL);
> +        pte_present = cc->sysemu_ops->pte_present(cs, &pt_entry);
> +
> +        if (pte_present || visit_not_present) {
> +            if ((!pte_present) || cc->sysemu_ops->pte_leaf(cs, height,
> +                                                           &pt_entry)) {

Again, better to fill in a structure in get_pte rather than make 4 calls for vaddr, 
present, leaf, child.

Drop the unnecessary (), e.g. around !pte_present.

> +/* Intended to become a generic PTE type */
> +typedef union PTE {
> +    uint64_t pte64_t;
> +    uint32_t pte32_t;
> +} PTE_t;

While not yet supported by qemu, the latest Arm v9.4 document contains a 128-bit PTE.
Don't give the tag and the typedef different names.
Avoid "_t" as that's POSIX reserved namespace.

I know naming is hard, but there are many kinds of PTE in qemu -- better to use something 
more descriptive.  Perhaps "QemuPageWalkerPTE"?

> +bool for_each_pte(CPUState *cs,
> +                  int (*fn)(CPUState *cs, void *data, PTE_t *pte, vaddr vaddr,
> +                            int height, int offset), void *data,
> +                  bool visit_interior_nodes, bool visit_not_present);

Use a typedef for the callback function.
qemu_page_walker_for_each?

> +
> +
>   /**
>    * CPUDumpFlags:
>    * @CPU_DUMP_CODE:
> diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
> index 24d003fe04..eb16a1c3e2 100644
> --- a/include/hw/core/sysemu-cpu-ops.h
> +++ b/include/hw/core/sysemu-cpu-ops.h
> @@ -12,6 +12,39 @@
>   
>   #include "hw/core/cpu.h"
>   
> +/* Maximum supported page table height - currently x86 at 5 */
> +#define MAX_HEIGHT 5
> +
> +/*
> + * struct mem_print_state: Used by monitor in walking page tables.
> + */
> +struct mem_print_state {
> +    Monitor *mon;
> +    CPUArchState *env;
> +    int vaw, paw; /* VA and PA width in characters */
> +    int max_height;
> +    bool (*flusher)(CPUState *cs, struct mem_print_state *state);
> +    bool flush_interior; /* If false, only call flusher() on leaves */
> +    bool require_physical_contiguity;
> +    /*
> +     * The height at which we started accumulating ranges, i.e., the
> +     * next height we need to print once we hit the end of a
> +     * contiguous range.
> +     */
> +    int start_height;
> +    /*
> +     * For compressing contiguous ranges, track the
> +     * start and end of the range
> +     */
> +    hwaddr vstart[MAX_HEIGHT + 1]; /* Starting virt. addr. of open pte range */
> +    hwaddr vend[MAX_HEIGHT + 1]; /* Ending virtual address of open pte range */
> +    hwaddr pstart; /* Starting physical address of open pte range */
> +    hwaddr pend; /* Ending physical address of open pte range */
> +    int64_t ent[MAX_HEIGHT + 1]; /* PTE contents on current root->leaf path */

PTE_t (or the rename)?

> +    int offset[MAX_HEIGHT + 1]; /* PTE range starting offsets */
> +    int last_offset[MAX_HEIGHT + 1]; /* PTE range ending offsets */
> +};
> +
>   /*
>    * struct SysemuCPUOps: System operations specific to a CPU class
>    */
> @@ -87,6 +120,129 @@ typedef struct SysemuCPUOps {
>        */
>       const VMStateDescription *legacy_vmsd;
>   
> +    /**
> +     * page_table_root - Given a CPUState, return the physical address
> +     *                    of the current page table root, as well as
> +     *                    write the height of the tree into *height.
> +     *
> +     * @cs - CPU state
> +
> +     * @height - a pointer to an integer, to store the page table tree
> +     *           height
> +     *
> +     * Returns a hardware address on success.  Should not fail (i.e.,
> +     * caller is responsible to ensure that a page table is actually
> +     * present).
> +     */
> +    hwaddr (*page_table_root)(CPUState *cs, int *height);

There may be two page table roots on Arm, depending on the cpu state and the virtual 
address.  The shape of the two page tables are independent so...


> +
> +    /**
> +     * page_table_entries_per_node - Return the number of entries in a
> +     *                                   page table node for the CPU
> +     *                                   at a given height.
> +     *
> +     * @cs - CPU state
> +     * @height - height of the page table tree to query, where the leaves
> +     *          are 1.
> +     *
> +     * Returns a value greater than zero on success, -1 on error.
> +     */
> +    int (*page_table_entries_per_node)(CPUState *cs, int height);

... implementing this independently from page_table_root is not possible.


> +
> +    /**
> +     * @mon_init_page_table_iterator: Callback to configure a page table
> +     * iterator for use by a monitor function.
> +     * Returns true on success, false if not supported (e.g., no paging disabled
> +     * or not implemented on this CPU).
> +     */
> +    bool (*mon_init_page_table_iterator)(Monitor *mon,
> +                                         struct mem_print_state *state);

I don't understand the purpose of this one as a target-specific hook.

> +    /**
> +     * @flush_page_table_iterator_state: Prints the last entry,
> +     * if one is present.  Useful for iterators that aggregate information
> +     * across page table entries.
> +     */
> +    bool (*mon_flush_page_print_state)(CPUState *cs,
> +                                       struct mem_print_state *state);

Is this specific to "info pg" or not?
It appears to be, but the description suggests it is not.


r~


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code.
  2024-06-06 14:02 ` [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code Don Porter
  2024-06-07  6:12   ` Philippe Mathieu-Daudé
@ 2024-06-07 17:03   ` Richard Henderson
  2024-06-15 12:49     ` Don Porter
  1 sibling, 1 reply; 22+ messages in thread
From: Richard Henderson @ 2024-06-07 17:03 UTC (permalink / raw)
  To: Don Porter, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

On 6/6/24 07:02, Don Porter wrote:
> Signed-off-by: Don Porter <porter@cs.unc.edu>
> ---
>   target/i386/cpu.h                    |  42 ++
>   target/i386/helper.c                 | 515 +++++++++++++++++++++++++
>   target/i386/tcg/sysemu/excp_helper.c | 555 +--------------------------
>   3 files changed, 562 insertions(+), 550 deletions(-)

Why do you want to move this code out of a tcg specific file?
I think this is definitely wrong.


r~


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 6/6] Convert x86_mmu_translate() to use common code.
  2024-06-06 14:02 ` [PATCH v3 6/6] Convert x86_mmu_translate() to use common code Don Porter
@ 2024-06-07 17:28   ` Richard Henderson
  0 siblings, 0 replies; 22+ messages in thread
From: Richard Henderson @ 2024-06-07 17:28 UTC (permalink / raw)
  To: Don Porter, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

On 6/6/24 07:02, Don Porter wrote:
> Signed-off-by: Don Porter <porter@cs.unc.edu>
> ---
>   target/i386/arch_memory_mapping.c    |  44 +++-
>   target/i386/cpu.h                    |   5 +-
>   target/i386/helper.c                 | 374 +++++++--------------------
>   target/i386/tcg/sysemu/excp_helper.c |   2 +-
>   4 files changed, 129 insertions(+), 296 deletions(-)
> 
> diff --git a/target/i386/arch_memory_mapping.c b/target/i386/arch_memory_mapping.c
> index b52e98133c..bccd290b9f 100644
> --- a/target/i386/arch_memory_mapping.c
> +++ b/target/i386/arch_memory_mapping.c
> @@ -228,9 +228,38 @@ static void _mmu_decode_va_parameters(CPUState *cs, int height,
>   }
>   
>   /**
> - * get_pte - Copy the contents of the page table entry at node[i] into pt_entry.
> - *           Optionally, add the relevant bits to the virtual address in
> - *           vaddr_pte.
> + * x86_virtual_to_pte_index - Given a virtual address and height in
> + *       the page table radix tree, return the index that should be
> + *       used to look up the next page table entry (pte) in
> + *       translating an address.
> + *
> + * @cs - CPU state
> + * @vaddr - The virtual address to translate
> + * @height - height of node within the tree (leaves are 1, not 0).
> + *
> + * Example: In 32-bit x86 page tables, the virtual address is split
> + * into 10 bits at height 2, 10 bits at height 1, and 12 offset bits.
> + * So a call with VA and height 2 would return the first 10 bits of va,
> + * right shifted by 22.
> + */
> +
> +int x86_virtual_to_pte_index(CPUState *cs, target_ulong vaddr, int height)
> +{
> +    int shift = 0;
> +    int width = 0;
> +    int mask = 0;
> +
> +    _mmu_decode_va_parameters(cs, height, &shift, &width);
> +
> +    mask = (1 << width) - 1;
> +
> +    return (vaddr >> shift) & mask;
> +}
> +
> +/**
> + * x86_get_pte - Copy the contents of the page table entry at node[i]
> + *               into pt_entry.  Optionally, add the relevant bits to
> + *               the virtual address in vaddr_pte.
>    *
>    * @cs - CPU state
>    * @node - physical address of the current page table node
> @@ -249,7 +278,6 @@ void
>   x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
>               PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
>               hwaddr *pte_paddr)
> -
>   {
>       X86CPU *cpu = X86_CPU(cs);
>       CPUX86State *env = &cpu->env;

Some fixes to be merged back into previous patches.


> --- a/target/i386/helper.c
> +++ b/target/i386/helper.c
> @@ -308,7 +308,8 @@ static bool ptw_translate(X86PTETranslate *inout, hwaddr addr, uint64_t ra)
>   
>   static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
>                                 X86TranslateResult *out,
> -                              X86TranslateFault *err, uint64_t ra)
> +                              X86TranslateFault *err, uint64_t ra,
> +                              bool read_only)
>   {
>       const target_ulong addr = in->addr;
>       const int pg_mode = in->pg_mode;
> @@ -324,6 +325,10 @@ static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
>       uint32_t pkr;
>       int page_size;
>       int error_code;
> +    CPUState *cs = env_cpu(env);
> +    int height;
> +    bool pae_enabled = env->cr[4] & CR4_PAE_MASK;
> +    bool long_mode_enabled = env->hflags & HF_LMA_MASK;

Incorrect.  These bits are in pg_mode...

> -    if (pg_mode & PG_MODE_PAE) {
> -#ifdef TARGET_X86_64
> -        if (pg_mode & PG_MODE_LMA) {
> -            if (pg_mode & PG_MODE_LA57) {

... like so.

> +    /*
> +     * ptep is really an accumulator for the permission bits.
> +     * Thus, the xor-ing totally trashes the high bits, and that is
> +     * ok - we only care about the low ones.
> +     */
> +    ptep = PG_NX_MASK | PG_USER_MASK | PG_RW_MASK;
> +    hwaddr pt_node = x86_page_table_root(cs, &height);
>   
> +    /* Special case for PAE paging */
> +    if (height == 3 && pg_mode & PG_MODE_PAE) {
> +        rsvd_mask |= PG_HI_USER_MASK;
> +    }
>   
> +    int i = height;
> +    do {
> +        int index = x86_virtual_to_pte_index(cs, addr, i);
> +        PTE_t pt_entry;
> +        uint64_t my_rsvd_mask = rsvd_mask;
> +
> +        x86_get_pte(cs, pt_node, index, i, &pt_entry, 0, NULL, &pte_addr);
> +        /* Check that we can access the page table entry */
>           if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>               return false;
>           }

You "get" the pte and only afterward you check that it is accessible.
I think you've missed the point of ptw_translate.


> +
> +    restart:
> +        if (!x86_pte_present(cs, &pt_entry)) {
>               goto do_fault;
>           }
>   
> +        /* For height > 3, check and reject PSE mask */
> +        if (i > 3) {
> +            my_rsvd_mask |= PG_PSE_MASK;
>           }
> +
> +        if (x86_pte_check_bits(cs, &pt_entry, my_rsvd_mask)) {
>               goto do_fault_rsvd;
>           }

Surely the reserved bit checking should be part of the generic walker.
Is there some reason those should be ignored for "info pg", for example?

> +        if (long_mode_enabled) {
> +            pte = pt_entry.pte64_t;
> +        } else {
> +            pte = pt_entry.pte32_t;
>           }

This is pretty ugly.  Ignoring 128-bit ptes for the moment, surely we should just 
zero-extend 32-bit ptes into the 64-bit entry data slot right from the start.

> +            break; /* goto do_check_protect; */

What's with the comment.

> +        if ((!read_only) &&
> +            (!ptw_setl(&pte_trans, pte, PG_ACCESSED_MASK))) {

Again with the extra ().


r~


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
                     ` (2 preceding siblings ...)
  2024-06-07 16:57   ` Richard Henderson
@ 2024-06-07 17:43   ` Richard Henderson
  2024-06-14 21:14     ` Don Porter
  3 siblings, 1 reply; 22+ messages in thread
From: Richard Henderson @ 2024-06-07 17:43 UTC (permalink / raw)
  To: Don Porter, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

On 6/6/24 07:02, Don Porter wrote:
> +/**
> + * get_pte - Copy the contents of the page table entry at node[i] into pt_entry.
> + *           Optionally, add the relevant bits to the virtual address in
> + *           vaddr_pte.
> + *
> + * @cs - CPU state
> + * @node - physical address of the current page table node
> + * @i - index (in page table entries, not bytes) of the page table
> + *      entry, within node
> + * @height - height of node within the tree (leaves are 1, not 0)
> + * @pt_entry - Poiter to a PTE_t, stores the contents of the page table entry
> + * @vaddr_parent - The virtual address bits already translated in walking the
> + *                 page table to node.  Optional: only used if vaddr_pte is set.
> + * @vaddr_pte - Optional pointer to a variable storing the virtual address bits
> + *              translated by node[i].
> + * @pte_paddr - Pointer to the physical address of the PTE within node.
> + *              Optional parameter.
> + */
> +void
> +x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
> +            PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
> +            hwaddr *pte_paddr)
> +
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +    int32_t a20_mask = x86_get_a20_mask(env);
> +    hwaddr pte;
> +
> +    if (env->hflags & HF_LMA_MASK) {
> +        /* 64 bit */
> +        int pte_width = 8;
> +        pte = (node + (i * pte_width)) & a20_mask;
> +        pt_entry->pte64_t = address_space_ldq(cs->as, pte,
> +                                              MEMTXATTRS_UNSPECIFIED, NULL);
> +    } else {
> +        /* 32 bit */
> +        int pte_width = 4;
> +        pte = (node + (i * pte_width)) & a20_mask;
> +        pt_entry->pte32_t = address_space_ldl(cs->as, pte,
> +                                              MEMTXATTRS_UNSPECIFIED, NULL);
> +    }
> +
> +    if (vaddr_pte) {
> +        int shift = 0;
> +        _mmu_decode_va_parameters(cs, height, &shift, NULL);
> +        *vaddr_pte = vaddr_parent | ((i & 0x1ffULL) << shift);
> +    }
> +
> +    if (pte_paddr) {
> +        *pte_paddr = pte;
> +    }
> +}

This fails to recurse with nested page tables, which definitely breaks the TCG walker.


r~



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 2/6] Convert 'info tlb' to use generic iterator
  2024-06-07  6:02   ` Philippe Mathieu-Daudé
@ 2024-06-10  9:13     ` Daniel P. Berrangé
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel P. Berrangé @ 2024-06-10  9:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Don Porter, qemu-devel, dave, peter.maydell, nadav.amit,
	richard.henderson

On Fri, Jun 07, 2024 at 08:02:51AM +0200, Philippe Mathieu-Daudé wrote:
> Hi Don,
> 
> (Cc'ing Daniel for HumanReadableText)
> 
> On 6/6/24 16:02, Don Porter wrote:
> > Signed-off-by: Don Porter <porter@cs.unc.edu>
> > ---
> >   include/hw/core/sysemu-cpu-ops.h |   7 +
> >   monitor/hmp-cmds-target.c        |   1 +
> >   target/i386/cpu.h                |   2 +
> >   target/i386/monitor.c            | 217 ++++++-------------------------
> >   4 files changed, 53 insertions(+), 174 deletions(-)
> > 
> > diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
> > index eb16a1c3e2..bf3de3e004 100644
> > --- a/include/hw/core/sysemu-cpu-ops.h
> > +++ b/include/hw/core/sysemu-cpu-ops.h
> > @@ -243,6 +243,13 @@ typedef struct SysemuCPUOps {
> >       bool (*mon_flush_page_print_state)(CPUState *cs,
> >                                          struct mem_print_state *state);
> > +    /**
> > +     * @mon_print_pte: Hook called by the monitor to print a page
> > +     * table entry at address addr, with contents pte.
> > +     */
> > +    void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
> > +                           hwaddr pte);
> 
> IMO the SysemuCPUOps prototype should not use the monitor and return
> a HumanReadableText:
> 
>       HumanReadableText *(*mon_print_pte)(CPUArchState *env,
>                                           hwaddr addr, hwaddr pte);
> 
> Then define a QMP handler, itself registered to the monitor using
> monitor_register_hmp_info_hrt().

IIUC, this method is called repeatedly from an iterator, so I
think you would need to be instead relpacing the 'Monitor *mon'
parmeter with "GString *str", and print into that buffer.
The HumanReadableText would be created from teh GString higher
up in the call path.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/6] Convert 'info mem' to use generic iterator
  2024-06-06 14:02 ` [PATCH v3 3/6] Convert 'info mem' " Don Porter
@ 2024-06-10  9:15   ` Daniel P. Berrangé
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel P. Berrangé @ 2024-06-10  9:15 UTC (permalink / raw)
  To: Don Porter
  Cc: qemu-devel, dave, peter.maydell, nadav.amit, richard.henderson,
	philmd

On Thu, Jun 06, 2024 at 10:02:50AM -0400, Don Porter wrote:
> Signed-off-by: Don Porter <porter@cs.unc.edu>
> ---
>  include/hw/core/sysemu-cpu-ops.h |   6 +
>  include/monitor/monitor.h        |   4 +
>  monitor/hmp-cmds-target.c        |   5 +-
>  target/i386/cpu.c                |   1 +
>  target/i386/cpu.h                |   1 +
>  target/i386/monitor.c            | 354 ++++---------------------------
>  6 files changed, 60 insertions(+), 311 deletions(-)
> 
> diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
> index bf3de3e004..3bef129460 100644
> --- a/include/hw/core/sysemu-cpu-ops.h
> +++ b/include/hw/core/sysemu-cpu-ops.h
> @@ -250,6 +250,12 @@ typedef struct SysemuCPUOps {
>      void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
>                             hwaddr pte);
>  
> +    /**
> +     * @mon_print_mem: Hook called by the monitor to print a range
> +     * of memory mappings in 'info mem'
> +     */
> +    bool (*mon_print_mem)(CPUState *cs, struct mem_print_state *state);
> +
Similar to the suggestion on the previus patch. I'd suggest this method
gains a 'GString *str' parameter, which it will print into. Then add
a QMP command that returns HumandReadableText, and call that from the
HMP command. This completely separates the architecture code from the
monitor APIs.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-07  7:16   ` Daniel P. Berrangé
@ 2024-06-11 18:49     ` Don Porter
  0 siblings, 0 replies; 22+ messages in thread
From: Don Porter @ 2024-06-11 18:49 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, dave, peter.maydell, nadav.amit, richard.henderson,
	philmd

On 6/7/24 3:16 AM, Daniel P. Berrangé wrote:
> On Thu, Jun 06, 2024 at 10:02:48AM -0400, Don Porter wrote:
> Please don't add new HMP commands that don't have a QMP
> equivalent.
>
> This should be adding an 'x-query-pg' QMP command, which
> returns HumanReadableText, and then call that from the HMP
>
> There is guidance on this here:
>
>    https://www.qemu.org/docs/master/devel/writing-monitor-commands.html#writing-a-debugging-aid-returning-unstructured-text
>
> If you need more real examples, look at the various
> 'x-query-XXXX' commands in qapi/machine.json  and
> their impl.

Thank you both for the pointers.  This makes sense to me;
outputting a string is much cleaner.  Will implement in v4...

-dp



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-07 16:57   ` Richard Henderson
@ 2024-06-14 18:16     ` Don Porter
  0 siblings, 0 replies; 22+ messages in thread
From: Don Porter @ 2024-06-14 18:16 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

Hi Richard,

Thank you for all of the helpful comments!  v4 will be much cleaner as a 
result.

A few follow-ups inline.

On 6/7/24 12:57 PM, Richard Henderson wrote:
> On 6/6/24 07:02, Don Porter wrote:
>> +
>> +    /**
>> +     * @mon_init_page_table_iterator: Callback to configure a page 
>> table
>> +     * iterator for use by a monitor function.
>> +     * Returns true on success, false if not supported (e.g., no 
>> paging disabled
>> +     * or not implemented on this CPU).
>> +     */
>> +    bool (*mon_init_page_table_iterator)(Monitor *mon,
>> +                                         struct mem_print_state 
>> *state);
>
> I don't understand the purpose of this one as a target-specific hook.

The iterator needs some architecture-specific initialization, such as 
getting the virtual
and physical address width.

>
>> +    /**
>> +     * @flush_page_table_iterator_state: Prints the last entry,
>> +     * if one is present.  Useful for iterators that aggregate 
>> information
>> +     * across page table entries.
>> +     */
>> +    bool (*mon_flush_page_print_state)(CPUState *cs,
>> +                                       struct mem_print_state *state);
>
> Is this specific to "info pg" or not?
> It appears to be, but the description suggests it is not.

It is.

Thank you,

Don





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-07 17:43   ` Richard Henderson
@ 2024-06-14 21:14     ` Don Porter
  2024-06-15 15:34       ` Richard Henderson
  0 siblings, 1 reply; 22+ messages in thread
From: Don Porter @ 2024-06-14 21:14 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

On 6/7/24 1:43 PM, Richard Henderson wrote:
> On 6/6/24 07:02, Don Porter wrote:
>> +/**
>> + * get_pte - Copy the contents of the page table entry at node[i] 
>> into pt_entry.
>> + *           Optionally, add the relevant bits to the virtual 
>> address in
>> + *           vaddr_pte.
>> + *
>> + * @cs - CPU state
>> + * @node - physical address of the current page table node
>> + * @i - index (in page table entries, not bytes) of the page table
>> + *      entry, within node
>> + * @height - height of node within the tree (leaves are 1, not 0)
>> + * @pt_entry - Poiter to a PTE_t, stores the contents of the page 
>> table entry
>> + * @vaddr_parent - The virtual address bits already translated in 
>> walking the
>> + *                 page table to node.  Optional: only used if 
>> vaddr_pte is set.
>> + * @vaddr_pte - Optional pointer to a variable storing the virtual 
>> address bits
>> + *              translated by node[i].
>> + * @pte_paddr - Pointer to the physical address of the PTE within node.
>> + *              Optional parameter.
>> + */
>> +void
>> +x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
>> +            PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
>> +            hwaddr *pte_paddr)
>> +
>> +{
>> +    X86CPU *cpu = X86_CPU(cs);
>> +    CPUX86State *env = &cpu->env;
>> +    int32_t a20_mask = x86_get_a20_mask(env);
>> +    hwaddr pte;
>> +
>> +    if (env->hflags & HF_LMA_MASK) {
>> +        /* 64 bit */
>> +        int pte_width = 8;
>> +        pte = (node + (i * pte_width)) & a20_mask;
>> +        pt_entry->pte64_t = address_space_ldq(cs->as, pte,
>> + MEMTXATTRS_UNSPECIFIED, NULL);
>> +    } else {
>> +        /* 32 bit */
>> +        int pte_width = 4;
>> +        pte = (node + (i * pte_width)) & a20_mask;
>> +        pt_entry->pte32_t = address_space_ldl(cs->as, pte,
>> + MEMTXATTRS_UNSPECIFIED, NULL);
>> +    }
>> +
>> +    if (vaddr_pte) {
>> +        int shift = 0;
>> +        _mmu_decode_va_parameters(cs, height, &shift, NULL);
>> +        *vaddr_pte = vaddr_parent | ((i & 0x1ffULL) << shift);
>> +    }
>> +
>> +    if (pte_paddr) {
>> +        *pte_paddr = pte;
>> +    }
>> +}
>
> This fails to recurse with nested page tables, which definitely breaks 
> the TCG walker.

Hi Richard,

Thank you again for all of the advice and feedback.

What do you think the correct semantics should be for nested paging?

My understanding is that the current 'info mem' command on x86 does not 
recur on nested page tables, but this does seem like a useful 
extension.  Same for 'info tlb'.

In the case of 'info pg', I might want to print each page table 
separately, rather than a combined/shadow view.

My reading of the tcg code is that it also walks the guest page tables 
first, then uses probe_access_full() to query the guest->host physical 
translation, but I may be misunderstanding the code: 
https://github.com/qemu/qemu/blob/046a64b9801343e2e89eef10c7a48eec8d8c0d4f/target/i386/tcg/sysemu/excp_helper.c#L432

In patch 6 of the series, I replace the chunk of mmu_translate() in tcg 
that walks the guest page tables, but leave the portion alone that does 
the nested page walk.

I'm open to implementing a nested walker, but it might be better not to 
add more functionality/changes to this patch series.

Thank you,

Don



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code.
  2024-06-07 17:03   ` Richard Henderson
@ 2024-06-15 12:49     ` Don Porter
  0 siblings, 0 replies; 22+ messages in thread
From: Don Porter @ 2024-06-15 12:49 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

On 6/7/24 1:03 PM, Richard Henderson wrote:
> On 6/6/24 07:02, Don Porter wrote:
>> Signed-off-by: Don Porter <porter@cs.unc.edu>
>> ---
>>   target/i386/cpu.h                    |  42 ++
>>   target/i386/helper.c                 | 515 +++++++++++++++++++++++++
>>   target/i386/tcg/sysemu/excp_helper.c | 555 +--------------------------
>>   3 files changed, 562 insertions(+), 550 deletions(-)
>
> Why do you want to move this code out of a tcg specific file?
> I think this is definitely wrong. 

I want to share the get_physical_address() code with 
x86_cpu_get_phys_page_attrs_debug().

I assume that newly shared code should migrate into a common file, but 
happy to adopt another strategy.

Best,

Don



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables
  2024-06-14 21:14     ` Don Porter
@ 2024-06-15 15:34       ` Richard Henderson
  0 siblings, 0 replies; 22+ messages in thread
From: Richard Henderson @ 2024-06-15 15:34 UTC (permalink / raw)
  To: Don Porter, qemu-devel; +Cc: dave, peter.maydell, nadav.amit, philmd

On 6/14/24 14:14, Don Porter wrote:
> On 6/7/24 1:43 PM, Richard Henderson wrote:
>> On 6/6/24 07:02, Don Porter wrote:
>>> +/**
>>> + * get_pte - Copy the contents of the page table entry at node[i] into pt_entry.
>>> + *           Optionally, add the relevant bits to the virtual address in
>>> + *           vaddr_pte.
>>> + *
>>> + * @cs - CPU state
>>> + * @node - physical address of the current page table node
>>> + * @i - index (in page table entries, not bytes) of the page table
>>> + *      entry, within node
>>> + * @height - height of node within the tree (leaves are 1, not 0)
>>> + * @pt_entry - Poiter to a PTE_t, stores the contents of the page table entry
>>> + * @vaddr_parent - The virtual address bits already translated in walking the
>>> + *                 page table to node.  Optional: only used if vaddr_pte is set.
>>> + * @vaddr_pte - Optional pointer to a variable storing the virtual address bits
>>> + *              translated by node[i].
>>> + * @pte_paddr - Pointer to the physical address of the PTE within node.
>>> + *              Optional parameter.
>>> + */
>>> +void
>>> +x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
>>> +            PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
>>> +            hwaddr *pte_paddr)
>>> +
>>> +{
>>> +    X86CPU *cpu = X86_CPU(cs);
>>> +    CPUX86State *env = &cpu->env;
>>> +    int32_t a20_mask = x86_get_a20_mask(env);
>>> +    hwaddr pte;
>>> +
>>> +    if (env->hflags & HF_LMA_MASK) {
>>> +        /* 64 bit */
>>> +        int pte_width = 8;
>>> +        pte = (node + (i * pte_width)) & a20_mask;
>>> +        pt_entry->pte64_t = address_space_ldq(cs->as, pte,
>>> + MEMTXATTRS_UNSPECIFIED, NULL);
>>> +    } else {
>>> +        /* 32 bit */
>>> +        int pte_width = 4;
>>> +        pte = (node + (i * pte_width)) & a20_mask;
>>> +        pt_entry->pte32_t = address_space_ldl(cs->as, pte,
>>> + MEMTXATTRS_UNSPECIFIED, NULL);
>>> +    }
>>> +
>>> +    if (vaddr_pte) {
>>> +        int shift = 0;
>>> +        _mmu_decode_va_parameters(cs, height, &shift, NULL);
>>> +        *vaddr_pte = vaddr_parent | ((i & 0x1ffULL) << shift);
>>> +    }
>>> +
>>> +    if (pte_paddr) {
>>> +        *pte_paddr = pte;
>>> +    }
>>> +}
>>
>> This fails to recurse with nested page tables, which definitely breaks the TCG walker.
> 
> Hi Richard,
> 
> Thank you again for all of the advice and feedback.
> 
> What do you think the correct semantics should be for nested paging?

That it works like hardware does.

> My understanding is that the current 'info mem' command on x86 does not recur on nested 
> page tables, but this does seem like a useful extension.  Same for 'info tlb'.

My understanding is that the two existing hmp commands are broken, and that the *only* 
correct page table walker is get_physical_address.

> In the case of 'info pg', I might want to print each page table separately, rather than a 
> combined/shadow view.

Without doing the nested page translation for the page tables themselves, you're not 
looking at the correct page from which to read the PTEs, and reading garbage.

> My reading of the tcg code is that it also walks the guest page tables first, then uses 
> probe_access_full() to query the guest->host physical translation, but I may be 
> misunderstanding the code: 
> https://github.com/qemu/qemu/blob/046a64b9801343e2e89eef10c7a48eec8d8c0d4f/target/i386/tcg/sysemu/excp_helper.c#L432

I'm talking about this, for the page tables themselves:

https://github.com/qemu/qemu/blob/046a64b9801343e2e89eef10c7a48eec8d8c0d4f/target/i386/tcg/sysemu/excp_helper.c#L69

A better example is for ARM, which has had its debug page table walker folded in:

https://github.com/qemu/qemu/blob/046a64b9801343e2e89eef10c7a48eec8d8c0d4f/target/arm/ptw.c#L553

If in_debug, don't use probe_access_full_mmu as a cache (and host page resolution), but 
explicitly recurse on get_phys_addr_gpc.


r~


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-06-15 15:35 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-06 14:02 [PATCH v3 0/6] Rework x86 page table walks Don Porter
2024-06-06 14:02 ` [PATCH v3 1/6] Add an "info pg" command that prints the current page tables Don Porter
2024-06-07  6:09   ` Philippe Mathieu-Daudé
2024-06-07  7:16   ` Daniel P. Berrangé
2024-06-11 18:49     ` Don Porter
2024-06-07 16:57   ` Richard Henderson
2024-06-14 18:16     ` Don Porter
2024-06-07 17:43   ` Richard Henderson
2024-06-14 21:14     ` Don Porter
2024-06-15 15:34       ` Richard Henderson
2024-06-06 14:02 ` [PATCH v3 2/6] Convert 'info tlb' to use generic iterator Don Porter
2024-06-07  6:02   ` Philippe Mathieu-Daudé
2024-06-10  9:13     ` Daniel P. Berrangé
2024-06-06 14:02 ` [PATCH v3 3/6] Convert 'info mem' " Don Porter
2024-06-10  9:15   ` Daniel P. Berrangé
2024-06-06 14:02 ` [PATCH v3 4/6] Convert x86_cpu_get_memory_mapping() to use generic iterators Don Porter
2024-06-06 14:02 ` [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code Don Porter
2024-06-07  6:12   ` Philippe Mathieu-Daudé
2024-06-07 17:03   ` Richard Henderson
2024-06-15 12:49     ` Don Porter
2024-06-06 14:02 ` [PATCH v3 6/6] Convert x86_mmu_translate() to use common code Don Porter
2024-06-07 17:28   ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).