* [Qemu-devel] [PATCH v2 0/7] making address spaces 64 bit wide
From: Michael S. Tsirkin @ 2013-11-13 19:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
At the moment, exec ignores the high bits of each address, for efficiency.
This is incorrect: devices can do full 64 bit DMA; it's only the CPU that
is limited by the target address space.
Resolving such addresses can actually corrupt the pagetables, so using
full 64 bit addresses is called for.
However, using full 64 bit addresses was clocked at a 12% performance hit
on a microbenchmark.
To solve this, teach the pagetables to skip bits at any level, not just
the lowest one.
This solves the performance problem (only one line of code changed on the
data path). In fact we even gain a bit of speed:
Before:
portio-no-eventfd:pci-io 3225
After:
portio-no-eventfd:pci-io 3123
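To illustrate the idea (a minimal sketch with made-up names, not the
code from these patches): each non-leaf entry records how many levels
the walk may jump over, so sparsely populated upper levels collapse
into a single dereference on the data path.

#define EX_L2_BITS 9
#define EX_L2_SIZE (1 << EX_L2_BITS)

struct ex_entry {
    unsigned skip;  /* levels to jump over; 0 means ptr names a leaf */
    unsigned ptr;   /* next node index, or leaf (section) index */
};

typedef struct ex_entry ExNode[EX_L2_SIZE];

static unsigned ex_lookup(struct ex_entry lp, ExNode *nodes,
                          unsigned long long index, int levels)
{
    int i;

    /* Each iteration consumes lp.skip levels in one dereference;
     * the real code also handles unpopulated (NIL) entries. */
    for (i = levels; lp.skip && (i -= lp.skip) >= 0;) {
        lp = nodes[lp.ptr][(index >> (i * EX_L2_BITS)) & (EX_L2_SIZE - 1)];
    }
    return lp.ptr;
}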
Changes from v1:
- lots of bugfixes: if you read v1 you'll have to re-read, although the
  basic algorithm is still the same
- minor tweaks suggested by Eric Blake
Michael S. Tsirkin (5):
exec: replace leaf with skip
exec: extend skip field to 6 bit, page entry to 32 bit
exec: pass hw address to phys_page_find
exec: memory radix tree page level compression
exec: reduce L2_PAGE_SIZE
Paolo Bonzini (2):
split definitions for exec.c and translate-all.c radix trees
exec: make address spaces 64-bit wide
translate-all.h | 7 ---
exec.c | 135 +++++++++++++++++++++++++++++++++++++++++++++-----------
translate-all.c | 32 ++++++++------
3 files changed, 127 insertions(+), 47 deletions(-)
--
MST
* [Qemu-devel] [PATCH v2 1/7] split definitions for exec.c and translate-all.c radix trees
From: Michael S. Tsirkin @ 2013-11-13 19:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
From: Paolo Bonzini <pbonzini@redhat.com>
The exec.c and translate-all.c radix trees are quite different, and
the exec.c one in particular is not limited to the CPU---it can also be
used by devices that do DMA, and in that case the address space is not
limited to TARGET_PHYS_ADDR_SPACE_BITS bits.
We want to make exec.c's radix trees 64-bit wide. As a first step,
stop sharing the constants between exec.c and translate-all.c.
exec.c gets P_L2_* constants, translate-all.c gets V_L2_*, for
consistency with the existing V_L1_* symbols. Though actually
in the softmmu case translate-all.c is also indexed by physical
addresses...
This patch has no semantic change.
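As a sanity check of the level arithmetic (example values of mine, not
part of the patch): a 40-bit physical address space with 12-bit pages
and 10-bit levels needs three levels to cover the remaining 28 bits of
page index:

#define EX_ADDR_BITS 40
#define EX_PAGE_BITS 12
#define EX_L2_BITS   10
/* ((40 - 12 - 1) / 10) + 1 == 3 */
_Static_assert((((EX_ADDR_BITS - EX_PAGE_BITS - 1) / EX_L2_BITS) + 1) == 3,
               "three 10-bit levels cover 28 bits of page index");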
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
translate-all.h | 7 -------
exec.c | 29 +++++++++++++++++++++--------
translate-all.c | 32 ++++++++++++++++++--------------
3 files changed, 39 insertions(+), 29 deletions(-)
diff --git a/translate-all.h b/translate-all.h
index 5c38819..f7e5932 100644
--- a/translate-all.h
+++ b/translate-all.h
@@ -19,13 +19,6 @@
#ifndef TRANSLATE_ALL_H
#define TRANSLATE_ALL_H
-/* Size of the L2 (and L3, etc) page tables. */
-#define L2_BITS 10
-#define L2_SIZE (1 << L2_BITS)
-
-#define P_L2_LEVELS \
- (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
-
/* translate-all.c */
void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
void cpu_unlink_tb(CPUState *cpu);
diff --git a/exec.c b/exec.c
index b453713..9e2fc4b 100644
--- a/exec.c
+++ b/exec.c
@@ -88,7 +88,15 @@ struct PhysPageEntry {
uint16_t ptr : 15;
};
-typedef PhysPageEntry Node[L2_SIZE];
+/* Size of the L2 (and L3, etc) page tables. */
+#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
+
+#define P_L2_BITS 10
+#define P_L2_SIZE (1 << P_L2_BITS)
+
+#define P_L2_LEVELS (((ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / P_L2_BITS) + 1)
+
+typedef PhysPageEntry Node[P_L2_SIZE];
struct AddressSpaceDispatch {
/* This is a multi-level map on the physical address space.
@@ -155,7 +163,7 @@ static uint16_t phys_map_node_alloc(void)
ret = next_map.nodes_nb++;
assert(ret != PHYS_MAP_NODE_NIL);
assert(ret != next_map.nodes_nb_alloc);
- for (i = 0; i < L2_SIZE; ++i) {
+ for (i = 0; i < P_L2_SIZE; ++i) {
next_map.nodes[ret][i].is_leaf = 0;
next_map.nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
}
@@ -168,13 +176,13 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
{
PhysPageEntry *p;
int i;
- hwaddr step = (hwaddr)1 << (level * L2_BITS);
+ hwaddr step = (hwaddr)1 << (level * P_L2_BITS);
if (!lp->is_leaf && lp->ptr == PHYS_MAP_NODE_NIL) {
lp->ptr = phys_map_node_alloc();
p = next_map.nodes[lp->ptr];
if (level == 0) {
- for (i = 0; i < L2_SIZE; i++) {
+ for (i = 0; i < P_L2_SIZE; i++) {
p[i].is_leaf = 1;
p[i].ptr = PHYS_SECTION_UNASSIGNED;
}
@@ -182,9 +190,9 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
} else {
p = next_map.nodes[lp->ptr];
}
- lp = &p[(*index >> (level * L2_BITS)) & (L2_SIZE - 1)];
+ lp = &p[(*index >> (level * P_L2_BITS)) & (P_L2_SIZE - 1)];
- while (*nb && lp < &p[L2_SIZE]) {
+ while (*nb && lp < &p[P_L2_SIZE]) {
if ((*index & (step - 1)) == 0 && *nb >= step) {
lp->is_leaf = true;
lp->ptr = leaf;
@@ -218,7 +226,7 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
return &sections[PHYS_SECTION_UNASSIGNED];
}
p = nodes[lp.ptr];
- lp = p[(index >> (i * L2_BITS)) & (L2_SIZE - 1)];
+ lp = p[(index >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
}
return &sections[lp.ptr];
}
@@ -1741,7 +1749,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
static void memory_map_init(void)
{
system_memory = g_malloc(sizeof(*system_memory));
- memory_region_init(system_memory, NULL, "system", INT64_MAX);
+
+ assert(ADDR_SPACE_BITS <= 64);
+
+ memory_region_init(system_memory, NULL, "system",
+ ADDR_SPACE_BITS == 64 ?
+ UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
address_space_init(&address_space_memory, system_memory, "memory");
system_io = g_malloc(sizeof(*system_io));
diff --git a/translate-all.c b/translate-all.c
index aeda54d..1c63d78 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -96,12 +96,16 @@ typedef struct PageDesc {
# define L1_MAP_ADDR_SPACE_BITS TARGET_VIRT_ADDR_SPACE_BITS
#endif
+/* Size of the L2 (and L3, etc) page tables. */
+#define V_L2_BITS 10
+#define V_L2_SIZE (1 << V_L2_BITS)
+
/* The bits remaining after N lower levels of page tables. */
#define V_L1_BITS_REM \
- ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % L2_BITS)
+ ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % V_L2_BITS)
#if V_L1_BITS_REM < 4
-#define V_L1_BITS (V_L1_BITS_REM + L2_BITS)
+#define V_L1_BITS (V_L1_BITS_REM + V_L2_BITS)
#else
#define V_L1_BITS V_L1_BITS_REM
#endif
@@ -395,18 +399,18 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
/* Level 2..N-1. */
- for (i = V_L1_SHIFT / L2_BITS - 1; i > 0; i--) {
+ for (i = V_L1_SHIFT / V_L2_BITS - 1; i > 0; i--) {
void **p = *lp;
if (p == NULL) {
if (!alloc) {
return NULL;
}
- ALLOC(p, sizeof(void *) * L2_SIZE);
+ ALLOC(p, sizeof(void *) * V_L2_SIZE);
*lp = p;
}
- lp = p + ((index >> (i * L2_BITS)) & (L2_SIZE - 1));
+ lp = p + ((index >> (i * V_L2_BITS)) & (V_L2_SIZE - 1));
}
pd = *lp;
@@ -414,13 +418,13 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
if (!alloc) {
return NULL;
}
- ALLOC(pd, sizeof(PageDesc) * L2_SIZE);
+ ALLOC(pd, sizeof(PageDesc) * V_L2_SIZE);
*lp = pd;
}
#undef ALLOC
- return pd + (index & (L2_SIZE - 1));
+ return pd + (index & (V_L2_SIZE - 1));
}
static inline PageDesc *page_find(tb_page_addr_t index)
@@ -655,14 +659,14 @@ static void page_flush_tb_1(int level, void **lp)
if (level == 0) {
PageDesc *pd = *lp;
- for (i = 0; i < L2_SIZE; ++i) {
+ for (i = 0; i < V_L2_SIZE; ++i) {
pd[i].first_tb = NULL;
invalidate_page_bitmap(pd + i);
}
} else {
void **pp = *lp;
- for (i = 0; i < L2_SIZE; ++i) {
+ for (i = 0; i < V_L2_SIZE; ++i) {
page_flush_tb_1(level - 1, pp + i);
}
}
@@ -673,7 +677,7 @@ static void page_flush_tb(void)
int i;
for (i = 0; i < V_L1_SIZE; i++) {
- page_flush_tb_1(V_L1_SHIFT / L2_BITS - 1, l1_map + i);
+ page_flush_tb_1(V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
}
}
@@ -1600,7 +1604,7 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
if (level == 0) {
PageDesc *pd = *lp;
- for (i = 0; i < L2_SIZE; ++i) {
+ for (i = 0; i < V_L2_SIZE; ++i) {
int prot = pd[i].flags;
pa = base | (i << TARGET_PAGE_BITS);
@@ -1614,9 +1618,9 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
} else {
void **pp = *lp;
- for (i = 0; i < L2_SIZE; ++i) {
+ for (i = 0; i < V_L2_SIZE; ++i) {
pa = base | ((abi_ulong)i <<
- (TARGET_PAGE_BITS + L2_BITS * level));
+ (TARGET_PAGE_BITS + V_L2_BITS * level));
rc = walk_memory_regions_1(data, pa, level - 1, pp + i);
if (rc != 0) {
return rc;
@@ -1639,7 +1643,7 @@ int walk_memory_regions(void *priv, walk_memory_regions_fn fn)
for (i = 0; i < V_L1_SIZE; i++) {
int rc = walk_memory_regions_1(&data, (abi_ulong)i << V_L1_SHIFT,
- V_L1_SHIFT / L2_BITS - 1, l1_map + i);
+ V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
if (rc != 0) {
return rc;
--
MST
* [Qemu-devel] [PATCH v2 2/7] exec: replace leaf with skip
From: Michael S. Tsirkin @ 2013-11-13 19:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
In preparation for dynamic radix tree depth support, rename the is_leaf
field to skip, which tells us how many bits to skip to the next level.
It is set to 0 for a leaf.
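In other words (a paraphrase, not text from the patch):

/* old encoding                      new encoding
 * is_leaf == 1: ptr is a section    skip == 0: ptr is a section (leaf)
 * is_leaf == 0: ptr is a node       skip == n > 0: ptr is a node reached
 *                                   by jumping n levels (always 1 until
 *                                   a later patch widens the field) */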
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
exec.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/exec.c b/exec.c
index 9e2fc4b..51e4953 100644
--- a/exec.c
+++ b/exec.c
@@ -83,8 +83,9 @@ int use_icount;
typedef struct PhysPageEntry PhysPageEntry;
struct PhysPageEntry {
- uint16_t is_leaf : 1;
- /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
+ /* How many bits skip to next level (in units of L2_SIZE). 0 for a leaf. */
+ uint16_t skip : 1;
+ /* index into phys_sections (!skip) or phys_map_nodes (skip) */
uint16_t ptr : 15;
};
@@ -164,7 +165,7 @@ static uint16_t phys_map_node_alloc(void)
assert(ret != PHYS_MAP_NODE_NIL);
assert(ret != next_map.nodes_nb_alloc);
for (i = 0; i < P_L2_SIZE; ++i) {
- next_map.nodes[ret][i].is_leaf = 0;
+ next_map.nodes[ret][i].skip = 1;
next_map.nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
}
return ret;
@@ -178,12 +179,12 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
int i;
hwaddr step = (hwaddr)1 << (level * P_L2_BITS);
- if (!lp->is_leaf && lp->ptr == PHYS_MAP_NODE_NIL) {
+ if (lp->skip && lp->ptr == PHYS_MAP_NODE_NIL) {
lp->ptr = phys_map_node_alloc();
p = next_map.nodes[lp->ptr];
if (level == 0) {
for (i = 0; i < P_L2_SIZE; i++) {
- p[i].is_leaf = 1;
+ p[i].skip = 0;
p[i].ptr = PHYS_SECTION_UNASSIGNED;
}
}
@@ -194,7 +195,7 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
while (*nb && lp < &p[P_L2_SIZE]) {
if ((*index & (step - 1)) == 0 && *nb >= step) {
- lp->is_leaf = true;
+ lp->skip = 0;
lp->ptr = leaf;
*index += step;
*nb -= step;
@@ -221,7 +222,7 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
PhysPageEntry *p;
int i;
- for (i = P_L2_LEVELS - 1; i >= 0 && !lp.is_leaf; i--) {
+ for (i = P_L2_LEVELS; lp.skip && (i -= lp.skip) >= 0;) {
if (lp.ptr == PHYS_MAP_NODE_NIL) {
return &sections[PHYS_SECTION_UNASSIGNED];
}
@@ -1644,7 +1645,7 @@ static void mem_begin(MemoryListener *listener)
AddressSpace *as = container_of(listener, AddressSpace, dispatch_listener);
AddressSpaceDispatch *d = g_new(AddressSpaceDispatch, 1);
- d->phys_map = (PhysPageEntry) { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
+ d->phys_map = (PhysPageEntry) { .ptr = PHYS_MAP_NODE_NIL, .skip = 1 };
d->as = as;
as->next_dispatch = d;
}
--
MST
* [Qemu-devel] [PATCH v2 3/7] exec: extend skip field to 6 bit, page entry to 32 bit
From: Michael S. Tsirkin @ 2013-11-13 19:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
Extend the skip field to 6 bits. As the page entry no longer fits in
16 bits anyway, extend it to 32 bits.
This doubles the node map's memory requirements, but follow-up
patches will save this memory.
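A quick check of the new layout (my own sketch, not code from the
patch):

#include <assert.h>
#include <stdint.h>

#define NIL (((uint32_t)~0) >> 6)   /* the value used by the patch */

int main(void)
{
    /* 6 + 26 == 32: skip and ptr pack into one uint32_t, doubling
     * each entry from 16 to 32 bits. */
    assert(NIL == (1u << 26) - 1);  /* all-ones ptr is reserved as NIL */
    return 0;
}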
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
exec.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/exec.c b/exec.c
index 51e4953..ec46aea 100644
--- a/exec.c
+++ b/exec.c
@@ -84,11 +84,13 @@ typedef struct PhysPageEntry PhysPageEntry;
struct PhysPageEntry {
/* How many bits skip to next level (in units of L2_SIZE). 0 for a leaf. */
- uint16_t skip : 1;
+ uint32_t skip : 6;
/* index into phys_sections (!skip) or phys_map_nodes (skip) */
- uint16_t ptr : 15;
+ uint32_t ptr : 26;
};
+#define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
+
/* Size of the L2 (and L3, etc) page tables. */
#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
@@ -134,8 +136,6 @@ typedef struct PhysPageMap {
static PhysPageMap *prev_map;
static PhysPageMap next_map;
-#define PHYS_MAP_NODE_NIL (((uint16_t)~0) >> 1)
-
static void io_mem_init(void);
static void memory_map_init(void);
@@ -156,10 +156,10 @@ static void phys_map_node_reserve(unsigned nodes)
}
}
-static uint16_t phys_map_node_alloc(void)
+static uint32_t phys_map_node_alloc(void)
{
unsigned i;
- uint16_t ret;
+ uint32_t ret;
ret = next_map.nodes_nb++;
assert(ret != PHYS_MAP_NODE_NIL);
--
MST
* [Qemu-devel] [PATCH v2 4/7] exec: pass hw address to phys_page_find
From: Michael S. Tsirkin @ 2013-11-13 19:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
Callers always shift the address by TARGET_PAGE_BITS, so let's just do
this internally.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
exec.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/exec.c b/exec.c
index ec46aea..edb577f 100644
--- a/exec.c
+++ b/exec.c
@@ -216,10 +216,11 @@ static void phys_page_set(AddressSpaceDispatch *d,
phys_page_set_level(&d->phys_map, &index, &nb, leaf, P_L2_LEVELS - 1);
}
-static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
+static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr addr,
Node *nodes, MemoryRegionSection *sections)
{
PhysPageEntry *p;
+ hwaddr index = addr >> TARGET_PAGE_BITS;
int i;
for (i = P_L2_LEVELS; lp.skip && (i -= lp.skip) >= 0;) {
@@ -245,8 +246,7 @@ static MemoryRegionSection *address_space_lookup_region(AddressSpaceDispatch *d,
MemoryRegionSection *section;
subpage_t *subpage;
- section = phys_page_find(d->phys_map, addr >> TARGET_PAGE_BITS,
- d->nodes, d->sections);
+ section = phys_page_find(d->phys_map, addr, d->nodes, d->sections);
if (resolve_subpage && section->mr->subpage) {
subpage = container_of(section->mr, subpage_t, iomem);
section = &d->sections[subpage->sub_section[SUBPAGE_IDX(addr)]];
@@ -800,7 +800,7 @@ static void register_subpage(AddressSpaceDispatch *d, MemoryRegionSection *secti
subpage_t *subpage;
hwaddr base = section->offset_within_address_space
& TARGET_PAGE_MASK;
- MemoryRegionSection *existing = phys_page_find(d->phys_map, base >> TARGET_PAGE_BITS,
+ MemoryRegionSection *existing = phys_page_find(d->phys_map, base,
next_map.nodes, next_map.sections);
MemoryRegionSection subsection = {
.offset_within_address_space = base,
--
MST
* [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Michael S. Tsirkin @ 2013-11-13 19:23 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
At the moment, the memory radix tree is already variable depth, but it
can only skip the low bits of the address.
This is efficient if we have huge memory regions, but inefficient if we
are only using a tiny portion of the address space.
After we have built up the map, detect configurations where only a
single L2 entry is valid; we then speed up the lookup by skipping one
or more levels.
If any levels were skipped, an address that should fault might land in
a valid section instead of erroring out. We handle this by checking
that the address is in range of the resulting section.
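A worked example (numbers of mine, not from the patch): with 9-bit
levels, 12-bit pages and a 64-bit space there are 6 levels; if the only
mapping is a 4G RAM region at address 0, the top three levels each hold
exactly one valid entry, so compaction folds them into the root entry:
its skip grows from 1 to 4 and a lookup costs 3 node dereferences
instead of 6.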
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
exec.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)
diff --git a/exec.c b/exec.c
index edb577f..474d80d 100644
--- a/exec.c
+++ b/exec.c
@@ -51,6 +51,8 @@
#include "exec/memory-internal.h"
+#include "qemu/range.h"
+
//#define DEBUG_SUBPAGE
#if !defined(CONFIG_USER_ONLY)
@@ -216,6 +218,68 @@ static void phys_page_set(AddressSpaceDispatch *d,
phys_page_set_level(&d->phys_map, &index, &nb, leaf, P_L2_LEVELS - 1);
}
+/* Compact a non leaf page entry. Simply detect that the entry has a single child,
+ * and update our entry so we can skip it and go directly to the destination.
+ */
+static void phys_page_compact(PhysPageEntry *lp, Node *nodes, unsigned long *compacted)
+{
+ unsigned valid_ptr = P_L2_SIZE;
+ int valid = 0;
+ PhysPageEntry *p;
+ int i;
+
+ if (lp->ptr == PHYS_MAP_NODE_NIL) {
+ return;
+ }
+
+ p = nodes[lp->ptr];
+ for (i = 0; i < P_L2_SIZE; i++) {
+ if (p[i].ptr == PHYS_MAP_NODE_NIL) {
+ continue;
+ }
+
+ valid_ptr = i;
+ valid++;
+ if (p[i].skip) {
+ phys_page_compact(&p[i], nodes, compacted);
+ }
+ }
+
+ /* We can only compress if there's only one child. */
+ if (valid != 1) {
+ return;
+ }
+
+ assert(valid_ptr < P_L2_SIZE);
+
+ /* Don't compress if it won't fit in the # of bits we have. */
+ if (lp->skip + p[valid_ptr].skip >= (1 << 3)) {
+ return;
+ }
+
+ lp->ptr = p[valid_ptr].ptr;
+ if (!p[valid_ptr].skip) {
+ /* If our only child is a leaf, make this a leaf. */
+ /* By design, we should have made this node a leaf to begin with so we
+ * should never reach here.
+ * But since it's so simple to handle this, let's do it just in case we
+ * change this rule.
+ */
+ lp->skip = 0;
+ } else {
+ lp->skip += p[valid_ptr].skip;
+ }
+}
+
+static void phys_page_compact_all(AddressSpaceDispatch *d, int nodes_nb)
+{
+ DECLARE_BITMAP(compacted, nodes_nb);
+
+ if (d->phys_map.skip) {
+ phys_page_compact(&d->phys_map, d->nodes, compacted);
+ }
+}
+
static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr addr,
Node *nodes, MemoryRegionSection *sections)
{
@@ -230,7 +294,14 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr addr,
p = nodes[lp.ptr];
lp = p[(index >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
}
- return &sections[lp.ptr];
+
+ if (sections[lp.ptr].size.hi ||
+ range_covers_byte(sections[lp.ptr].offset_within_address_space,
+ sections[lp.ptr].size.lo, addr)) {
+ return &sections[lp.ptr];
+ } else {
+ return &sections[PHYS_SECTION_UNASSIGNED];
+ }
}
bool memory_region_is_unassigned(MemoryRegion *mr)
@@ -1659,6 +1730,8 @@ static void mem_commit(MemoryListener *listener)
next->nodes = next_map.nodes;
next->sections = next_map.sections;
+ phys_page_compact_all(next, next_map.nodes_nb);
+
as->dispatch = next;
g_free(cur);
}
--
MST
* [Qemu-devel] [PATCH v2 6/7] exec: make address spaces 64-bit wide
From: Michael S. Tsirkin @ 2013-11-13 19:24 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
From: Paolo Bonzini <pbonzini@redhat.com>
As an alternative to commit 818f86b (exec: limit system memory
size, 2013-11-04) let's just make all address spaces 64-bit wide.
This eliminates problems with phys_page_find ignoring bits above
TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
consequently messing up the computations.
In Luiz's reported crash, at startup gdb attempts to read from address
0xffffffffffffffe6 to 0xffffffffffffffff inclusive. The region it gets
is the newly introduced master abort region, which is as big as the PCI
address space (see pci_bus_init). Due to a typo that's only 2^63-1,
not 2^64. But we get it anyway because phys_page_find ignores the upper
bits of the physical address. In address_space_translate_internal then
diff = int128_sub(section->mr->size, int128_make64(addr));
*plen = int128_get64(int128_min(diff, int128_make64(*plen)));
diff becomes negative, and int128_get64 booms.
The size of the PCI address space region should be fixed anyway.
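The arithmetic can be reproduced outside QEMU (a reconstruction of
mine, using __int128 in place of QEMU's Int128):

#include <stdio.h>

int main(void)
{
    __int128 size = (((__int128)1 << 63) - 1); /* typo'd region size */
    __int128 addr = 0xffffffffffffffe6ULL;     /* gdb's read address */
    __int128 diff = size - addr;               /* negative */

    /* int128_get64() insists its argument fits in a uint64_t,
     * so a negative diff trips the assertion. */
    printf("diff is negative: %d\n", (int)(diff < 0)); /* prints 1 */
    return 0;
}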
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
exec.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/exec.c b/exec.c
index 474d80d..6e64d27 100644
--- a/exec.c
+++ b/exec.c
@@ -94,7 +94,7 @@ struct PhysPageEntry {
#define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
/* Size of the L2 (and L3, etc) page tables. */
-#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
+#define ADDR_SPACE_BITS 64
#define P_L2_BITS 10
#define P_L2_SIZE (1 << P_L2_BITS)
@@ -1824,11 +1824,7 @@ static void memory_map_init(void)
{
system_memory = g_malloc(sizeof(*system_memory));
- assert(ADDR_SPACE_BITS <= 64);
-
- memory_region_init(system_memory, NULL, "system",
- ADDR_SPACE_BITS == 64 ?
- UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
+ memory_region_init(system_memory, NULL, "system", UINT64_MAX);
address_space_init(&address_space_memory, system_memory, "memory");
system_io = g_malloc(sizeof(*system_io));
--
MST
* [Qemu-devel] [PATCH v2 7/7] exec: reduce L2_PAGE_SIZE
From: Michael S. Tsirkin @ 2013-11-13 19:24 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino
With the single exception of ppc with 16M pages, we get the same number
of levels with P_L2_BITS = 9 as with P_L2_BITS = 10.
By doing this we reduce the memory footprint of a single level in the
node memory map by 2x, without runtime overhead.
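Checking with the P_L2_LEVELS formula from earlier in the series (my
arithmetic): with 4K pages, ((64 - 12 - 1) / 9) + 1 == 6, the same as
((64 - 12 - 1) / 10) + 1; with ppc's 16M pages (TARGET_PAGE_BITS = 24),
((64 - 24 - 1) / 9) + 1 == 5 while ((64 - 24 - 1) / 10) + 1 == 4 --
the single exception noted above.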
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
exec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/exec.c b/exec.c
index 6e64d27..7d3f743 100644
--- a/exec.c
+++ b/exec.c
@@ -96,7 +96,7 @@ struct PhysPageEntry {
/* Size of the L2 (and L3, etc) page tables. */
#define ADDR_SPACE_BITS 64
-#define P_L2_BITS 10
+#define P_L2_BITS 9
#define P_L2_SIZE (1 << P_L2_BITS)
#define P_L2_LEVELS (((ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / P_L2_BITS) + 1)
--
MST
* Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Avi Kivity @ 2013-11-14 8:54 UTC (permalink / raw)
To: qemu-devel
Michael S. Tsirkin <mst <at> redhat.com> writes:
> [...]
I think this is overkill. It can be done in a simpler way as follows:
phys_page_find(RadixTree *rt, hwaddr index, ...)
{
if (index & rt->invalid_index_mask) {
// not found
}
lp = rt->root;
for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
...
This exploits the fact the lower portion of the address space is always
filled, at least in the cases that matter to us.
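One way the mask could be maintained (an untested sketch; names are
made up):

#include <stdint.h>

/* After (re)building the map: if everything mapped lies below
 * (1 << covered_bits), any index with a higher bit set must miss.
 * Assumes covered_bits < 64. */
static uint64_t make_invalid_index_mask(int covered_bits)
{
    return ~((((uint64_t)1) << covered_bits) - 1);
}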
* Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Michael S. Tsirkin @ 2013-11-14 14:40 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
> Michael S. Tsirkin <mst <at> redhat.com> writes:
>
> > [...]
>
> I think this is overkill. It can be done in a simpler way as follows:
>
>
> phys_page_find(RadixTree *rt, hwaddr index, ...)
> {
> if (index & rt->invalid_index_mask) {
> // not found
> }
> lp = rt->root;
> for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
> ...
>
> This exploits the fact the lower portion of the address space is always
> filled, at least in the cases that matter to us.
Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.
--
MST
* Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Avi Kivity @ 2013-11-14 14:56 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: qemu-devel
On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
>> Michael S. Tsirkin <mst <at> redhat.com> writes:
>>> [...]
>> I think this is overkill. It can be done in a simpler way as follows:
>> [...]
> Basically skip unused high bits?
> Sure.
> In fact I think both optimizations can be combined.
Not much value in combining them -- the variable level tree check will
be dominated by the level skip logic.
IMO however skipping intermediate levels will be too rare to justify the
complexity and the doubling of the page table size -- it can only happen
in iommu setups that place memory in very high addresses. These ought
to be rare.
* Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Michael S. Tsirkin @ 2013-11-14 15:37 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:
> On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
> >On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
> >>Michael S. Tsirkin <mst <at> redhat.com> writes:
> >>>[...]
> >>I think this is overkill. It can be done in a simpler way as follows:
> >>[...]
> >Basically skip unused high bits?
> >Sure.
> >In fact I think both optimizations can be combined.
>
> Not much value in combining them -- the variable level tree check
> will be dominated by the level skip logic.
>
> IMO however skipping intermediate levels will be too rare to justify
> the complexity and the doubling of the page table size -- it can
> only happen in iommu setups that place memory in very high
> addresses. These ought to be rare.
>
Well maybe not very high address, but you can have a device with
a 64 bit bar and this will add back levels (though it would not
be slower than it is currently).
I agree the simplicity is attractive.
However I really would like some logic that can handle > 1 leaf
somehow.
Specifically both tricks break if we add a full 64 bit io region
in order to return -1 for reads on master abort.
--
MST
* Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Avi Kivity @ 2013-11-14 15:42 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: qemu-devel
On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:
>> On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
>>> On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
>>>> Michael S. Tsirkin <mst <at> redhat.com> writes:
>>>>> [...]
>>>> I think this is overkill. It can be done in a simpler way as follows:
>>>> [...]
>>> Basically skip unused high bits?
>>> Sure.
>>> In fact I think both optimizations can be combined.
>> Not much value in combining them -- the variable level tree check
>> will be dominated by the level skip logic.
>>
>> IMO however skipping intermediate levels will be too rare to justify
>> the complexity and the doubling of the page table size -- it can
>> only happen in iommu setups that place memory in very high
>> addresses. These ought to be rare.
>>
> Well maybe not very high address, but you can have a device with
> a 64 bit bar and this will add back levels (though it would not
> be slower than it is currently).
>
> I agree the simplicity is attractive.
>
> However I really would like some logic that can handle > 1 leaf
> somehow.
>
> Specifically both tricks break if we add a full 64 bit io region
> in order to return -1 for reads on master abort.
That can be achieved by directing a missed lookup to a catch-all region.
It would be best to do this completely internally to the memory code,
without API changes.
* Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Michael S. Tsirkin @ 2013-11-17 15:37 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote:
> On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:
> >[...]
> >Well maybe not very high address, but you can have a device with
> >a 64 bit bar and this will add back levels (though it would not
> >be slower than it is currently).
> >
> >I agree the simplicity is attractive.
> >
> >However I really would like some logic that can handle > 1 leaf
> >somehow.
> >
> >Specifically both tricks break if we add a full 64 bit io region
> >in order to return -1 for reads on master abort.
>
> That can be achieved by directing a missed lookup to a catch-all region.
>
> It would be best to do this completely internally to the memory
> code, without API changes.
You mean e.g. add a catch-all region to the address space?
* Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
From: Avi Kivity @ 2013-11-17 16:12 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: qemu-devel
On 11/17/2013 05:37 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote:
>> On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:
>>> [...]
>>> However I really would like some logic that can handle > 1 leaf
>>> somehow.
>>>
>>> Specifically both tricks break if we add a full 64 bit io region
>>> in order to return -1 for reads on master abort.
>> That can be achieved by directing a missed lookup to a catch-all region.
>>
>> It would be best to do this completely internally to the memory
>> code, without API changes.
> You mean e.g. add a catch-all region to the address space?
>
The API already specifies a catch-all region: the container region
catches everything that doesn't hit a sub-region (though I wanted to
discourage the practice of using the same region for both I/O and a
container). Alternatively, you can have an I/O region as a subregion
with the lowest priority, spanning the entire space, so that any access
which misses the other sub-regions hits this default subregion.
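Roughly (an untested sketch against the current memory API; the region
name and ops are made up):

/* A background I/O region spanning the whole space, at very low
 * priority: any access that misses every other subregion falls
 * through to it. */
static MemoryRegion master_abort_mr;

memory_region_init_io(&master_abort_mr, NULL, &master_abort_ops,
                      NULL, "master-abort", UINT64_MAX);
memory_region_add_subregion_overlap(get_system_memory(), 0,
                                    &master_abort_mr, -1000);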
So we don't need API changes, just to change the internals.
I think this can be done by looking at the address space, and finding
out what region is mapped to the very last byte in the address space, if
any, and using that as the default if an access misses the tree. That
is, this region will be the target of null pointers in the tree, or if
(addr & addr_invalid_mask) is true.
This will ensure the tree is of the smallest possible size.
Another option is to simply give up on the tree and use
std::binary_search() on the FlatView. It will be more cpu intensive,
but on the other hand more cache-friendly. tcg already caches stuff in
the iotlb, and maybe it makes sense to add a cache for kvm lookups too
(the advantage of the cache is that it only needs to contain addresses
that are actually touched, so it can be very small and fast, compared to
the radix tree which needs to cover all addresses).
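For concreteness, the search would look something like this (an
untested sketch; the real FlatView keeps its FlatRanges sorted by
address, which is what makes this work):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t base, size;   /* covers [base, base + size) */
    void *section;
} ExFlatRange;

static void *ex_flatview_lookup(ExFlatRange *fr, int nr, uint64_t addr)
{
    int lo = 0, hi = nr;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        if (addr < fr[mid].base) {
            hi = mid;
        } else if (addr - fr[mid].base >= fr[mid].size) {
            lo = mid + 1;
        } else {
            return fr[mid].section;  /* hit */
        }
    }
    return NULL;  /* miss: unassigned, or the catch-all above */
}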