qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH] exec: limit system memory size
@ 2013-11-04  6:06 Michael S. Tsirkin
  2013-11-04  6:15 ` Michael S. Tsirkin
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2013-11-04  6:06 UTC (permalink / raw)
  To: marcel.a, qemu-devel
  Cc: Peter Maydell, Paolo Bonzini, Jan Kiszka, Andreas Färber,
	Richard Henderson

The page table logic in exec.c assumes
that memory addresses are at most TARGET_PHYS_ADDR_SPACE_BITS.

But pci addresses are full 64 bit so if we try to render them ignoring
the extra bits, we get strange effects with sections overlapping each
other.

To fix, simply limit the system memory size to
 1 << TARGET_PHYS_ADDR_SPACE_BITS,
pci addresses will be rendered within that.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/exec.c b/exec.c
index 030118e..c7a8df5 100644
--- a/exec.c
+++ b/exec.c
@@ -1801,7 +1801,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
 static void memory_map_init(void)
 {
     system_memory = g_malloc(sizeof(*system_memory));
-    memory_region_init(system_memory, NULL, "system", INT64_MAX);
+
+    assert(TARGET_PHYS_ADDR_SPACE_BITS <= 64);
+
+    memory_region_init(system_memory, NULL, "system",
+                       TARGET_PHYS_ADDR_SPACE_BITS == 64 ?
+                       UINT64_MAX : (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS));
     address_space_init(&address_space_memory, system_memory, "memory");
 
     system_io = g_malloc(sizeof(*system_io));
-- 
MST

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04  6:06 [Qemu-devel] [PATCH] exec: limit system memory size Michael S. Tsirkin
@ 2013-11-04  6:15 ` Michael S. Tsirkin
  2013-11-04  9:50 ` Marcel Apfelbaum
  2013-11-04 12:26 ` Paolo Bonzini
  2 siblings, 0 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2013-11-04  6:15 UTC (permalink / raw)
  To: marcel.a, qemu-devel
  Cc: Peter Maydell, Jan Kiszka, qemu-stable, Paolo Bonzini,
	Andreas Färber, Richard Henderson

On Mon, Nov 04, 2013 at 08:06:08AM +0200, Michael S. Tsirkin wrote:
> The page table logic in exec.c assumes
> that memory addresses are at most TARGET_PHYS_ADDR_SPACE_BITS.
> 
> But pci addresses are full 64 bit so if we try to render them ignoring
> the extra bits, we get strange effects with sections overlapping each
> other.
> 
> To fix, simply limit the system memory size to
>  1 << TARGET_PHYS_ADDR_SPACE_BITS,
> pci addresses will be rendered within that.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

BTW I think this is -stable material too.

> ---
>  exec.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/exec.c b/exec.c
> index 030118e..c7a8df5 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1801,7 +1801,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
>  static void memory_map_init(void)
>  {
>      system_memory = g_malloc(sizeof(*system_memory));
> -    memory_region_init(system_memory, NULL, "system", INT64_MAX);
> +
> +    assert(TARGET_PHYS_ADDR_SPACE_BITS <= 64);
> +
> +    memory_region_init(system_memory, NULL, "system",
> +                       TARGET_PHYS_ADDR_SPACE_BITS == 64 ?
> +                       UINT64_MAX : (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS));
>      address_space_init(&address_space_memory, system_memory, "memory");
>  
>      system_io = g_malloc(sizeof(*system_io));
> -- 
> MST

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04  6:06 [Qemu-devel] [PATCH] exec: limit system memory size Michael S. Tsirkin
  2013-11-04  6:15 ` Michael S. Tsirkin
@ 2013-11-04  9:50 ` Marcel Apfelbaum
  2013-11-04 10:07   ` Michael S. Tsirkin
  2013-11-04 12:26 ` Paolo Bonzini
  2 siblings, 1 reply; 10+ messages in thread
From: Marcel Apfelbaum @ 2013-11-04  9:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Jan Kiszka, qemu-devel, Paolo Bonzini,
	Andreas Färber, Richard Henderson

On Mon, 2013-11-04 at 08:06 +0200, Michael S. Tsirkin wrote:
> The page table logic in exec.c assumes
> that memory addresses are at most TARGET_PHYS_ADDR_SPACE_BITS.
> 
> But pci addresses are full 64 bit so if we try to render them ignoring
> the extra bits, we get strange effects with sections overlapping each
> other.
> 
> To fix, simply limit the system memory size to
>  1 << TARGET_PHYS_ADDR_SPACE_BITS,
> pci addresses will be rendered within that.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  exec.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/exec.c b/exec.c
> index 030118e..c7a8df5 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1801,7 +1801,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
>  static void memory_map_init(void)
>  {
>      system_memory = g_malloc(sizeof(*system_memory));
> -    memory_region_init(system_memory, NULL, "system", INT64_MAX);
> +
> +    assert(TARGET_PHYS_ADDR_SPACE_BITS <= 64);
> +
> +    memory_region_init(system_memory, NULL, "system",
> +                       TARGET_PHYS_ADDR_SPACE_BITS == 64 ?
> +                       UINT64_MAX : (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS));

Michael, thanks again for the help.

I am concerned that we cannot use all the UINT64_MAX
address space.

By the way, this patch makes the memory size aligned to page size,
so the call to register_subpage (the asserted code) is diverted
to register_multipage that does not have an assert.
That leads me to another question?
Maybe the fact that INT64_MAX is not aligned to page size makes
all the trouble? 

What do you think?

Regarding this patch:
Maybe we should to add an assert inside memory_region_init 
in order to protect all the code that creates memory regions?

And also maybe we should add a define MAX_MEMORY_REGION_SIZE
to be used in all places we want a "maximum size" memory region?

Thanks,
Marcel

>      address_space_init(&address_space_memory, system_memory, "memory");
>  
>      system_io = g_malloc(sizeof(*system_io));

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04  9:50 ` Marcel Apfelbaum
@ 2013-11-04 10:07   ` Michael S. Tsirkin
  2013-11-04 10:54     ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2013-11-04 10:07 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Peter Maydell, Jan Kiszka, qemu-devel, Paolo Bonzini,
	Andreas Färber, Richard Henderson

On Mon, Nov 04, 2013 at 11:50:05AM +0200, Marcel Apfelbaum wrote:
> On Mon, 2013-11-04 at 08:06 +0200, Michael S. Tsirkin wrote:
> > The page table logic in exec.c assumes
> > that memory addresses are at most TARGET_PHYS_ADDR_SPACE_BITS.
> > 
> > But pci addresses are full 64 bit so if we try to render them ignoring
> > the extra bits, we get strange effects with sections overlapping each
> > other.
> > 
> > To fix, simply limit the system memory size to
> >  1 << TARGET_PHYS_ADDR_SPACE_BITS,
> > pci addresses will be rendered within that.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  exec.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 030118e..c7a8df5 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -1801,7 +1801,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
> >  static void memory_map_init(void)
> >  {
> >      system_memory = g_malloc(sizeof(*system_memory));
> > -    memory_region_init(system_memory, NULL, "system", INT64_MAX);
> > +
> > +    assert(TARGET_PHYS_ADDR_SPACE_BITS <= 64);
> > +
> > +    memory_region_init(system_memory, NULL, "system",
> > +                       TARGET_PHYS_ADDR_SPACE_BITS == 64 ?
> > +                       UINT64_MAX : (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS));
> 
> Michael, thanks again for the help.
> 
> I am concerned that we cannot use all the UINT64_MAX
> address space.

Well, exec isn't ready for this, it expects at most
TARGET_PHYS_ADDR_SPACE_BITS.
Fortunately there's no way for CPU to initiate io outside
this area.

So this is another place where device to device IO
would be broken.

> By the way, this patch makes the memory size aligned to page size,
> so the call to register_subpage (the asserted code) is diverted
> to register_multipage that does not have an assert.

I tested with (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS) - 1
as well, works fine.

> That leads me to another question?
> Maybe the fact that INT64_MAX is not aligned to page size makes
> all the trouble? 

It's not what's causing the trouble, it's merely making
the bug visible since subpage is the only path that actually checks that
sections do not overlap.

> What do you think?
> 
> Regarding this patch:
> Maybe we should to add an assert inside memory_region_init 
> in order to protect all the code that creates memory regions?

To make sure sections don't overlap? Sure. Go for it.

> And also maybe we should add a define MAX_MEMORY_REGION_SIZE
> to be used in all places we want a "maximum size" memory region?
> 
> Thanks,
> Marcel

I think we can't define this in a target independent way really.

> >      address_space_init(&address_space_memory, system_memory, "memory");
> >  
> >      system_io = g_malloc(sizeof(*system_io));
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04 10:07   ` Michael S. Tsirkin
@ 2013-11-04 10:54     ` Paolo Bonzini
  2013-11-04 11:14       ` Michael S. Tsirkin
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2013-11-04 10:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Marcel Apfelbaum, Jan Kiszka, qemu-devel,
	Andreas Färber, Richard Henderson

Il 04/11/2013 11:07, Michael S. Tsirkin ha scritto:
> On Mon, Nov 04, 2013 at 11:50:05AM +0200, Marcel Apfelbaum wrote:
>> On Mon, 2013-11-04 at 08:06 +0200, Michael S. Tsirkin wrote:
>>> The page table logic in exec.c assumes
>>> that memory addresses are at most TARGET_PHYS_ADDR_SPACE_BITS.
>>>
>>> But pci addresses are full 64 bit so if we try to render them ignoring
>>> the extra bits, we get strange effects with sections overlapping each
>>> other.
>>>
>>> To fix, simply limit the system memory size to
>>>  1 << TARGET_PHYS_ADDR_SPACE_BITS,
>>> pci addresses will be rendered within that.
>>>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>>  exec.c | 6 +++++-
>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/exec.c b/exec.c
>>> index 030118e..c7a8df5 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -1801,7 +1801,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
>>>  static void memory_map_init(void)
>>>  {
>>>      system_memory = g_malloc(sizeof(*system_memory));
>>> -    memory_region_init(system_memory, NULL, "system", INT64_MAX);
>>> +
>>> +    assert(TARGET_PHYS_ADDR_SPACE_BITS <= 64);
>>> +
>>> +    memory_region_init(system_memory, NULL, "system",
>>> +                       TARGET_PHYS_ADDR_SPACE_BITS == 64 ?
>>> +                       UINT64_MAX : (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS));
>>
>> Michael, thanks again for the help.
>>
>> I am concerned that we cannot use all the UINT64_MAX
>> address space.
> 
> Well, exec isn't ready for this, it expects at most
> TARGET_PHYS_ADDR_SPACE_BITS.
> Fortunately there's no way for CPU to initiate io outside
> this area.
> 
> So this is another place where device to device IO
> would be broken.

However, firmware that places BARs where the CPU cannot access them 
would be "interesting" to say the least.  This applies both to device-
device I/O and to non-aligned <4K BARs.

This patch looks good; however, on top of it can you test 
kvm-unit-tests with TARGET_PHYS_ADDR_SPACE_BITS=64 and see whether 
there is a measurable slowdown (in the inl_from_qemu tests)?  If not, 
we can just get rid of TARGET_PHYS_ADDR_SPACE_BITS in exec.c.

Note that L2_BITS is shared between translate-all.c and exec.c only for 
historical reasons (the translate-all.c code used to be in exec.c).  It 
is probably a good idea to split them like this:

diff --git a/exec.c b/exec.c
index 2e31ffc..3faea0e 100644
--- a/exec.c
+++ b/exec.c
@@ -88,7 +88,15 @@ struct PhysPageEntry {
     uint16_t ptr : 15;
 };
 
-typedef PhysPageEntry Node[L2_SIZE];
+/* Size of the L2 (and L3, etc) page tables.  */
+#define ADDR_SPACE_BITS 64
+
+#define P_L2_BITS 10
+#define P_L2_SIZE (1 << P_L2_BITS)
+
+#define P_L2_LEVELS (((ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / P_L2_BITS) + 1)
+
+typedef PhysPageEntry Node[P_L2_SIZE];
 
 struct AddressSpaceDispatch {
     /* This is a multi-level map on the physical address space.
@@ -155,7 +163,7 @@ static uint16_t phys_map_node_alloc(void)
     ret = next_map.nodes_nb++;
     assert(ret != PHYS_MAP_NODE_NIL);
     assert(ret != next_map.nodes_nb_alloc);
-    for (i = 0; i < L2_SIZE; ++i) {
+    for (i = 0; i < P_L2_SIZE; ++i) {
         next_map.nodes[ret][i].is_leaf = 0;
         next_map.nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
     }
@@ -168,13 +176,13 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
 {
     PhysPageEntry *p;
     int i;
-    hwaddr step = (hwaddr)1 << (level * L2_BITS);
+    hwaddr step = (hwaddr)1 << (level * P_L2_BITS);
 
     if (!lp->is_leaf && lp->ptr == PHYS_MAP_NODE_NIL) {
         lp->ptr = phys_map_node_alloc();
         p = next_map.nodes[lp->ptr];
         if (level == 0) {
-            for (i = 0; i < L2_SIZE; i++) {
+            for (i = 0; i < P_L2_SIZE; i++) {
                 p[i].is_leaf = 1;
                 p[i].ptr = PHYS_SECTION_UNASSIGNED;
             }
@@ -182,9 +190,9 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
     } else {
         p = next_map.nodes[lp->ptr];
     }
-    lp = &p[(*index >> (level * L2_BITS)) & (L2_SIZE - 1)];
+    lp = &p[(*index >> (level * P_L2_BITS)) & (P_L2_SIZE - 1)];
 
-    while (*nb && lp < &p[L2_SIZE]) {
+    while (*nb && lp < &p[P_L2_SIZE]) {
         if ((*index & (step - 1)) == 0 && *nb >= step) {
             lp->is_leaf = true;
             lp->ptr = leaf;
@@ -218,7 +226,7 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
             return &sections[PHYS_SECTION_UNASSIGNED];
         }
         p = nodes[lp.ptr];
-        lp = p[(index >> (i * L2_BITS)) & (L2_SIZE - 1)];
+        lp = p[(index >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
     }
     return &sections[lp.ptr];
 }
diff --git a/translate-all.c b/translate-all.c
index aeda54d..1c63d78 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -96,12 +96,16 @@ typedef struct PageDesc {
 # define L1_MAP_ADDR_SPACE_BITS  TARGET_VIRT_ADDR_SPACE_BITS
 #endif
 
+/* Size of the L2 (and L3, etc) page tables.  */
+#define V_L2_BITS 10
+#define V_L2_SIZE (1 << V_L2_BITS)
+
 /* The bits remaining after N lower levels of page tables.  */
 #define V_L1_BITS_REM \
-    ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % L2_BITS)
+    ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % V_L2_BITS)
 
 #if V_L1_BITS_REM < 4
-#define V_L1_BITS  (V_L1_BITS_REM + L2_BITS)
+#define V_L1_BITS  (V_L1_BITS_REM + V_L2_BITS)
 #else
 #define V_L1_BITS  V_L1_BITS_REM
 #endif
@@ -395,18 +399,18 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
     lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
 
     /* Level 2..N-1.  */
-    for (i = V_L1_SHIFT / L2_BITS - 1; i > 0; i--) {
+    for (i = V_L1_SHIFT / V_L2_BITS - 1; i > 0; i--) {
         void **p = *lp;
 
         if (p == NULL) {
             if (!alloc) {
                 return NULL;
             }
-            ALLOC(p, sizeof(void *) * L2_SIZE);
+            ALLOC(p, sizeof(void *) * V_L2_SIZE);
             *lp = p;
         }
 
-        lp = p + ((index >> (i * L2_BITS)) & (L2_SIZE - 1));
+        lp = p + ((index >> (i * V_L2_BITS)) & (V_L2_SIZE - 1));
     }
 
     pd = *lp;
@@ -414,13 +418,13 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
         if (!alloc) {
             return NULL;
         }
-        ALLOC(pd, sizeof(PageDesc) * L2_SIZE);
+        ALLOC(pd, sizeof(PageDesc) * V_L2_SIZE);
         *lp = pd;
     }
 
 #undef ALLOC
 
-    return pd + (index & (L2_SIZE - 1));
+    return pd + (index & (V_L2_SIZE - 1));
 }
 
 static inline PageDesc *page_find(tb_page_addr_t index)
@@ -655,14 +659,14 @@ static void page_flush_tb_1(int level, void **lp)
     if (level == 0) {
         PageDesc *pd = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             pd[i].first_tb = NULL;
             invalidate_page_bitmap(pd + i);
         }
     } else {
         void **pp = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             page_flush_tb_1(level - 1, pp + i);
         }
     }
@@ -673,7 +677,7 @@ static void page_flush_tb(void)
     int i;
 
     for (i = 0; i < V_L1_SIZE; i++) {
-        page_flush_tb_1(V_L1_SHIFT / L2_BITS - 1, l1_map + i);
+        page_flush_tb_1(V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
     }
 }
 
@@ -1600,7 +1604,7 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
     if (level == 0) {
         PageDesc *pd = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             int prot = pd[i].flags;
 
             pa = base | (i << TARGET_PAGE_BITS);
@@ -1614,9 +1618,9 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
     } else {
         void **pp = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             pa = base | ((abi_ulong)i <<
-                (TARGET_PAGE_BITS + L2_BITS * level));
+                (TARGET_PAGE_BITS + V_L2_BITS * level));
             rc = walk_memory_regions_1(data, pa, level - 1, pp + i);
             if (rc != 0) {
                 return rc;
@@ -1639,7 +1643,7 @@ int walk_memory_regions(void *priv, walk_memory_regions_fn fn)
 
     for (i = 0; i < V_L1_SIZE; i++) {
         int rc = walk_memory_regions_1(&data, (abi_ulong)i << V_L1_SHIFT,
-                                       V_L1_SHIFT / L2_BITS - 1, l1_map + i);
+                                       V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
 
         if (rc != 0) {
             return rc;
diff --git a/translate-all.h b/translate-all.h
index 5c38819..f7e5932 100644
--- a/translate-all.h
+++ b/translate-all.h
@@ -19,13 +19,6 @@
 #ifndef TRANSLATE_ALL_H
 #define TRANSLATE_ALL_H
 
-/* Size of the L2 (and L3, etc) page tables.  */
-#define L2_BITS 10
-#define L2_SIZE (1 << L2_BITS)
-
-#define P_L2_LEVELS \
-    (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
-
 /* translate-all.c */
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
 void cpu_unlink_tb(CPUState *cpu);

Paolo

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04 10:54     ` Paolo Bonzini
@ 2013-11-04 11:14       ` Michael S. Tsirkin
  2013-11-04 11:22         ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2013-11-04 11:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Maydell, Marcel Apfelbaum, Jan Kiszka, qemu-devel,
	Andreas Färber, Richard Henderson

On Mon, Nov 04, 2013 at 11:54:34AM +0100, Paolo Bonzini wrote:
> Il 04/11/2013 11:07, Michael S. Tsirkin ha scritto:
> > On Mon, Nov 04, 2013 at 11:50:05AM +0200, Marcel Apfelbaum wrote:
> >> On Mon, 2013-11-04 at 08:06 +0200, Michael S. Tsirkin wrote:
> >>> The page table logic in exec.c assumes
> >>> that memory addresses are at most TARGET_PHYS_ADDR_SPACE_BITS.
> >>>
> >>> But pci addresses are full 64 bit so if we try to render them ignoring
> >>> the extra bits, we get strange effects with sections overlapping each
> >>> other.
> >>>
> >>> To fix, simply limit the system memory size to
> >>>  1 << TARGET_PHYS_ADDR_SPACE_BITS,
> >>> pci addresses will be rendered within that.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>> ---
> >>>  exec.c | 6 +++++-
> >>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/exec.c b/exec.c
> >>> index 030118e..c7a8df5 100644
> >>> --- a/exec.c
> >>> +++ b/exec.c
> >>> @@ -1801,7 +1801,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
> >>>  static void memory_map_init(void)
> >>>  {
> >>>      system_memory = g_malloc(sizeof(*system_memory));
> >>> -    memory_region_init(system_memory, NULL, "system", INT64_MAX);
> >>> +
> >>> +    assert(TARGET_PHYS_ADDR_SPACE_BITS <= 64);
> >>> +
> >>> +    memory_region_init(system_memory, NULL, "system",
> >>> +                       TARGET_PHYS_ADDR_SPACE_BITS == 64 ?
> >>> +                       UINT64_MAX : (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS));
> >>
> >> Michael, thanks again for the help.
> >>
> >> I am concerned that we cannot use all the UINT64_MAX
> >> address space.
> > 
> > Well, exec isn't ready for this, it expects at most
> > TARGET_PHYS_ADDR_SPACE_BITS.
> > Fortunately there's no way for CPU to initiate io outside
> > this area.
> > 
> > So this is another place where device to device IO
> > would be broken.
> 
> However, firmware that places BARs where the CPU cannot access them 
> would be "interesting" to say the least.  This applies both to device-
> device I/O and to non-aligned <4K BARs.
> 
> This patch looks good; however, on top of it can you test 
> kvm-unit-tests with TARGET_PHYS_ADDR_SPACE_BITS=64 and see whether 
> there is a measurable slowdown (in the inl_from_qemu tests)?  If not, 
> we can just get rid of TARGET_PHYS_ADDR_SPACE_BITS in exec.c.

I'd rather we fixed a bug first - we need to fix it on stable too - any
cleanups can come on top.  Also, I'm not sure what will this test tell
us: inl reads io space, not memory, right?

> 
> Note that L2_BITS is shared between translate-all.c and exec.c only for 
> historical reasons (the translate-all.c code used to be in exec.c).  It 
> is probably a good idea to split them like this:

Want to test this and post properly?

> diff --git a/exec.c b/exec.c
> index 2e31ffc..3faea0e 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -88,7 +88,15 @@ struct PhysPageEntry {
>      uint16_t ptr : 15;
>  };
>  
> -typedef PhysPageEntry Node[L2_SIZE];
> +/* Size of the L2 (and L3, etc) page tables.  */
> +#define ADDR_SPACE_BITS 64
> +
> +#define P_L2_BITS 10
> +#define P_L2_SIZE (1 << P_L2_BITS)
> +
> +#define P_L2_LEVELS (((ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / P_L2_BITS) + 1)
> +
> +typedef PhysPageEntry Node[P_L2_SIZE];
>  
>  struct AddressSpaceDispatch {
>      /* This is a multi-level map on the physical address space.
> @@ -155,7 +163,7 @@ static uint16_t phys_map_node_alloc(void)
>      ret = next_map.nodes_nb++;
>      assert(ret != PHYS_MAP_NODE_NIL);
>      assert(ret != next_map.nodes_nb_alloc);
> -    for (i = 0; i < L2_SIZE; ++i) {
> +    for (i = 0; i < P_L2_SIZE; ++i) {
>          next_map.nodes[ret][i].is_leaf = 0;
>          next_map.nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
>      }
> @@ -168,13 +176,13 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
>  {
>      PhysPageEntry *p;
>      int i;
> -    hwaddr step = (hwaddr)1 << (level * L2_BITS);
> +    hwaddr step = (hwaddr)1 << (level * P_L2_BITS);
>  
>      if (!lp->is_leaf && lp->ptr == PHYS_MAP_NODE_NIL) {
>          lp->ptr = phys_map_node_alloc();
>          p = next_map.nodes[lp->ptr];
>          if (level == 0) {
> -            for (i = 0; i < L2_SIZE; i++) {
> +            for (i = 0; i < P_L2_SIZE; i++) {
>                  p[i].is_leaf = 1;
>                  p[i].ptr = PHYS_SECTION_UNASSIGNED;
>              }
> @@ -182,9 +190,9 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
>      } else {
>          p = next_map.nodes[lp->ptr];
>      }
> -    lp = &p[(*index >> (level * L2_BITS)) & (L2_SIZE - 1)];
> +    lp = &p[(*index >> (level * P_L2_BITS)) & (P_L2_SIZE - 1)];
>  
> -    while (*nb && lp < &p[L2_SIZE]) {
> +    while (*nb && lp < &p[P_L2_SIZE]) {
>          if ((*index & (step - 1)) == 0 && *nb >= step) {
>              lp->is_leaf = true;
>              lp->ptr = leaf;
> @@ -218,7 +226,7 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
>              return &sections[PHYS_SECTION_UNASSIGNED];
>          }
>          p = nodes[lp.ptr];
> -        lp = p[(index >> (i * L2_BITS)) & (L2_SIZE - 1)];
> +        lp = p[(index >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
>      }
>      return &sections[lp.ptr];
>  }
> diff --git a/translate-all.c b/translate-all.c
> index aeda54d..1c63d78 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -96,12 +96,16 @@ typedef struct PageDesc {
>  # define L1_MAP_ADDR_SPACE_BITS  TARGET_VIRT_ADDR_SPACE_BITS
>  #endif
>  
> +/* Size of the L2 (and L3, etc) page tables.  */
> +#define V_L2_BITS 10
> +#define V_L2_SIZE (1 << V_L2_BITS)
> +
>  /* The bits remaining after N lower levels of page tables.  */
>  #define V_L1_BITS_REM \
> -    ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % L2_BITS)
> +    ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % V_L2_BITS)
>  
>  #if V_L1_BITS_REM < 4
> -#define V_L1_BITS  (V_L1_BITS_REM + L2_BITS)
> +#define V_L1_BITS  (V_L1_BITS_REM + V_L2_BITS)
>  #else
>  #define V_L1_BITS  V_L1_BITS_REM
>  #endif
> @@ -395,18 +399,18 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>      lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
>  
>      /* Level 2..N-1.  */
> -    for (i = V_L1_SHIFT / L2_BITS - 1; i > 0; i--) {
> +    for (i = V_L1_SHIFT / V_L2_BITS - 1; i > 0; i--) {
>          void **p = *lp;
>  
>          if (p == NULL) {
>              if (!alloc) {
>                  return NULL;
>              }
> -            ALLOC(p, sizeof(void *) * L2_SIZE);
> +            ALLOC(p, sizeof(void *) * V_L2_SIZE);
>              *lp = p;
>          }
>  
> -        lp = p + ((index >> (i * L2_BITS)) & (L2_SIZE - 1));
> +        lp = p + ((index >> (i * V_L2_BITS)) & (V_L2_SIZE - 1));
>      }
>  
>      pd = *lp;
> @@ -414,13 +418,13 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>          if (!alloc) {
>              return NULL;
>          }
> -        ALLOC(pd, sizeof(PageDesc) * L2_SIZE);
> +        ALLOC(pd, sizeof(PageDesc) * V_L2_SIZE);
>          *lp = pd;
>      }
>  
>  #undef ALLOC
>  
> -    return pd + (index & (L2_SIZE - 1));
> +    return pd + (index & (V_L2_SIZE - 1));
>  }
>  
>  static inline PageDesc *page_find(tb_page_addr_t index)
> @@ -655,14 +659,14 @@ static void page_flush_tb_1(int level, void **lp)
>      if (level == 0) {
>          PageDesc *pd = *lp;
>  
> -        for (i = 0; i < L2_SIZE; ++i) {
> +        for (i = 0; i < V_L2_SIZE; ++i) {
>              pd[i].first_tb = NULL;
>              invalidate_page_bitmap(pd + i);
>          }
>      } else {
>          void **pp = *lp;
>  
> -        for (i = 0; i < L2_SIZE; ++i) {
> +        for (i = 0; i < V_L2_SIZE; ++i) {
>              page_flush_tb_1(level - 1, pp + i);
>          }
>      }
> @@ -673,7 +677,7 @@ static void page_flush_tb(void)
>      int i;
>  
>      for (i = 0; i < V_L1_SIZE; i++) {
> -        page_flush_tb_1(V_L1_SHIFT / L2_BITS - 1, l1_map + i);
> +        page_flush_tb_1(V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
>      }
>  }
>  
> @@ -1600,7 +1604,7 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
>      if (level == 0) {
>          PageDesc *pd = *lp;
>  
> -        for (i = 0; i < L2_SIZE; ++i) {
> +        for (i = 0; i < V_L2_SIZE; ++i) {
>              int prot = pd[i].flags;
>  
>              pa = base | (i << TARGET_PAGE_BITS);
> @@ -1614,9 +1618,9 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
>      } else {
>          void **pp = *lp;
>  
> -        for (i = 0; i < L2_SIZE; ++i) {
> +        for (i = 0; i < V_L2_SIZE; ++i) {
>              pa = base | ((abi_ulong)i <<
> -                (TARGET_PAGE_BITS + L2_BITS * level));
> +                (TARGET_PAGE_BITS + V_L2_BITS * level));
>              rc = walk_memory_regions_1(data, pa, level - 1, pp + i);
>              if (rc != 0) {
>                  return rc;
> @@ -1639,7 +1643,7 @@ int walk_memory_regions(void *priv, walk_memory_regions_fn fn)
>  
>      for (i = 0; i < V_L1_SIZE; i++) {
>          int rc = walk_memory_regions_1(&data, (abi_ulong)i << V_L1_SHIFT,
> -                                       V_L1_SHIFT / L2_BITS - 1, l1_map + i);
> +                                       V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
>  
>          if (rc != 0) {
>              return rc;
> diff --git a/translate-all.h b/translate-all.h
> index 5c38819..f7e5932 100644
> --- a/translate-all.h
> +++ b/translate-all.h
> @@ -19,13 +19,6 @@
>  #ifndef TRANSLATE_ALL_H
>  #define TRANSLATE_ALL_H
>  
> -/* Size of the L2 (and L3, etc) page tables.  */
> -#define L2_BITS 10
> -#define L2_SIZE (1 << L2_BITS)
> -
> -#define P_L2_LEVELS \
> -    (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
> -
>  /* translate-all.c */
>  void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
>  void cpu_unlink_tb(CPUState *cpu);
> 
> Paolo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04 11:14       ` Michael S. Tsirkin
@ 2013-11-04 11:22         ` Paolo Bonzini
  2013-11-04 12:04           ` Michael S. Tsirkin
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2013-11-04 11:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Marcel Apfelbaum, Jan Kiszka, qemu-devel,
	Andreas Färber, Richard Henderson

Il 04/11/2013 12:14, Michael S. Tsirkin ha scritto:
>> > 
>> > This patch looks good; however, on top of it can you test 
>> > kvm-unit-tests with TARGET_PHYS_ADDR_SPACE_BITS=64 and see whether 
>> > there is a measurable slowdown (in the inl_from_qemu tests)?  If not, 
>> > we can just get rid of TARGET_PHYS_ADDR_SPACE_BITS in exec.c.
> I'd rather we fixed a bug first - we need to fix it on stable too - any
> cleanups can come on top.

This is not necessarily a cleanup.  Getting rid of
TARGET_PHYS_ADDR_SPACE_BITS in exec.c means fixing device-to-device DMA
bugs for example.

Of course a smaller patch can be done that avoids the renaming of L2_*
constants.

> Also, I'm not sure what will this test tell
> us: inl reads io space, not memory, right?

The number of levels in the dispatch radix tree is independent of the
size of the AddressSpace; it is P_L2_LEVELS for both the 64K io space
and the 2^TARGET_PHYS_ADDRESS_SPACE_BITS memory space.

Paolo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04 11:22         ` Paolo Bonzini
@ 2013-11-04 12:04           ` Michael S. Tsirkin
  2013-11-04 12:11             ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2013-11-04 12:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Maydell, Marcel Apfelbaum, Jan Kiszka, qemu-devel,
	Andreas Färber, Richard Henderson

On Mon, Nov 04, 2013 at 12:22:35PM +0100, Paolo Bonzini wrote:
> Il 04/11/2013 12:14, Michael S. Tsirkin ha scritto:
> >> > 
> >> > This patch looks good; however, on top of it can you test 
> >> > kvm-unit-tests with TARGET_PHYS_ADDR_SPACE_BITS=64 and see whether 
> >> > there is a measurable slowdown (in the inl_from_qemu tests)?  If not, 
> >> > we can just get rid of TARGET_PHYS_ADDR_SPACE_BITS in exec.c.
> > I'd rather we fixed a bug first - we need to fix it on stable too - any
> > cleanups can come on top.
> 
> This is not necessarily a cleanup.  Getting rid of
> TARGET_PHYS_ADDR_SPACE_BITS in exec.c means fixing device-to-device DMA
> bugs for example.
> 
> Of course a smaller patch can be done that avoids the renaming of L2_*
> constants.
> 
> > Also, I'm not sure what will this test tell
> > us: inl reads io space, not memory, right?
> 
> The number of levels in the dispatch radix tree is independent of the
> size of the AddressSpace; it is P_L2_LEVELS for both the 64K io space
> and the 2^TARGET_PHYS_ADDRESS_SPACE_BITS memory space.
> 
> Paolo

Hmm I think it's *at most* that deep but can be more shallow, no?

-- 
MST

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04 12:04           ` Michael S. Tsirkin
@ 2013-11-04 12:11             ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2013-11-04 12:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Marcel Apfelbaum, Jan Kiszka, qemu-devel,
	Andreas Färber, Richard Henderson

Il 04/11/2013 13:04, Michael S. Tsirkin ha scritto:
> > > > Also, I'm not sure what will this test tell
> > > > us: inl reads io space, not memory, right?
> > 
> > The number of levels in the dispatch radix tree is independent of the
> > size of the AddressSpace; it is P_L2_LEVELS for both the 64K io space
> > and the 2^TARGET_PHYS_ADDRESS_SPACE_BITS memory space.
> 
> Hmm I think it's *at most* that deep but can be more shallow, no?

Yes, but it gets more shallow only if you have very large regions (which
is obviously not the case for io space).  The levels always cover the
same bit range (e.g. bits 42-51 for the first level on x86).

Paolo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH] exec: limit system memory size
  2013-11-04  6:06 [Qemu-devel] [PATCH] exec: limit system memory size Michael S. Tsirkin
  2013-11-04  6:15 ` Michael S. Tsirkin
  2013-11-04  9:50 ` Marcel Apfelbaum
@ 2013-11-04 12:26 ` Paolo Bonzini
  2 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2013-11-04 12:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, marcel.a, Jan Kiszka, qemu-devel,
	Andreas Färber, Richard Henderson

Il 04/11/2013 07:06, Michael S. Tsirkin ha scritto:
> The page table logic in exec.c assumes
> that memory addresses are at most TARGET_PHYS_ADDR_SPACE_BITS.
> 
> But pci addresses are full 64 bit so if we try to render them ignoring
> the extra bits, we get strange effects with sections overlapping each
> other.
> 
> To fix, simply limit the system memory size to
>  1 << TARGET_PHYS_ADDR_SPACE_BITS,
> pci addresses will be rendered within that.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  exec.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/exec.c b/exec.c
> index 030118e..c7a8df5 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1801,7 +1801,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
>  static void memory_map_init(void)
>  {
>      system_memory = g_malloc(sizeof(*system_memory));
> -    memory_region_init(system_memory, NULL, "system", INT64_MAX);
> +
> +    assert(TARGET_PHYS_ADDR_SPACE_BITS <= 64);
> +
> +    memory_region_init(system_memory, NULL, "system",
> +                       TARGET_PHYS_ADDR_SPACE_BITS == 64 ?
> +                       UINT64_MAX : (0x1ULL << TARGET_PHYS_ADDR_SPACE_BITS));
>      address_space_init(&address_space_memory, system_memory, "memory");
>  
>      system_io = g_malloc(sizeof(*system_io));
> 

You can include either this patch or Marcel's with my Reviewed-by: Paolo
Bonzini <pbonzini@redhat.com>.  I don't have any preference.


Paolo

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-11-04 12:26 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-04  6:06 [Qemu-devel] [PATCH] exec: limit system memory size Michael S. Tsirkin
2013-11-04  6:15 ` Michael S. Tsirkin
2013-11-04  9:50 ` Marcel Apfelbaum
2013-11-04 10:07   ` Michael S. Tsirkin
2013-11-04 10:54     ` Paolo Bonzini
2013-11-04 11:14       ` Michael S. Tsirkin
2013-11-04 11:22         ` Paolo Bonzini
2013-11-04 12:04           ` Michael S. Tsirkin
2013-11-04 12:11             ` Paolo Bonzini
2013-11-04 12:26 ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).