All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
@ 2025-12-16  8:00 Chuang Xu
  2025-12-16 13:25 ` Fabiano Rosas
  2025-12-17 17:01 ` Peter Xu
  0 siblings, 2 replies; 10+ messages in thread
From: Chuang Xu @ 2025-12-16  8:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, sgarzare, richard.henderson, pbonzini, peterx, david, philmd,
	farosas, xuchuangxclwt

From: xuchuangxclwt <xuchuangxclwt@bytedance.com>

In our long-term experience in Bytedance, we've found that under
the same load, live migration of larger VMs with more devices is
often more difficult to converge (requiring a larger downtime limit).

Through some testing and calculations, we conclude that bitmap sync time
affects the calculation of live migration bandwidth.

When the addresses processed are not aligned, a large number of
clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
2048 clear_dirty ioctls from two different memory_listener),
which increases the time required for bitmap_sync and makes it
more difficult for dirty pages to converge.

For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
(tested with 10GBps dirty rate, the sync time increases as the dirty
page rate increases), Here are each part of the sync time:

- sync from kvm to ram_list: 2.5ms
- vhost_log_sync:3ms
- sync aligned memory from ram_list to RAMBlock: 5ms
- sync misaligned memory from ram_list to RAMBlock: 61ms

Attempt to merge those fragmented clear_dirty ioctls, then syncing
misaligned memory from ram_list to RAMBlock takes only about 1ms,
and the total sync time is only *12ms*.

Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
---
 accel/tcg/cputlb.c       |  5 ++--
 include/system/physmem.h |  7 +++---
 migration/ram.c          | 17 ++++----------
 system/memory.c          |  2 +-
 system/physmem.c         | 49 ++++++++++++++++++++++++++++------------
 5 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index fd1606c856..c8827c8b0d 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
 void tlb_protect_code(ram_addr_t ram_addr)
 {
     physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
-                                             TARGET_PAGE_SIZE,
-                                             DIRTY_MEMORY_CODE);
+                                         TARGET_PAGE_SIZE,
+                                         DIRTY_MEMORY_CODE,
+                                         NULL);
 }
 
 /* update the TLB so that writes in physical page 'phys_addr' are no longer
diff --git a/include/system/physmem.h b/include/system/physmem.h
index 879f6eae38..8eeace9d1f 100644
--- a/include/system/physmem.h
+++ b/include/system/physmem.h
@@ -39,9 +39,10 @@ uint64_t physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
 
 void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
 
-bool physical_memory_test_and_clear_dirty(ram_addr_t start,
-                                          ram_addr_t length,
-                                          unsigned client);
+uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
+                                              ram_addr_t length,
+                                              unsigned client,
+                                              unsigned long *dest);
 
 DirtyBitmapSnapshot *
 physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
diff --git a/migration/ram.c b/migration/ram.c
index 29f016cb25..a03c9874a2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -942,7 +942,6 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                                                   ram_addr_t start,
                                                   ram_addr_t length)
 {
-    ram_addr_t addr;
     unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
     uint64_t num_dirty = 0;
     unsigned long *dest = rb->bmap;
@@ -996,17 +995,11 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
     } else {
         ram_addr_t offset = rb->offset;
 
-        for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
-            if (physical_memory_test_and_clear_dirty(
-                        start + addr + offset,
-                        TARGET_PAGE_SIZE,
-                        DIRTY_MEMORY_MIGRATION)) {
-                long k = (start + addr) >> TARGET_PAGE_BITS;
-                if (!test_and_set_bit(k, dest)) {
-                    num_dirty++;
-                }
-            }
-        }
+        num_dirty = physical_memory_test_and_clear_dirty(
+                        start + offset,
+                        length,
+                        DIRTY_MEMORY_MIGRATION,
+                        dest);
     }
 
     return num_dirty;
diff --git a/system/memory.c b/system/memory.c
index 8b84661ae3..666364392d 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
 {
     assert(mr->ram_block);
     physical_memory_test_and_clear_dirty(
-        memory_region_get_ram_addr(mr) + addr, size, client);
+        memory_region_get_ram_addr(mr) + addr, size, client, NULL);
 }
 
 int memory_region_get_fd(MemoryRegion *mr)
diff --git a/system/physmem.c b/system/physmem.c
index c9869e4049..f8b660dafe 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1089,19 +1089,31 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
     }
 }
 
-/* Note: start and end must be within the same ram block.  */
-bool physical_memory_test_and_clear_dirty(ram_addr_t start,
+/*
+ * Note: start and end must be within the same ram block.
+ *
+ * @dest usage:
+ * - When @dest is provided, set bits for newly discovered dirty pages
+ *   only if the bit wasn't already set in dest, and count those pages
+ *   in num_dirty.
+ * - When @dest is NULL, count all dirty pages in the range
+ *
+ * @return:
+ * - Number of dirty guest pages found within [start, start + length).
+ */
+uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
                                               ram_addr_t length,
-                                              unsigned client)
+                                              unsigned client,
+                                              unsigned long *dest)
 {
     DirtyMemoryBlocks *blocks;
     unsigned long end, page, start_page;
-    bool dirty = false;
+    uint64_t num_dirty = 0;
     RAMBlock *ramblock;
     uint64_t mr_offset, mr_size;
 
     if (length == 0) {
-        return false;
+        return 0;
     }
 
     end = TARGET_PAGE_ALIGN(start + length) >> TARGET_PAGE_BITS;
@@ -1118,12 +1130,19 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
         while (page < end) {
             unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
             unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
-            unsigned long num = MIN(end - page,
-                                    DIRTY_MEMORY_BLOCK_SIZE - offset);
 
-            dirty |= bitmap_test_and_clear_atomic(blocks->blocks[idx],
-                                                  offset, num);
-            page += num;
+            if (bitmap_test_and_clear_atomic(blocks->blocks[idx], offset, 1)) {
+                if (dest) {
+                    unsigned long k = page - (ramblock->offset >> TARGET_PAGE_BITS);
+                    if (!test_and_set_bit(k, dest)) {
+                        num_dirty++;
+                    }
+                } else {
+                    num_dirty++;
+                }
+            }
+
+            page++;
         }
 
         mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
@@ -1131,18 +1150,18 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
         memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
     }
 
-    if (dirty) {
+    if (num_dirty) {
         physical_memory_dirty_bits_cleared(start, length);
     }
 
-    return dirty;
+    return num_dirty;
 }
 
 static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
 {
-    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
-    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
-    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
+    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, NULL);
+    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, NULL);
+    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, NULL);
 }
 
 DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-16  8:00 [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls Chuang Xu
@ 2025-12-16 13:25 ` Fabiano Rosas
  2025-12-16 16:26   ` Peter Xu
  2025-12-17 17:01 ` Peter Xu
  1 sibling, 1 reply; 10+ messages in thread
From: Fabiano Rosas @ 2025-12-16 13:25 UTC (permalink / raw)
  To: Chuang Xu, qemu-devel
  Cc: mst, sgarzare, richard.henderson, pbonzini, peterx, david, philmd,
	xuchuangxclwt

"Chuang Xu" <xuchuangxclwt@bytedance.com> writes:

> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
>
> In our long-term experience in Bytedance, we've found that under
> the same load, live migration of larger VMs with more devices is
> often more difficult to converge (requiring a larger downtime limit).
>
> Through some testing and calculations, we conclude that bitmap sync time
> affects the calculation of live migration bandwidth.
>
> When the addresses processed are not aligned, a large number of
> clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
> 2048 clear_dirty ioctls from two different memory_listener),
> which increases the time required for bitmap_sync and makes it
> more difficult for dirty pages to converge.
>
> For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
> 16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
> (tested with 10GBps dirty rate, the sync time increases as the dirty
> page rate increases), Here are each part of the sync time:
>
> - sync from kvm to ram_list: 2.5ms
> - vhost_log_sync:3ms
> - sync aligned memory from ram_list to RAMBlock: 5ms
> - sync misaligned memory from ram_list to RAMBlock: 61ms
>
> Attempt to merge those fragmented clear_dirty ioctls, then syncing
> misaligned memory from ram_list to RAMBlock takes only about 1ms,
> and the total sync time is only *12ms*.
>
> Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
> ---
>  accel/tcg/cputlb.c       |  5 ++--
>  include/system/physmem.h |  7 +++---
>  migration/ram.c          | 17 ++++----------
>  system/memory.c          |  2 +-
>  system/physmem.c         | 49 ++++++++++++++++++++++++++++------------
>  5 files changed, 47 insertions(+), 33 deletions(-)
>
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index fd1606c856..c8827c8b0d 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
>  void tlb_protect_code(ram_addr_t ram_addr)
>  {
>      physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
> -                                             TARGET_PAGE_SIZE,
> -                                             DIRTY_MEMORY_CODE);
> +                                         TARGET_PAGE_SIZE,
> +                                         DIRTY_MEMORY_CODE,
> +                                         NULL);
>  }
>  
>  /* update the TLB so that writes in physical page 'phys_addr' are no longer
> diff --git a/include/system/physmem.h b/include/system/physmem.h
> index 879f6eae38..8eeace9d1f 100644
> --- a/include/system/physmem.h
> +++ b/include/system/physmem.h
> @@ -39,9 +39,10 @@ uint64_t physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
>  
>  void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
>  
> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> -                                          ram_addr_t length,
> -                                          unsigned client);
> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
> +                                              ram_addr_t length,
> +                                              unsigned client,
> +                                              unsigned long *dest);
>  
>  DirtyBitmapSnapshot *
>  physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
> diff --git a/migration/ram.c b/migration/ram.c
> index 29f016cb25..a03c9874a2 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -942,7 +942,6 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>                                                    ram_addr_t start,
>                                                    ram_addr_t length)
>  {
> -    ram_addr_t addr;
>      unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>      uint64_t num_dirty = 0;
>      unsigned long *dest = rb->bmap;
> @@ -996,17 +995,11 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>      } else {
>          ram_addr_t offset = rb->offset;
>  
> -        for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
> -            if (physical_memory_test_and_clear_dirty(
> -                        start + addr + offset,
> -                        TARGET_PAGE_SIZE,
> -                        DIRTY_MEMORY_MIGRATION)) {
> -                long k = (start + addr) >> TARGET_PAGE_BITS;
> -                if (!test_and_set_bit(k, dest)) {
> -                    num_dirty++;
> -                }
> -            }
> -        }
> +        num_dirty = physical_memory_test_and_clear_dirty(
> +                        start + offset,
> +                        length,
> +                        DIRTY_MEMORY_MIGRATION,
> +                        dest);
>      }
>  
>      return num_dirty;
> diff --git a/system/memory.c b/system/memory.c
> index 8b84661ae3..666364392d 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
>  {
>      assert(mr->ram_block);
>      physical_memory_test_and_clear_dirty(
> -        memory_region_get_ram_addr(mr) + addr, size, client);
> +        memory_region_get_ram_addr(mr) + addr, size, client, NULL);
>  }
>  
>  int memory_region_get_fd(MemoryRegion *mr)
> diff --git a/system/physmem.c b/system/physmem.c
> index c9869e4049..f8b660dafe 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1089,19 +1089,31 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
>      }
>  }
>  
> -/* Note: start and end must be within the same ram block.  */
> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> +/*
> + * Note: start and end must be within the same ram block.
> + *
> + * @dest usage:

I'm not sure if it's just me, but I find this "dest" term quite
confusing. "bmap" might be more straight-forward.

> + * - When @dest is provided, set bits for newly discovered dirty pages
> + *   only if the bit wasn't already set in dest, and count those pages
> + *   in num_dirty.

Am I misreading the code? I don't see this "set ... only if the bit
wasn't already set" part. What I see is: "set bits, but only count those
pages if the bit wasn't already set".

> + * - When @dest is NULL, count all dirty pages in the range
> + *
> + * @return:
> + * - Number of dirty guest pages found within [start, start + length).
> + */
> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
>                                                ram_addr_t length,
> -                                              unsigned client)
> +                                              unsigned client,
> +                                              unsigned long *dest)
>  {
>      DirtyMemoryBlocks *blocks;
>      unsigned long end, page, start_page;
> -    bool dirty = false;
> +    uint64_t num_dirty = 0;
>      RAMBlock *ramblock;
>      uint64_t mr_offset, mr_size;
>  
>      if (length == 0) {
> -        return false;
> +        return 0;
>      }
>  
>      end = TARGET_PAGE_ALIGN(start + length) >> TARGET_PAGE_BITS;
> @@ -1118,12 +1130,19 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>          while (page < end) {
>              unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
>              unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
> -            unsigned long num = MIN(end - page,
> -                                    DIRTY_MEMORY_BLOCK_SIZE - offset);
>  
> -            dirty |= bitmap_test_and_clear_atomic(blocks->blocks[idx],
> -                                                  offset, num);
> -            page += num;
> +            if (bitmap_test_and_clear_atomic(blocks->blocks[idx], offset, 1)) {
> +                if (dest) {
> +                    unsigned long k = page - (ramblock->offset >> TARGET_PAGE_BITS);
> +                    if (!test_and_set_bit(k, dest)) {
> +                        num_dirty++;
> +                    }
> +                } else {
> +                    num_dirty++;
> +                }
> +            }
> +
> +            page++;
>          }
>  
>          mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
> @@ -1131,18 +1150,18 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>          memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
>      }
>  
> -    if (dirty) {
> +    if (num_dirty) {
>          physical_memory_dirty_bits_cleared(start, length);
>      }
>  
> -    return dirty;
> +    return num_dirty;
>  }
>  
>  static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
>  {
> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, NULL);
> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, NULL);
> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, NULL);
>  }
>  
>  DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-16 13:25 ` Fabiano Rosas
@ 2025-12-16 16:26   ` Peter Xu
  2025-12-17  6:46     ` Chuang Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2025-12-16 16:26 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Chuang Xu, qemu-devel, mst, sgarzare, richard.henderson, pbonzini,
	david, philmd

On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
> "Chuang Xu" <xuchuangxclwt@bytedance.com> writes:
> 
> > From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
> >
> > In our long-term experience in Bytedance, we've found that under
> > the same load, live migration of larger VMs with more devices is
> > often more difficult to converge (requiring a larger downtime limit).
> >
> > Through some testing and calculations, we conclude that bitmap sync time
> > affects the calculation of live migration bandwidth.

Side note:

I forgot to mention when replying to the old versions, but we introduced
avail-switchover-bandwidth to partially remedy this problem when we hit it
before - which may or may not be exactly the same reason here on unaligned
syncs as we didn't further investigate (we have VFIO-PCI devices when
testing), but the whole logic should be similar that bw was calculated too
small.

So even if with this patch optimizing sync, bw is always not as accurate.
I wonder if we can still fix it somehow, e.g. I wonder if 100ms is too
short a period to take samples, or at least we should be able to remember
more samples so the reported bw (even if we keep sampling per 100ms) will
cover longer period.

Feel free to share your thoughts if you have any.

> >
> > When the addresses processed are not aligned, a large number of
> > clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
> > 2048 clear_dirty ioctls from two different memory_listener),
> > which increases the time required for bitmap_sync and makes it
> > more difficult for dirty pages to converge.
> >
> > For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
> > 16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
> > (tested with 10GBps dirty rate, the sync time increases as the dirty
> > page rate increases), Here are each part of the sync time:
> >
> > - sync from kvm to ram_list: 2.5ms
> > - vhost_log_sync:3ms
> > - sync aligned memory from ram_list to RAMBlock: 5ms
> > - sync misaligned memory from ram_list to RAMBlock: 61ms
> >
> > Attempt to merge those fragmented clear_dirty ioctls, then syncing
> > misaligned memory from ram_list to RAMBlock takes only about 1ms,
> > and the total sync time is only *12ms*.
> >
> > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
> > ---
> >  accel/tcg/cputlb.c       |  5 ++--
> >  include/system/physmem.h |  7 +++---
> >  migration/ram.c          | 17 ++++----------
> >  system/memory.c          |  2 +-
> >  system/physmem.c         | 49 ++++++++++++++++++++++++++++------------
> >  5 files changed, 47 insertions(+), 33 deletions(-)
> >
> > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> > index fd1606c856..c8827c8b0d 100644
> > --- a/accel/tcg/cputlb.c
> > +++ b/accel/tcg/cputlb.c
> > @@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
> >  void tlb_protect_code(ram_addr_t ram_addr)
> >  {
> >      physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
> > -                                             TARGET_PAGE_SIZE,
> > -                                             DIRTY_MEMORY_CODE);
> > +                                         TARGET_PAGE_SIZE,
> > +                                         DIRTY_MEMORY_CODE,
> > +                                         NULL);
> >  }
> >  
> >  /* update the TLB so that writes in physical page 'phys_addr' are no longer
> > diff --git a/include/system/physmem.h b/include/system/physmem.h
> > index 879f6eae38..8eeace9d1f 100644
> > --- a/include/system/physmem.h
> > +++ b/include/system/physmem.h
> > @@ -39,9 +39,10 @@ uint64_t physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
> >  
> >  void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
> >  
> > -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> > -                                          ram_addr_t length,
> > -                                          unsigned client);
> > +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
> > +                                              ram_addr_t length,
> > +                                              unsigned client,
> > +                                              unsigned long *dest);
> >  
> >  DirtyBitmapSnapshot *
> >  physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 29f016cb25..a03c9874a2 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -942,7 +942,6 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
> >                                                    ram_addr_t start,
> >                                                    ram_addr_t length)
> >  {
> > -    ram_addr_t addr;
> >      unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
> >      uint64_t num_dirty = 0;
> >      unsigned long *dest = rb->bmap;
> > @@ -996,17 +995,11 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
> >      } else {
> >          ram_addr_t offset = rb->offset;
> >  
> > -        for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
> > -            if (physical_memory_test_and_clear_dirty(
> > -                        start + addr + offset,
> > -                        TARGET_PAGE_SIZE,
> > -                        DIRTY_MEMORY_MIGRATION)) {
> > -                long k = (start + addr) >> TARGET_PAGE_BITS;
> > -                if (!test_and_set_bit(k, dest)) {
> > -                    num_dirty++;
> > -                }
> > -            }
> > -        }
> > +        num_dirty = physical_memory_test_and_clear_dirty(
> > +                        start + offset,
> > +                        length,
> > +                        DIRTY_MEMORY_MIGRATION,
> > +                        dest);
> >      }
> >  
> >      return num_dirty;
> > diff --git a/system/memory.c b/system/memory.c
> > index 8b84661ae3..666364392d 100644
> > --- a/system/memory.c
> > +++ b/system/memory.c
> > @@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
> >  {
> >      assert(mr->ram_block);
> >      physical_memory_test_and_clear_dirty(
> > -        memory_region_get_ram_addr(mr) + addr, size, client);
> > +        memory_region_get_ram_addr(mr) + addr, size, client, NULL);
> >  }
> >  
> >  int memory_region_get_fd(MemoryRegion *mr)
> > diff --git a/system/physmem.c b/system/physmem.c
> > index c9869e4049..f8b660dafe 100644
> > --- a/system/physmem.c
> > +++ b/system/physmem.c
> > @@ -1089,19 +1089,31 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
> >      }
> >  }
> >  
> > -/* Note: start and end must be within the same ram block.  */
> > -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> > +/*
> > + * Note: start and end must be within the same ram block.
> > + *
> > + * @dest usage:
> 
> I'm not sure if it's just me, but I find this "dest" term quite
> confusing. "bmap" might be more straight-forward.
> 
> > + * - When @dest is provided, set bits for newly discovered dirty pages
> > + *   only if the bit wasn't already set in dest, and count those pages
> > + *   in num_dirty.
> 
> Am I misreading the code? I don't see this "set ... only if the bit
> wasn't already set" part. What I see is: "set bits, but only count those
> pages if the bit wasn't already set".

Agrees on both points.. one more thing to mention below.

> 
> > + * - When @dest is NULL, count all dirty pages in the range
> > + *
> > + * @return:
> > + * - Number of dirty guest pages found within [start, start + length).
> > + */
> > +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
> >                                                ram_addr_t length,
> > -                                              unsigned client)
> > +                                              unsigned client,
> > +                                              unsigned long *dest)
> >  {
> >      DirtyMemoryBlocks *blocks;
> >      unsigned long end, page, start_page;
> > -    bool dirty = false;
> > +    uint64_t num_dirty = 0;
> >      RAMBlock *ramblock;
> >      uint64_t mr_offset, mr_size;
> >  
> >      if (length == 0) {
> > -        return false;
> > +        return 0;
> >      }
> >  
> >      end = TARGET_PAGE_ALIGN(start + length) >> TARGET_PAGE_BITS;
> > @@ -1118,12 +1130,19 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> >          while (page < end) {
> >              unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
> >              unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
> > -            unsigned long num = MIN(end - page,
> > -                                    DIRTY_MEMORY_BLOCK_SIZE - offset);
> >  
> > -            dirty |= bitmap_test_and_clear_atomic(blocks->blocks[idx],
> > -                                                  offset, num);
> > -            page += num;
> > +            if (bitmap_test_and_clear_atomic(blocks->blocks[idx], offset, 1)) {
> > +                if (dest) {
> > +                    unsigned long k = page - (ramblock->offset >> TARGET_PAGE_BITS);
> > +                    if (!test_and_set_bit(k, dest)) {
> > +                        num_dirty++;
> > +                    }
> > +                } else {
> > +                    num_dirty++;
> > +                }
> > +            }
> > +
> > +            page++;

Sorry I could have mentioned this in the previous version: IMHO it'll still
be nice to keep the one atomic operations for !dest/!bmap case over "num".
There's no reason we need to introduce even any slightest regression in
those paths.

Thanks,

> >          }
> >  
> >          mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
> > @@ -1131,18 +1150,18 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> >          memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
> >      }
> >  
> > -    if (dirty) {
> > +    if (num_dirty) {
> >          physical_memory_dirty_bits_cleared(start, length);
> >      }
> >  
> > -    return dirty;
> > +    return num_dirty;
> >  }
> >  
> >  static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
> >  {
> > -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
> > -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
> > -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
> > +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, NULL);
> > +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, NULL);
> > +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, NULL);
> >  }
> >  
> >  DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-16 16:26   ` Peter Xu
@ 2025-12-17  6:46     ` Chuang Xu
  2025-12-17 13:21       ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Chuang Xu @ 2025-12-17  6:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, mst, sgarzare, richard.henderson,
	pbonzini, david, philmd

On 17/12/2025 00:26, Peter Xu wrote:
> On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
>> "Chuang Xu" <xuchuangxclwt@bytedance.com> writes:
>>
>>> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
>>>
>>> In our long-term experience in Bytedance, we've found that under
>>> the same load, live migration of larger VMs with more devices is
>>> often more difficult to converge (requiring a larger downtime limit).
>>>
>>> Through some testing and calculations, we conclude that bitmap sync time
>>> affects the calculation of live migration bandwidth.
> Side note:
>
> I forgot to mention when replying to the old versions, but we introduced
> avail-switchover-bandwidth to partially remedy this problem when we hit it
> before - which may or may not be exactly the same reason here on unaligned
> syncs as we didn't further investigate (we have VFIO-PCI devices when
> testing), but the whole logic should be similar that bw was calculated too
> small.
In bytedance, we also migrate vms with vfio devices, which also suffer from
the issue of long vfio bitmap sync time for large vm.
>
> So even if with this patch optimizing sync, bw is always not as accurate.
> I wonder if we can still fix it somehow, e.g. I wonder if 100ms is too
> short a period to take samples, or at least we should be able to remember
> more samples so the reported bw (even if we keep sampling per 100ms) will
> cover longer period.
>
> Feel free to share your thoughts if you have any.
>
FYI:
Initially, when I encountered the problem of large vm migration hard to 
converge,
I tried subtracting the bitmap sync time from the bandwidth calculation,
which alleviated the problem somewhat. However, through formula calculation,
I found that this did not completely solve the problem. Therefore, I 
decided to
conduct specific analysis for specific scenario to minimize the bitmap 
sync time
as much as possible.
>>> When the addresses processed are not aligned, a large number of
>>> clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
>>> 2048 clear_dirty ioctls from two different memory_listener),
>>> which increases the time required for bitmap_sync and makes it
>>> more difficult for dirty pages to converge.
>>>
>>> For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
>>> 16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
>>> (tested with 10GBps dirty rate, the sync time increases as the dirty
>>> page rate increases), Here are each part of the sync time:
>>>
>>> - sync from kvm to ram_list: 2.5ms
>>> - vhost_log_sync:3ms
>>> - sync aligned memory from ram_list to RAMBlock: 5ms
>>> - sync misaligned memory from ram_list to RAMBlock: 61ms
>>>
>>> Attempt to merge those fragmented clear_dirty ioctls, then syncing
>>> misaligned memory from ram_list to RAMBlock takes only about 1ms,
>>> and the total sync time is only *12ms*.
>>>
>>> Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
>>> ---
>>>   accel/tcg/cputlb.c       |  5 ++--
>>>   include/system/physmem.h |  7 +++---
>>>   migration/ram.c          | 17 ++++----------
>>>   system/memory.c          |  2 +-
>>>   system/physmem.c         | 49 ++++++++++++++++++++++++++++------------
>>>   5 files changed, 47 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
>>> index fd1606c856..c8827c8b0d 100644
>>> --- a/accel/tcg/cputlb.c
>>> +++ b/accel/tcg/cputlb.c
>>> @@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
>>>   void tlb_protect_code(ram_addr_t ram_addr)
>>>   {
>>>       physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
>>> -                                             TARGET_PAGE_SIZE,
>>> -                                             DIRTY_MEMORY_CODE);
>>> +                                         TARGET_PAGE_SIZE,
>>> +                                         DIRTY_MEMORY_CODE,
>>> +                                         NULL);
>>>   }
>>>   
>>>   /* update the TLB so that writes in physical page 'phys_addr' are no longer
>>> diff --git a/include/system/physmem.h b/include/system/physmem.h
>>> index 879f6eae38..8eeace9d1f 100644
>>> --- a/include/system/physmem.h
>>> +++ b/include/system/physmem.h
>>> @@ -39,9 +39,10 @@ uint64_t physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
>>>   
>>>   void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
>>>   
>>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>> -                                          ram_addr_t length,
>>> -                                          unsigned client);
>>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
>>> +                                              ram_addr_t length,
>>> +                                              unsigned client,
>>> +                                              unsigned long *dest);
>>>   
>>>   DirtyBitmapSnapshot *
>>>   physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index 29f016cb25..a03c9874a2 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -942,7 +942,6 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>>                                                     ram_addr_t start,
>>>                                                     ram_addr_t length)
>>>   {
>>> -    ram_addr_t addr;
>>>       unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>>>       uint64_t num_dirty = 0;
>>>       unsigned long *dest = rb->bmap;
>>> @@ -996,17 +995,11 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>>       } else {
>>>           ram_addr_t offset = rb->offset;
>>>   
>>> -        for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
>>> -            if (physical_memory_test_and_clear_dirty(
>>> -                        start + addr + offset,
>>> -                        TARGET_PAGE_SIZE,
>>> -                        DIRTY_MEMORY_MIGRATION)) {
>>> -                long k = (start + addr) >> TARGET_PAGE_BITS;
>>> -                if (!test_and_set_bit(k, dest)) {
>>> -                    num_dirty++;
>>> -                }
>>> -            }
>>> -        }
>>> +        num_dirty = physical_memory_test_and_clear_dirty(
>>> +                        start + offset,
>>> +                        length,
>>> +                        DIRTY_MEMORY_MIGRATION,
>>> +                        dest);
>>>       }
>>>   
>>>       return num_dirty;
>>> diff --git a/system/memory.c b/system/memory.c
>>> index 8b84661ae3..666364392d 100644
>>> --- a/system/memory.c
>>> +++ b/system/memory.c
>>> @@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
>>>   {
>>>       assert(mr->ram_block);
>>>       physical_memory_test_and_clear_dirty(
>>> -        memory_region_get_ram_addr(mr) + addr, size, client);
>>> +        memory_region_get_ram_addr(mr) + addr, size, client, NULL);
>>>   }
>>>   
>>>   int memory_region_get_fd(MemoryRegion *mr)
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index c9869e4049..f8b660dafe 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -1089,19 +1089,31 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
>>>       }
>>>   }
>>>   
>>> -/* Note: start and end must be within the same ram block.  */
>>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>> +/*
>>> + * Note: start and end must be within the same ram block.
>>> + *
>>> + * @dest usage:
>> I'm not sure if it's just me, but I find this "dest" term quite
>> confusing. "bmap" might be more straight-forward.
>>
>>> + * - When @dest is provided, set bits for newly discovered dirty pages
>>> + *   only if the bit wasn't already set in dest, and count those pages
>>> + *   in num_dirty.
>> Am I misreading the code? I don't see this "set ... only if the bit
>> wasn't already set" part. What I see is: "set bits, but only count those
>> pages if the bit wasn't already set".
> Agrees on both points.. one more thing to mention below.
This will be fixed in next version.
>>> + * - When @dest is NULL, count all dirty pages in the range
>>> + *
>>> + * @return:
>>> + * - Number of dirty guest pages found within [start, start + length).
>>> + */
>>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>                                                 ram_addr_t length,
>>> -                                              unsigned client)
>>> +                                              unsigned client,
>>> +                                              unsigned long *dest)
>>>   {
>>>       DirtyMemoryBlocks *blocks;
>>>       unsigned long end, page, start_page;
>>> -    bool dirty = false;
>>> +    uint64_t num_dirty = 0;
>>>       RAMBlock *ramblock;
>>>       uint64_t mr_offset, mr_size;
>>>   
>>>       if (length == 0) {
>>> -        return false;
>>> +        return 0;
>>>       }
>>>   
>>>       end = TARGET_PAGE_ALIGN(start + length) >> TARGET_PAGE_BITS;
>>> @@ -1118,12 +1130,19 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>           while (page < end) {
>>>               unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
>>>               unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
>>> -            unsigned long num = MIN(end - page,
>>> -                                    DIRTY_MEMORY_BLOCK_SIZE - offset);
>>>   
>>> -            dirty |= bitmap_test_and_clear_atomic(blocks->blocks[idx],
>>> -                                                  offset, num);
>>> -            page += num;
>>> +            if (bitmap_test_and_clear_atomic(blocks->blocks[idx], offset, 1)) {
>>> +                if (dest) {
>>> +                    unsigned long k = page - (ramblock->offset >> TARGET_PAGE_BITS);
>>> +                    if (!test_and_set_bit(k, dest)) {
>>> +                        num_dirty++;
>>> +                    }
>>> +                } else {
>>> +                    num_dirty++;
>>> +                }
>>> +            }
>>> +
>>> +            page++;
> Sorry I could have mentioned this in the previous version: IMHO it'll still
> be nice to keep the one atomic operations for !dest/!bmap case over "num".
> There's no reason we need to introduce even any slightest regression in
> those paths.
>
> Thanks,

bitmap_test_and_clear_atomic returns bool, not the number of bits cleared.
So for !bmap case we can only return whether there is dirty, not the number
of dirty, and this might be a bit confusing.

>>>           }
>>>   
>>>           mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
>>> @@ -1131,18 +1150,18 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>           memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
>>>       }
>>>   
>>> -    if (dirty) {
>>> +    if (num_dirty) {
>>>           physical_memory_dirty_bits_cleared(start, length);
>>>       }
>>>   
>>> -    return dirty;
>>> +    return num_dirty;
>>>   }
>>>   
>>>   static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
>>>   {
>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, NULL);
>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, NULL);
>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, NULL);
>>>   }
>>>   
>>>   DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-17  6:46     ` Chuang Xu
@ 2025-12-17 13:21       ` Peter Xu
  2025-12-17 13:43         ` Chuang Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2025-12-17 13:21 UTC (permalink / raw)
  To: Chuang Xu
  Cc: qemu-devel, Fabiano Rosas, mst, sgarzare, richard.henderson,
	pbonzini, david, philmd

On Wed, Dec 17, 2025 at 02:46:58PM +0800, Chuang Xu wrote:
> On 17/12/2025 00:26, Peter Xu wrote:
> > On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
> >> "Chuang Xu" <xuchuangxclwt@bytedance.com> writes:
> >>
> >>> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
> >>>
> >>> In our long-term experience in Bytedance, we've found that under
> >>> the same load, live migration of larger VMs with more devices is
> >>> often more difficult to converge (requiring a larger downtime limit).
> >>>
> >>> Through some testing and calculations, we conclude that bitmap sync time
> >>> affects the calculation of live migration bandwidth.
> > Side note:
> >
> > I forgot to mention when replying to the old versions, but we introduced
> > avail-switchover-bandwidth to partially remedy this problem when we hit it
> > before - which may or may not be exactly the same reason here on unaligned
> > syncs as we didn't further investigate (we have VFIO-PCI devices when
> > testing), but the whole logic should be similar that bw was calculated too
> > small.
> In bytedance, we also migrate vms with vfio devices, which also suffer from
> the issue of long vfio bitmap sync time for large vm.
> >
> > So even if with this patch optimizing sync, bw is always not as accurate.
> > I wonder if we can still fix it somehow, e.g. I wonder if 100ms is too
> > short a period to take samples, or at least we should be able to remember
> > more samples so the reported bw (even if we keep sampling per 100ms) will
> > cover longer period.
> >
> > Feel free to share your thoughts if you have any.
> >
> FYI:
> Initially, when I encountered the problem of large vm migration hard to 
> converge,
> I tried subtracting the bitmap sync time from the bandwidth calculation,
> which alleviated the problem somewhat. However, through formula calculation,
> I found that this did not completely solve the problem. Therefore, I 

If you ruled out sync time, why the bw is still not accurate?  Have you
investigated that?

Maybe there's something else happening besides the sync period you
excluded.

> decided to
> conduct specific analysis for specific scenario to minimize the bitmap 
> sync time
> as much as possible.
> >>> When the addresses processed are not aligned, a large number of
> >>> clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
> >>> 2048 clear_dirty ioctls from two different memory_listener),
> >>> which increases the time required for bitmap_sync and makes it
> >>> more difficult for dirty pages to converge.
> >>>
> >>> For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
> >>> 16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
> >>> (tested with 10GBps dirty rate, the sync time increases as the dirty
> >>> page rate increases), Here are each part of the sync time:
> >>>
> >>> - sync from kvm to ram_list: 2.5ms
> >>> - vhost_log_sync:3ms
> >>> - sync aligned memory from ram_list to RAMBlock: 5ms
> >>> - sync misaligned memory from ram_list to RAMBlock: 61ms
> >>>
> >>> Attempt to merge those fragmented clear_dirty ioctls, then syncing
> >>> misaligned memory from ram_list to RAMBlock takes only about 1ms,
> >>> and the total sync time is only *12ms*.
> >>>
> >>> Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
> >>> ---
> >>>   accel/tcg/cputlb.c       |  5 ++--
> >>>   include/system/physmem.h |  7 +++---
> >>>   migration/ram.c          | 17 ++++----------
> >>>   system/memory.c          |  2 +-
> >>>   system/physmem.c         | 49 ++++++++++++++++++++++++++++------------
> >>>   5 files changed, 47 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> >>> index fd1606c856..c8827c8b0d 100644
> >>> --- a/accel/tcg/cputlb.c
> >>> +++ b/accel/tcg/cputlb.c
> >>> @@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
> >>>   void tlb_protect_code(ram_addr_t ram_addr)
> >>>   {
> >>>       physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
> >>> -                                             TARGET_PAGE_SIZE,
> >>> -                                             DIRTY_MEMORY_CODE);
> >>> +                                         TARGET_PAGE_SIZE,
> >>> +                                         DIRTY_MEMORY_CODE,
> >>> +                                         NULL);
> >>>   }
> >>>   
> >>>   /* update the TLB so that writes in physical page 'phys_addr' are no longer
> >>> diff --git a/include/system/physmem.h b/include/system/physmem.h
> >>> index 879f6eae38..8eeace9d1f 100644
> >>> --- a/include/system/physmem.h
> >>> +++ b/include/system/physmem.h
> >>> @@ -39,9 +39,10 @@ uint64_t physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
> >>>   
> >>>   void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
> >>>   
> >>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> >>> -                                          ram_addr_t length,
> >>> -                                          unsigned client);
> >>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
> >>> +                                              ram_addr_t length,
> >>> +                                              unsigned client,
> >>> +                                              unsigned long *dest);
> >>>   
> >>>   DirtyBitmapSnapshot *
> >>>   physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
> >>> diff --git a/migration/ram.c b/migration/ram.c
> >>> index 29f016cb25..a03c9874a2 100644
> >>> --- a/migration/ram.c
> >>> +++ b/migration/ram.c
> >>> @@ -942,7 +942,6 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
> >>>                                                     ram_addr_t start,
> >>>                                                     ram_addr_t length)
> >>>   {
> >>> -    ram_addr_t addr;
> >>>       unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
> >>>       uint64_t num_dirty = 0;
> >>>       unsigned long *dest = rb->bmap;
> >>> @@ -996,17 +995,11 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
> >>>       } else {
> >>>           ram_addr_t offset = rb->offset;
> >>>   
> >>> -        for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
> >>> -            if (physical_memory_test_and_clear_dirty(
> >>> -                        start + addr + offset,
> >>> -                        TARGET_PAGE_SIZE,
> >>> -                        DIRTY_MEMORY_MIGRATION)) {
> >>> -                long k = (start + addr) >> TARGET_PAGE_BITS;
> >>> -                if (!test_and_set_bit(k, dest)) {
> >>> -                    num_dirty++;
> >>> -                }
> >>> -            }
> >>> -        }
> >>> +        num_dirty = physical_memory_test_and_clear_dirty(
> >>> +                        start + offset,
> >>> +                        length,
> >>> +                        DIRTY_MEMORY_MIGRATION,
> >>> +                        dest);
> >>>       }
> >>>   
> >>>       return num_dirty;
> >>> diff --git a/system/memory.c b/system/memory.c
> >>> index 8b84661ae3..666364392d 100644
> >>> --- a/system/memory.c
> >>> +++ b/system/memory.c
> >>> @@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
> >>>   {
> >>>       assert(mr->ram_block);
> >>>       physical_memory_test_and_clear_dirty(
> >>> -        memory_region_get_ram_addr(mr) + addr, size, client);
> >>> +        memory_region_get_ram_addr(mr) + addr, size, client, NULL);
> >>>   }
> >>>   
> >>>   int memory_region_get_fd(MemoryRegion *mr)
> >>> diff --git a/system/physmem.c b/system/physmem.c
> >>> index c9869e4049..f8b660dafe 100644
> >>> --- a/system/physmem.c
> >>> +++ b/system/physmem.c
> >>> @@ -1089,19 +1089,31 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
> >>>       }
> >>>   }
> >>>   
> >>> -/* Note: start and end must be within the same ram block.  */
> >>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> >>> +/*
> >>> + * Note: start and end must be within the same ram block.
> >>> + *
> >>> + * @dest usage:
> >> I'm not sure if it's just me, but I find this "dest" term quite
> >> confusing. "bmap" might be more straight-forward.
> >>
> >>> + * - When @dest is provided, set bits for newly discovered dirty pages
> >>> + *   only if the bit wasn't already set in dest, and count those pages
> >>> + *   in num_dirty.
> >> Am I misreading the code? I don't see this "set ... only if the bit
> >> wasn't already set" part. What I see is: "set bits, but only count those
> >> pages if the bit wasn't already set".
> > Agrees on both points.. one more thing to mention below.
> This will be fixed in next version.
> >>> + * - When @dest is NULL, count all dirty pages in the range
> >>> + *
> >>> + * @return:
> >>> + * - Number of dirty guest pages found within [start, start + length).
> >>> + */
> >>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
> >>>                                                 ram_addr_t length,
> >>> -                                              unsigned client)
> >>> +                                              unsigned client,
> >>> +                                              unsigned long *dest)
> >>>   {
> >>>       DirtyMemoryBlocks *blocks;
> >>>       unsigned long end, page, start_page;
> >>> -    bool dirty = false;
> >>> +    uint64_t num_dirty = 0;
> >>>       RAMBlock *ramblock;
> >>>       uint64_t mr_offset, mr_size;
> >>>   
> >>>       if (length == 0) {
> >>> -        return false;
> >>> +        return 0;
> >>>       }
> >>>   
> >>>       end = TARGET_PAGE_ALIGN(start + length) >> TARGET_PAGE_BITS;
> >>> @@ -1118,12 +1130,19 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> >>>           while (page < end) {
> >>>               unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
> >>>               unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
> >>> -            unsigned long num = MIN(end - page,
> >>> -                                    DIRTY_MEMORY_BLOCK_SIZE - offset);
> >>>   
> >>> -            dirty |= bitmap_test_and_clear_atomic(blocks->blocks[idx],
> >>> -                                                  offset, num);
> >>> -            page += num;
> >>> +            if (bitmap_test_and_clear_atomic(blocks->blocks[idx], offset, 1)) {
> >>> +                if (dest) {
> >>> +                    unsigned long k = page - (ramblock->offset >> TARGET_PAGE_BITS);
> >>> +                    if (!test_and_set_bit(k, dest)) {
> >>> +                        num_dirty++;
> >>> +                    }
> >>> +                } else {
> >>> +                    num_dirty++;
> >>> +                }
> >>> +            }
> >>> +
> >>> +            page++;
> > Sorry I could have mentioned this in the previous version: IMHO it'll still
> > be nice to keep the one atomic operations for !dest/!bmap case over "num".
> > There's no reason we need to introduce even any slightest regression in
> > those paths.
> >
> > Thanks,
> 
> bitmap_test_and_clear_atomic returns bool, not the number of bits cleared.
> So for !bmap case we can only return whether there is dirty, not the number
> of dirty, and this might be a bit confusing.

Ah, right..

Looks like we only have two real users of this API that clears more than
one target page (tcx_reset, qemu_ram_resize), I assume they're not perf
critical as of now.  When it comes, it should be easy to optimize.

Unless others have concerns, IMHO we can go with the current one until
later.  Feel free to ignore this comment.

Thanks,

> 
> >>>           }
> >>>   
> >>>           mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
> >>> @@ -1131,18 +1150,18 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> >>>           memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
> >>>       }
> >>>   
> >>> -    if (dirty) {
> >>> +    if (num_dirty) {
> >>>           physical_memory_dirty_bits_cleared(start, length);
> >>>       }
> >>>   
> >>> -    return dirty;
> >>> +    return num_dirty;
> >>>   }
> >>>   
> >>>   static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
> >>>   {
> >>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
> >>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
> >>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
> >>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, NULL);
> >>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, NULL);
> >>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, NULL);
> >>>   }
> >>>   
> >>>   DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-17 13:21       ` Peter Xu
@ 2025-12-17 13:43         ` Chuang Xu
  2025-12-17 14:59           ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Chuang Xu @ 2025-12-17 13:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, mst, sgarzare, richard.henderson,
	pbonzini, david, philmd

On 17/12/2025 21:21, Peter Xu wrote:
> On Wed, Dec 17, 2025 at 02:46:58PM +0800, Chuang Xu wrote:
>> On 17/12/2025 00:26, Peter Xu wrote:
>>> On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
>>>> "Chuang Xu" <xuchuangxclwt@bytedance.com> writes:
>>>>
>>>>> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
>>>>>
>>>>> In our long-term experience in Bytedance, we've found that under
>>>>> the same load, live migration of larger VMs with more devices is
>>>>> often more difficult to converge (requiring a larger downtime limit).
>>>>>
>>>>> Through some testing and calculations, we conclude that bitmap sync time
>>>>> affects the calculation of live migration bandwidth.
>>> Side note:
>>>
>>> I forgot to mention when replying to the old versions, but we introduced
>>> avail-switchover-bandwidth to partially remedy this problem when we hit it
>>> before - which may or may not be exactly the same reason here on unaligned
>>> syncs as we didn't further investigate (we have VFIO-PCI devices when
>>> testing), but the whole logic should be similar that bw was calculated too
>>> small.
>> In bytedance, we also migrate vms with vfio devices, which also suffer from
>> the issue of long vfio bitmap sync time for large vm.
>>> So even if with this patch optimizing sync, bw is always not as accurate.
>>> I wonder if we can still fix it somehow, e.g. I wonder if 100ms is too
>>> short a period to take samples, or at least we should be able to remember
>>> more samples so the reported bw (even if we keep sampling per 100ms) will
>>> cover longer period.
>>>
>>> Feel free to share your thoughts if you have any.
>>>
>> FYI:
>> Initially, when I encountered the problem of large vm migration hard to
>> converge,
>> I tried subtracting the bitmap sync time from the bandwidth calculation,
>> which alleviated the problem somewhat. However, through formula calculation,
>> I found that this did not completely solve the problem. Therefore, I
> If you ruled out sync time, why the bw is still not accurate?  Have you
> investigated that?
>
> Maybe there's something else happening besides the sync period you
> excluded.

Referring to the formula I wrote in the cover, after subtracting sync time,

we get the prerequisite that R=B. Substituting this condition into the

subsequent formula derivation(B * t = D * (x + t) and R * y > D * (x + t)),

we will eventually get y > D * x / (B - D).

This means that even if our bandwidth calculations are correct,

the sync time can still affect our judgment of downtime conditions.

>> decided to
>> conduct specific analysis for specific scenario to minimize the bitmap
>> sync time
>> as much as possible.
>>>>> When the addresses processed are not aligned, a large number of
>>>>> clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
>>>>> 2048 clear_dirty ioctls from two different memory_listener),
>>>>> which increases the time required for bitmap_sync and makes it
>>>>> more difficult for dirty pages to converge.
>>>>>
>>>>> For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
>>>>> 16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
>>>>> (tested with 10GBps dirty rate, the sync time increases as the dirty
>>>>> page rate increases), Here are each part of the sync time:
>>>>>
>>>>> - sync from kvm to ram_list: 2.5ms
>>>>> - vhost_log_sync:3ms
>>>>> - sync aligned memory from ram_list to RAMBlock: 5ms
>>>>> - sync misaligned memory from ram_list to RAMBlock: 61ms
>>>>>
>>>>> Attempt to merge those fragmented clear_dirty ioctls, then syncing
>>>>> misaligned memory from ram_list to RAMBlock takes only about 1ms,
>>>>> and the total sync time is only *12ms*.
>>>>>
>>>>> Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
>>>>> ---
>>>>>    accel/tcg/cputlb.c       |  5 ++--
>>>>>    include/system/physmem.h |  7 +++---
>>>>>    migration/ram.c          | 17 ++++----------
>>>>>    system/memory.c          |  2 +-
>>>>>    system/physmem.c         | 49 ++++++++++++++++++++++++++++------------
>>>>>    5 files changed, 47 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
>>>>> index fd1606c856..c8827c8b0d 100644
>>>>> --- a/accel/tcg/cputlb.c
>>>>> +++ b/accel/tcg/cputlb.c
>>>>> @@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
>>>>>    void tlb_protect_code(ram_addr_t ram_addr)
>>>>>    {
>>>>>        physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
>>>>> -                                             TARGET_PAGE_SIZE,
>>>>> -                                             DIRTY_MEMORY_CODE);
>>>>> +                                         TARGET_PAGE_SIZE,
>>>>> +                                         DIRTY_MEMORY_CODE,
>>>>> +                                         NULL);
>>>>>    }
>>>>>    
>>>>>    /* update the TLB so that writes in physical page 'phys_addr' are no longer
>>>>> diff --git a/include/system/physmem.h b/include/system/physmem.h
>>>>> index 879f6eae38..8eeace9d1f 100644
>>>>> --- a/include/system/physmem.h
>>>>> +++ b/include/system/physmem.h
>>>>> @@ -39,9 +39,10 @@ uint64_t physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
>>>>>    
>>>>>    void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
>>>>>    
>>>>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>> -                                          ram_addr_t length,
>>>>> -                                          unsigned client);
>>>>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>> +                                              ram_addr_t length,
>>>>> +                                              unsigned client,
>>>>> +                                              unsigned long *dest);
>>>>>    
>>>>>    DirtyBitmapSnapshot *
>>>>>    physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
>>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>>> index 29f016cb25..a03c9874a2 100644
>>>>> --- a/migration/ram.c
>>>>> +++ b/migration/ram.c
>>>>> @@ -942,7 +942,6 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>>>>                                                      ram_addr_t start,
>>>>>                                                      ram_addr_t length)
>>>>>    {
>>>>> -    ram_addr_t addr;
>>>>>        unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>>>>>        uint64_t num_dirty = 0;
>>>>>        unsigned long *dest = rb->bmap;
>>>>> @@ -996,17 +995,11 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>>>>>        } else {
>>>>>            ram_addr_t offset = rb->offset;
>>>>>    
>>>>> -        for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
>>>>> -            if (physical_memory_test_and_clear_dirty(
>>>>> -                        start + addr + offset,
>>>>> -                        TARGET_PAGE_SIZE,
>>>>> -                        DIRTY_MEMORY_MIGRATION)) {
>>>>> -                long k = (start + addr) >> TARGET_PAGE_BITS;
>>>>> -                if (!test_and_set_bit(k, dest)) {
>>>>> -                    num_dirty++;
>>>>> -                }
>>>>> -            }
>>>>> -        }
>>>>> +        num_dirty = physical_memory_test_and_clear_dirty(
>>>>> +                        start + offset,
>>>>> +                        length,
>>>>> +                        DIRTY_MEMORY_MIGRATION,
>>>>> +                        dest);
>>>>>        }
>>>>>    
>>>>>        return num_dirty;
>>>>> diff --git a/system/memory.c b/system/memory.c
>>>>> index 8b84661ae3..666364392d 100644
>>>>> --- a/system/memory.c
>>>>> +++ b/system/memory.c
>>>>> @@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
>>>>>    {
>>>>>        assert(mr->ram_block);
>>>>>        physical_memory_test_and_clear_dirty(
>>>>> -        memory_region_get_ram_addr(mr) + addr, size, client);
>>>>> +        memory_region_get_ram_addr(mr) + addr, size, client, NULL);
>>>>>    }
>>>>>    
>>>>>    int memory_region_get_fd(MemoryRegion *mr)
>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>> index c9869e4049..f8b660dafe 100644
>>>>> --- a/system/physmem.c
>>>>> +++ b/system/physmem.c
>>>>> @@ -1089,19 +1089,31 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
>>>>>        }
>>>>>    }
>>>>>    
>>>>> -/* Note: start and end must be within the same ram block.  */
>>>>> -bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>> +/*
>>>>> + * Note: start and end must be within the same ram block.
>>>>> + *
>>>>> + * @dest usage:
>>>> I'm not sure if it's just me, but I find this "dest" term quite
>>>> confusing. "bmap" might be more straight-forward.
>>>>
>>>>> + * - When @dest is provided, set bits for newly discovered dirty pages
>>>>> + *   only if the bit wasn't already set in dest, and count those pages
>>>>> + *   in num_dirty.
>>>> Am I misreading the code? I don't see this "set ... only if the bit
>>>> wasn't already set" part. What I see is: "set bits, but only count those
>>>> pages if the bit wasn't already set".
>>> Agrees on both points.. one more thing to mention below.
>> This will be fixed in next version.
>>>>> + * - When @dest is NULL, count all dirty pages in the range
>>>>> + *
>>>>> + * @return:
>>>>> + * - Number of dirty guest pages found within [start, start + length).
>>>>> + */
>>>>> +uint64_t physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>>                                                  ram_addr_t length,
>>>>> -                                              unsigned client)
>>>>> +                                              unsigned client,
>>>>> +                                              unsigned long *dest)
>>>>>    {
>>>>>        DirtyMemoryBlocks *blocks;
>>>>>        unsigned long end, page, start_page;
>>>>> -    bool dirty = false;
>>>>> +    uint64_t num_dirty = 0;
>>>>>        RAMBlock *ramblock;
>>>>>        uint64_t mr_offset, mr_size;
>>>>>    
>>>>>        if (length == 0) {
>>>>> -        return false;
>>>>> +        return 0;
>>>>>        }
>>>>>    
>>>>>        end = TARGET_PAGE_ALIGN(start + length) >> TARGET_PAGE_BITS;
>>>>> @@ -1118,12 +1130,19 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>>            while (page < end) {
>>>>>                unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
>>>>>                unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
>>>>> -            unsigned long num = MIN(end - page,
>>>>> -                                    DIRTY_MEMORY_BLOCK_SIZE - offset);
>>>>>    
>>>>> -            dirty |= bitmap_test_and_clear_atomic(blocks->blocks[idx],
>>>>> -                                                  offset, num);
>>>>> -            page += num;
>>>>> +            if (bitmap_test_and_clear_atomic(blocks->blocks[idx], offset, 1)) {
>>>>> +                if (dest) {
>>>>> +                    unsigned long k = page - (ramblock->offset >> TARGET_PAGE_BITS);
>>>>> +                    if (!test_and_set_bit(k, dest)) {
>>>>> +                        num_dirty++;
>>>>> +                    }
>>>>> +                } else {
>>>>> +                    num_dirty++;
>>>>> +                }
>>>>> +            }
>>>>> +
>>>>> +            page++;
>>> Sorry I could have mentioned this in the previous version: IMHO it'll still
>>> be nice to keep the one atomic operations for !dest/!bmap case over "num".
>>> There's no reason we need to introduce even any slightest regression in
>>> those paths.
>>>
>>> Thanks,
>> bitmap_test_and_clear_atomic returns bool, not the number of bits cleared.
>> So for !bmap case we can only return whether there is dirty, not the number
>> of dirty, and this might be a bit confusing.
> Ah, right..
>
> Looks like we only have two real users of this API that clears more than
> one target page (tcx_reset, qemu_ram_resize), I assume they're not perf
> critical as of now.  When it comes, it should be easy to optimize.
>
> Unless others have concerns, IMHO we can go with the current one until
> later.  Feel free to ignore this comment.
>
> Thanks,
>
>>>>>            }
>>>>>    
>>>>>            mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
>>>>> @@ -1131,18 +1150,18 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>>>>>            memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
>>>>>        }
>>>>>    
>>>>> -    if (dirty) {
>>>>> +    if (num_dirty) {
>>>>>            physical_memory_dirty_bits_cleared(start, length);
>>>>>        }
>>>>>    
>>>>> -    return dirty;
>>>>> +    return num_dirty;
>>>>>    }
>>>>>    
>>>>>    static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
>>>>>    {
>>>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
>>>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
>>>>> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
>>>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, NULL);
>>>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, NULL);
>>>>> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, NULL);
>>>>>    }
>>>>>    
>>>>>    DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-17 13:43         ` Chuang Xu
@ 2025-12-17 14:59           ` Peter Xu
  2025-12-18  9:20             ` Chuang Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2025-12-17 14:59 UTC (permalink / raw)
  To: Chuang Xu
  Cc: qemu-devel, Fabiano Rosas, mst, sgarzare, richard.henderson,
	pbonzini, david, philmd

On Wed, Dec 17, 2025 at 09:43:24PM +0800, Chuang Xu wrote:
> On 17/12/2025 21:21, Peter Xu wrote:
> > On Wed, Dec 17, 2025 at 02:46:58PM +0800, Chuang Xu wrote:
> >> On 17/12/2025 00:26, Peter Xu wrote:
> >>> On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
> >>>> "Chuang Xu" <xuchuangxclwt@bytedance.com> writes:
> >>>>
> >>>>> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
> >>>>>
> >>>>> In our long-term experience in Bytedance, we've found that under
> >>>>> the same load, live migration of larger VMs with more devices is
> >>>>> often more difficult to converge (requiring a larger downtime limit).
> >>>>>
> >>>>> Through some testing and calculations, we conclude that bitmap sync time
> >>>>> affects the calculation of live migration bandwidth.
> >>> Side note:
> >>>
> >>> I forgot to mention when replying to the old versions, but we introduced
> >>> avail-switchover-bandwidth to partially remedy this problem when we hit it
> >>> before - which may or may not be exactly the same reason here on unaligned
> >>> syncs as we didn't further investigate (we have VFIO-PCI devices when
> >>> testing), but the whole logic should be similar that bw was calculated too
> >>> small.
> >> In bytedance, we also migrate vms with vfio devices, which also suffer from
> >> the issue of long vfio bitmap sync time for large vm.
> >>> So even if with this patch optimizing sync, bw is always not as accurate.
> >>> I wonder if we can still fix it somehow, e.g. I wonder if 100ms is too
> >>> short a period to take samples, or at least we should be able to remember
> >>> more samples so the reported bw (even if we keep sampling per 100ms) will
> >>> cover longer period.
> >>>
> >>> Feel free to share your thoughts if you have any.
> >>>
> >> FYI:
> >> Initially, when I encountered the problem of large vm migration hard to
> >> converge,
> >> I tried subtracting the bitmap sync time from the bandwidth calculation,
> >> which alleviated the problem somewhat. However, through formula calculation,
> >> I found that this did not completely solve the problem. Therefore, I
> > If you ruled out sync time, why the bw is still not accurate?  Have you
> > investigated that?
> >
> > Maybe there's something else happening besides the sync period you
> > excluded.
> 
> Referring to the formula I wrote in the cover, after subtracting sync time,
> 
> we get the prerequisite that R=B. Substituting this condition into the
> 
> subsequent formula derivation(B * t = D * (x + t) and R * y > D * (x + t)),
> 
> we will eventually get y > D * x / (B - D).
> 
> This means that even if our bandwidth calculations are correct,
> 
> the sync time can still affect our judgment of downtime conditions.

Right, it will, because any time used for sync has the vCPUs running, so
that will contributes to the total dirtied pages, hence partly increase D,
as you pointed out.

But my point is, if you _really_ have R=B all right, you should e.g. on a
10Gbps NIC seeing R~=10Gbps.  If R is not wire speed, it means the R is not
really correctly measured..

I think it's likely impossible to measure the correct R so that it'll equal
to B, however IMHO we can still think about something that makes the R
getting much closer to B, then when normally y is a constant (default
300ms, for example) it'll start to converge where it used to not be able to.

E.g. QEMU can currently report R as low as 10Mbps even if on 10Gbps, IMHO
it'll be much better and start solving a lot of such problems if it can
start to report at least a few Gbps based on all kinds of methods
(e.g. excluding sync, as you experimented), then even if it's not reporting
10Gbps it'll help.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-16  8:00 [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls Chuang Xu
  2025-12-16 13:25 ` Fabiano Rosas
@ 2025-12-17 17:01 ` Peter Xu
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Xu @ 2025-12-17 17:01 UTC (permalink / raw)
  To: Chuang Xu
  Cc: qemu-devel, mst, sgarzare, richard.henderson, pbonzini, david,
	philmd, farosas

On Tue, Dec 16, 2025 at 04:00:01PM +0800, Chuang Xu wrote:
> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>

Another trivial thing to mention: Chuang, while preparing for repost, you
can either remove this line (you already have From: in your mail header, so
this isn't needed), or change it to your real First+Last name.  Otherwise
git log will collect this instead of your real name.  Just FYI.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-17 14:59           ` Peter Xu
@ 2025-12-18  9:20             ` Chuang Xu
  2025-12-18 15:32               ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Chuang Xu @ 2025-12-18  9:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, mst, sgarzare, richard.henderson,
	pbonzini, david, philmd

On 17/12/2025 22:59, Peter Xu wrote:
> Right, it will, because any time used for sync has the vCPUs running, so
> that will contributes to the total dirtied pages, hence partly increase D,
> as you pointed out.
>
> But my point is, if you _really_ have R=B all right, you should e.g. on a
> 10Gbps NIC seeing R~=10Gbps.  If R is not wire speed, it means the R is not
> really correctly measured..

In my experience, the bandwidth of live migration usually doesn't reach
the nic's bandwidth limit (my test environment's nic bandwidth limit is 200Gbps).
This could be due to various reasons: for example, the live migration main thread's
ability to search for dirty pages may have reached a bottleneck;
the nic's interrupt binding range might limit the softirq's processing capacity;
there might be too few multifd threads; or there might be overhead in synchronizing
between the live migration main thread and the multifd thread.

>
> I think it's likely impossible to measure the correct R so that it'll equal
> to B, however IMHO we can still think about something that makes the R
> getting much closer to B, then when normally y is a constant (default
> 300ms, for example) it'll start to converge where it used to not be able to.

Yes, there are always various factors that can cause measurement errors.
We can only try to make the calculated value as close as possible to the actual value.

> E.g. QEMU can currently report R as low as 10Mbps even if on 10Gbps, IMHO
> it'll be much better and start solving a lot of such problems if it can
> start to report at least a few Gbps based on all kinds of methods
> (e.g. excluding sync, as you experimented), then even if it's not reporting
> 10Gbps it'll help.
>
After I applied these optimizations, typically the bandwidth statistics
from QEMU and the real-time nic bandwidth monitored by atop are close.

Those extremely low bandwidth(but consistent with atop monitoring) is usually
caused by zero pages or dirty pages with extremely high compression rates.
In these cases, QEMU uses very little nic bandwidth to transmit a large number
of dirty pages, but the bandwidth is only calculated based on the actual
amount of data transmitted.

If we want to use the actual number of dirty pages transmitted to calculate
bandwidth, we face another risk: if the dirty pages transmitted before the
downtime have a high compression ratio, and the dirty pages to be transmitted
after the downtime have a low compression ratio, then the downtime will far
exceed expectations.

This may have strayed a bit, but just providing some potentially useful information
from my perspective.

Thanks!


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
  2025-12-18  9:20             ` Chuang Xu
@ 2025-12-18 15:32               ` Peter Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2025-12-18 15:32 UTC (permalink / raw)
  To: Chuang Xu
  Cc: qemu-devel, Fabiano Rosas, mst, sgarzare, richard.henderson,
	pbonzini, david, philmd

On Thu, Dec 18, 2025 at 05:20:19PM +0800, Chuang Xu wrote:
> On 17/12/2025 22:59, Peter Xu wrote:
> > Right, it will, because any time used for sync has the vCPUs running, so
> > that will contributes to the total dirtied pages, hence partly increase D,
> > as you pointed out.
> >
> > But my point is, if you _really_ have R=B all right, you should e.g. on a
> > 10Gbps NIC seeing R~=10Gbps.  If R is not wire speed, it means the R is not
> > really correctly measured..
> 
> In my experience, the bandwidth of live migration usually doesn't reach
> the nic's bandwidth limit (my test environment's nic bandwidth limit is 200Gbps).
> This could be due to various reasons: for example, the live migration main thread's
> ability to search for dirty pages may have reached a bottleneck;
> the nic's interrupt binding range might limit the softirq's processing capacity;
> there might be too few multifd threads; or there might be overhead in synchronizing
> between the live migration main thread and the multifd thread.

Exactly, especially when you have 200Gbps NICs.

I hope I have some of those for testing too!  I don't, so I can't provide
really useful input..  My vague memory (I got some chance using a 100Gbps
NIC, if I recall correctly) is that main thread will bottleneck already
there, where I should have (maybe?) 8 multifd threads.

I just never knew whether we need to scale it out yet so far, normally
100G/200G setup only happens with direct attached, not a major use case for
cluster setup?  Or maybe I am outdated?

If that'll be a major use case at some point, and if main thread is the
bottleneck distributing things, then we need to scale it out.  I think it's
doable.

> 
> >
> > I think it's likely impossible to measure the correct R so that it'll equal
> > to B, however IMHO we can still think about something that makes the R
> > getting much closer to B, then when normally y is a constant (default
> > 300ms, for example) it'll start to converge where it used to not be able to.
> 
> Yes, there are always various factors that can cause measurement errors.
> We can only try to make the calculated value as close as possible to the actual value.
> 
> > E.g. QEMU can currently report R as low as 10Mbps even if on 10Gbps, IMHO
> > it'll be much better and start solving a lot of such problems if it can
> > start to report at least a few Gbps based on all kinds of methods
> > (e.g. excluding sync, as you experimented), then even if it's not reporting
> > 10Gbps it'll help.
> >
> After I applied these optimizations, typically the bandwidth statistics
> from QEMU and the real-time nic bandwidth monitored by atop are close.
> 
> Those extremely low bandwidth(but consistent with atop monitoring) is usually
> caused by zero pages or dirty pages with extremely high compression rates.
> In these cases, QEMU uses very little nic bandwidth to transmit a large number
> of dirty pages, but the bandwidth is only calculated based on the actual
> amount of data transmitted.

Yes.  That's a major issue in QEMU, zero page / compressed page / ... not
only affects how QEMU "measures" the mbps, but also affects how QEMU
decides when to converge: here I'm not talking about the bw difference
causing "bw * downtime_limit" [A] too small.  I'm talking about the other
side of equation where we used [A] to compare with "remain_dirty_pages *
psize" [B].  In reality, [B] isn't accurate either when zero page /
compressed page / ... is used..

Maybe.. the switchover decision shouldn't be MBps as unit, but "number of
pages".  It'll remove most of those effects at least, but that needs some
more considerations..

> 
> If we want to use the actual number of dirty pages transmitted to calculate
> bandwidth, we face another risk: if the dirty pages transmitted before the
> downtime have a high compression ratio, and the dirty pages to be transmitted
> after the downtime have a low compression ratio, then the downtime will far
> exceed expectations.

... like what you mentioned here will also be an issue if we switch to use
n_pages to do the math. :)

> 
> This may have strayed a bit, but just providing some potentially useful information
> from my perspective.

Not really; patch alone is good, I appreciate the discussions.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-12-18 15:33 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-16  8:00 [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls Chuang Xu
2025-12-16 13:25 ` Fabiano Rosas
2025-12-16 16:26   ` Peter Xu
2025-12-17  6:46     ` Chuang Xu
2025-12-17 13:21       ` Peter Xu
2025-12-17 13:43         ` Chuang Xu
2025-12-17 14:59           ` Peter Xu
2025-12-18  9:20             ` Chuang Xu
2025-12-18 15:32               ` Peter Xu
2025-12-17 17:01 ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.