Subject: [PATCH v2 0/1] migration: reduce bitmap sync time and make dirty pages converge much more easily
From: Chuang Xu
Date: 2025-12-15 14:06 UTC
To: qemu-devel
Cc: mst, sgarzare, richard.henderson, pbonzini, peterx, david, philmd, farosas

In this version:

- drop duplicate vhost_log_sync optimization
- refactor physical_memory_test_and_clear_dirty
- provide a more detailed per-part breakdown of the bitmap sync time in this cover letter


In our long-term experience at Bytedance, we've found that under the same load,
live migration of larger VMs with more devices is often harder to converge
(i.e. it requires a larger downtime limit).

We've observed that the live migration bandwidth QEMU estimates for large,
multi-device VMs is severely distorted, a phenomenon likely similar to the problem
described at https://wiki.qemu.org/ToDo/LiveMigration#Optimize_migration_bandwidth_calculation.

Through testing and calculation, we conclude that the bitmap sync time distorts
the calculation of the live migration bandwidth.

Below, a formula-based derivation illustrates the relationship between the downtime
limit required to satisfy the stop condition and the bitmap sync time.

Assume the actual live migration bandwidth is B, the dirty page rate is D,
the bitmap sync time is x (ms), the transfer time per iteration is t (ms), and the
downtime limit is y (ms).

To simplify the calculation, we assume none of the dirty pages are zero pages and
only consider the case B > D.

When x + t > 100ms, the bandwidth calculated by QEMU is R = B * t / (x + t).
When x + t < 100ms, the bandwidth calculated by QEMU is R = B * (100 - x) / 100.
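
To make this concrete, the two formulas can be folded into a small illustrative
helper (a sketch with made-up names, not actual QEMU code; the 100ms threshold is
simply the sampling window the formulas above already assume):

/*
 * Illustrative model of the two formulas above; not actual QEMU code.
 * B: real transfer bandwidth, x: bitmap sync time (ms),
 * t: transfer time per iteration (ms).
 */
static double estimated_bandwidth(double B, double x, double t)
{
    if (x + t > 100.0) {
        /* the whole iteration (sync + transfer) is measured,
         * but only the transfer part actually moves data */
        return B * t / (x + t);
    }
    /* only (100 - x) ms of the 100ms window is spent transferring */
    return B * (100.0 - x) / 100.0;
}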

At the critical convergence state (the data transferred per iteration exactly
equals the data newly dirtied during that iteration), we have:
  (1) B * t = D * (x + t)
  (2) t = D * x / (B - D)
For the stop condition to then be met, there are two cases:
  When:
  (3) x + t > 100
  (4) x + D * x / (B - D) > 100
  (5) x > 100 - 100 * D / B
  Then:
  (6) R * y > D * (x + t)
  (7) B * t * y / (x + t) > D * (x + t)
  (8) (B * (D * x / (B - D)) * y) / (x + D * x / (B - D)) > D * (x + D * x / (B - D))
  (9) D * y > D * (x + D * x / (B - D))
  (10) y > x + D * x / (B - D)
  (11) (B - D) * y > B * x
  (12) y > B * x / (B - D)

  When:
  (13) x + t < 100
  (14) x + D * x / (B - D) < 100
  (15) x < 100 - 100 * D / B
  Then:
  (16) R * y > D * (x + t)
  (17) B * (100 - x) * y / 100 > D * (x + t)
  (18) B * (100 - x) * y / 100 > D * (x + D * x / (B - D))
  (19) y > 100 * D * x / ((B - D) * (100 - x))
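
Putting the two bounds together, the minimum downtime limit can be computed with a
small helper (a sketch following formulas (2), (12) and (19); B and D share a unit,
x is in ms, the result is in ms, and B > D is assumed):

/* Minimum downtime limit needed at the critical convergence state.
 * Sketch only: B and D share a unit (e.g. GBps), x is in ms,
 * the return value is in ms, and B > D is assumed. */
static double min_downtime_ms(double B, double D, double x)
{
    double t = D * x / (B - D);                       /* formula (2) */

    if (x + t > 100.0) {
        return B * x / (B - D);                       /* formula (12) */
    }
    return 100.0 * D * x / ((B - D) * (100.0 - x));   /* formula (19) */
}

Plugging in the numbers from the first case below reproduces the 219ms and ~27ms
bounds.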

After deriving the formula, we can use some data for comparison.

For a 64C256G VM with 8 vhost-user-net devices (32 queues per NIC) and 16 vhost-user-blk devices (4 queues per disk),
the sync time is as high as *73ms* (tested with a 10GBps dirty page rate; the sync time increases as the dirty page rate increases).
Here is the breakdown of the sync time:

- sync from kvm to ram_list: 2.5ms
- vhost_log_sync: 3ms
- sync aligned memory from ram_list to RAMBlock: 5ms
- sync misaligned memory from ram_list to RAMBlock: 61ms

After applying this patch, syncing misaligned memory from ram_list to RAMBlock takes only about 1ms,
and the total sync time drops to *12ms*.
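
The idea named in the patch subject, merging fragmented clear_dirty ioctls, can be
sketched roughly as follows. This is only an illustration of the merging concept,
not the actual patch; clear_dirty() is a hypothetical placeholder for the real
clear-dirty-log call (one ioctl per call):

/* Rough illustration of the merging idea only, not the actual patch.
 * clear_dirty() is a hypothetical placeholder for the real
 * clear-dirty-log path (one ioctl per call). */
extern void clear_dirty(unsigned long start, unsigned long len);

static void clear_dirty_merged(const unsigned long *start,
                               const unsigned long *len, int n)
{
    unsigned long cur_start = 0, cur_len = 0;

    for (int i = 0; i < n; i++) {
        if (cur_len && start[i] == cur_start + cur_len) {
            cur_len += len[i];               /* contiguous: extend pending range */
            continue;
        }
        if (cur_len) {
            clear_dirty(cur_start, cur_len); /* flush the pending merged range */
        }
        cur_start = start[i];
        cur_len = len[i];
    }
    if (cur_len) {
        clear_dirty(cur_start, cur_len);
    }
}

The intent is that one call per merged range replaces one ioctl per fragment, which
is where the 61ms -> ~1ms reduction above comes from.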

*First case: assume our maximum bandwidth can reach 15GBps and the dirty page rate is 10GBps.

If x = 73ms, at the critical convergence state,
formula (2) gives t = D * x / (B - D) = 146ms;
because x + t = 219ms > 100ms,
formula (12) gives y > B * x / (B - D) = 219ms.

If x = 12ms, at the critical convergence state,
formula (2) gives t = D * x / (B - D) = 24ms;
because x + t = 36ms < 100ms,
formula (19) gives y > 100 * D * x / ((B - D) * (100 - x)) = 27.3ms.

We can see that after optimization, under the same bandwidth and dirty rate scenario,
the downtime limit required for dirty page convergence is significantly reduced.

*Second case: assume our maximum bandwidth can reach 15GBps and the downtime limit is set to 150ms.

If x = 73ms:
when x + t > 100ms,
formula (12) gives D < B * (y - x) / y = 15 * (150 - 73) / 150 = 7.7GBps;
when x + t < 100ms,
formula (19) gives D < 5.35GBps.

If x = 12ms:
when x + t > 100ms,
formula (12) gives D < B * (y - x) / y = 15 * (150 - 12) / 150 = 13.8GBps;
when x + t < 100ms,
formula (19) gives D < 13.75GBps.
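
The dirty-rate bounds above follow from the same formulas solved for D (again only
a sketch; B in GBps, x and y in ms):

/* Maximum convergent dirty page rate; formulas (12) and (19) solved for D.
 * Sketch only: B in GBps, x and y in ms. */
static double max_dirty_rate_long_iter(double B, double x, double y)
{
    /* from (12): y > B * x / (B - D)  =>  D < B * (y - x) / y */
    return B * (y - x) / y;
}

static double max_dirty_rate_short_iter(double B, double x, double y)
{
    /* from (19): y > 100 * D * x / ((B - D) * (100 - x))
     *        =>  D < y * B * (100 - x) / (y * (100 - x) + 100 * x) */
    return y * B * (100.0 - x) / (y * (100.0 - x) + 100.0 * x);
}

With B = 15 and y = 150 these reproduce the 7.7/5.35GBps bounds for x = 73 and the
13.8/13.75GBps bounds for x = 12.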

We can see that after optimization, under the same bandwidth and downtime limit scenario,
the convergent dirty page rate is significantly improved.

Through the above formula derivation, we have proven that reducing bitmap sync time
can significantly improve dirty page convergence capability. 

This patch only optimizes the bitmap sync time for some scenarios.
There may still be many other scenarios where bitmap sync time hurts dirty page
convergence, and those can be optimized with the same approach.

