[RFC v1 0/2] migration: reduce bitmap sync time and make dirty pages converge much more easily

From: Chuang Xu @ 2025-12-08 12:09 UTC
  To: qemu-devel
  Cc: mst, sgarzare, richard.henderson, pbonzini, peterx, david, philmd,
	farosas

In our long-term experience at Bytedance, we've found that under the same load,
live migration of larger VMs with more devices is often harder to converge
(it requires a larger downtime limit).

We've observed that the live migration bandwidth estimate for large, multi-device
VMs is severely distorted, a phenomenon likely related to the problem described at
https://wiki.qemu.org/ToDo/LiveMigration#Optimize_migration_bandwidth_calculation.

Through testing and calculation, we conclude that the bitmap sync time skews
QEMU's live migration bandwidth estimate.

Below we use some formulaic reasoning to illustrate the relationship between the
bitmap sync time and the downtime limit required to meet the stop condition.

Assume the actual live migration bandwidth is B, the dirty page rate is D,
the bitmap sync time is x (ms), the transfer time per iteration is t (ms), and the
downtime limit is y (ms).

To simplify the calculation, we assume that none of the dirty pages are zero
pages and only consider the case B > D.

When x + t > 100 ms, the bandwidth calculated by QEMU is R = B * t / (x + t).
When x + t < 100 ms, the bandwidth calculated by QEMU is R = B * (100 - x) / 100.
(The 100 ms window is QEMU's bandwidth sampling interval, BUFFER_DELAY.)
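As a sanity check, here is a minimal sketch of this measurement model in plain C
(our own helper, not QEMU code; the 100 ms constant mirrors QEMU's BUFFER_DELAY
sampling interval):

#include <stdio.h>

/*
 * Model of the bandwidth R that QEMU observes: B is the actual
 * bandwidth, x the bitmap sync time (ms, during which no data flows),
 * t the transfer time per iteration (ms).
 */
static double measured_bw(double B, double x, double t)
{
    if (x + t > 100.0) {
        /* the iteration spans the sampling window: R = B * t / (x + t) */
        return B * t / (x + t);
    }
    /* the iteration fits in one window: x of the 100 ms are idle */
    return B * (100.0 - x) / 100.0;
}

int main(void)
{
    /* e.g. B = 15 GBps, x = 250 ms, t = 250 ms -> R = 7.50 GBps */
    printf("R = %.2f GBps\n", measured_bw(15.0, 250.0, 250.0));
    return 0;
}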

At the critical convergence state, the data transferred in one iteration equals
the data dirtied during that iteration (sync plus transfer), so we have:
  (1) B * t = D * (x + t)
  (2) t = D * x / (B - D)
For the stop condition to be met, we have two cases:
  When:
  (3) x + t > 100
  (4) x + D * x / (B - D) > 100
  (5) x > 100 - 100 * D / B
  Then:
  (6) R * y > D * (x + t)
  (7) B * t * y / (x + t) > D * (x + t)
  (8) (B * (D * x / (B - D)) * y) / (x + D * x / (B - D)) > D * (x + D * x / (B - D))
      (since x + D * x / (B - D) = B * x / (B - D), the left side reduces to D * y)
  (9) D * y > D * (x + D * x / (B - D))
  (10) y > x + D * x / (B - D)
  (11) (B - D) * y > B * x
  (12) y > B * x / (B - D)
  
  When:
  (13) x + t < 100
  (14) x + D * x / (B - D) < 100
  (15) x < 100 - 100 * D / B
  Then:
  (16) R * y > D * (x + t)
  (17) B * (100 - x) * y / 100 > D * (x + t)
  (18) B * (100 - x) * y / 100 > D * (x + D * x / (B - D))
  (19) y > 100 * D * x / ((B - D) * (100 - x))
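To make the two bounds easy to replay, here is a small self-contained C sketch
(our own helper, not part of this series) that evaluates formulas (2), (12) and
(19) for the sample values used below:

#include <stdio.h>

/*
 * Minimum downtime limit y (ms) at the critical convergence state.
 * B and D share a unit (e.g. GBps), x is in ms; assumes B > D.
 */
static double min_downtime(double B, double D, double x)
{
    double t = D * x / (B - D);                      /* formula (2) */

    if (x + t > 100.0) {
        return B * x / (B - D);                      /* formula (12) */
    }
    return 100.0 * D * x / ((B - D) * (100.0 - x));  /* formula (19) */
}

int main(void)
{
    /* the two sync times discussed below: 250 ms and 10 ms */
    printf("x = 250 ms: y > %.1f ms\n", min_downtime(15.0, 7.5, 250.0));
    printf("x = 10 ms:  y > %.1f ms\n", min_downtime(15.0, 7.5, 10.0));
    return 0;
}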

With these formulas in hand, we can compare some concrete numbers.

For a 64C256G VM with 8 vhost-user-net devices (32 queues per NIC) and 16
vhost-user-blk devices (4 queues per disk), the sync time is as high as 250 ms,
while after applying this series it is only 10 ms.

*First case: assume the maximum bandwidth can reach 15 GBps and the dirty page rate is 7.5 GBps.

If x = 250 ms, at the critical convergence state,
formula (2) gives t = D * x / (B - D) = 250 ms;
because x + t = 500 ms > 100 ms,
formula (12) gives y > B * x / (B - D) = 500 ms.

If x = 10 ms, at the critical convergence state,
formula (2) gives t = D * x / (B - D) = 10 ms;
because x + t = 20 ms < 100 ms,
formula (19) gives y > 100 * D * x / ((B - D) * (100 - x)) = 11.1 ms.

We can see that after optimization, under the same bandwidth and dirty page rate,
the downtime limit required for dirty page convergence drops significantly
(from 500 ms to 11.1 ms).

*Second case: assume the maximum bandwidth can reach 15 GBps and the downtime limit is set to 300 ms.
If x = 250 ms, then x + t > 250 ms > 100 ms,
so rearranging formula (12) gives D < B * (y - x) / y = 15 * (300 - 250) / 300 = 2.5 GBps.

If x = 10 ms,
when x + t > 100 ms,
rearranging formula (12) gives D < B * (y - x) / y = 15 * (300 - 10) / 300 = 14.5 GBps;
when x + t < 100 ms,
rearranging formula (19) gives D < y * B * (100 - x) / (100 * x + y * (100 - x)) = 14.46 GBps.

We can see that after optimization, under the same bandwidth and downtime limit,
the dirty page rate that can still converge improves significantly (from 2.5 GBps
to about 14.5 GBps).
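For completeness, the same inversion in C (again our own sketch, with names of
our choosing; which bound applies still has to be checked against x + t as above):

#include <stdio.h>

/* Max convergent dirty rate from formula (12), x + t > 100 ms regime. */
static double max_rate_slow(double B, double y, double x)
{
    return B * (y - x) / y;
}

/* Max convergent dirty rate from formula (19), x + t < 100 ms regime. */
static double max_rate_fast(double B, double y, double x)
{
    return y * B * (100.0 - x) / (100.0 * x + y * (100.0 - x));
}

int main(void)
{
    printf("x = 250 ms: D < %.2f GBps\n", max_rate_slow(15.0, 300.0, 250.0));
    printf("x = 10 ms:  D < %.2f or %.2f GBps\n",
           max_rate_slow(15.0, 300.0, 10.0),
           max_rate_fast(15.0, 300.0, 10.0));
    return 0;
}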

Through the above derivation, we have shown that reducing the bitmap sync time
can significantly improve dirty page convergence.

This series only optimizes the bitmap sync time in some scenarios.
There may still be many scenarios where the bitmap sync time hurts dirty page
convergence, and the same approach can be tried there as well.


