[Qemu-devel] [RFC] optimization for qcow2 cache get/put

* [Qemu-devel] [RFC] optimization for qcow2 cache get/put
@ 2015-01-26 13:20 Zhang Haoyu
  2015-01-26 14:11 ` Max Reitz
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Zhang Haoyu @ 2015-01-26 13:20 UTC
  To: qemu-devel; +Cc: Paolo Bonzini, Fam Zheng, Stefan Hajnoczi

Hi, all

With a very large qcow2 image, e.g., 2TB,
taking a snapshot causes a long disruption,
which is caused by cache updates and IO wait.
The perf top data is shown below:
   PerfTop:    2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 34294)
------------------------------------------------------------------------------------------------------------------------

    33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get            
    27.59%  qemu-system-x86_64  [.] qcow2_cache_put               
    15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty  
     5.49%  qemu-system-x86_64  [.] update_refcount               
     3.02%  libpthread-2.13.so  [.] pthread_getspecific           
     2.26%  qemu-system-x86_64  [.] get_refcount                  
     1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state    
     1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
     1.20%  qemu-system-x86_64  [.] qemu_coroutine_self           
     1.16%  libz.so.1.2.7       [.] 0x0000000000003018            
     0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount 
     0.91%  qemu-system-x86_64  [.] qcow2_cache_get               
     0.76%  libc-2.13.so        [.] 0x0000000000134e49            
     0.73%  qemu-system-x86_64  [.] bdrv_debug_event              
     0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt       
     0.12%  [kernel]            [k] _raw_spin_unlock_irqrestore   
     0.10%  qemu-system-x86_64  [.] vga_draw_line24_32            
     0.09%  [vdso]              [.] 0x000000000000060c            
     0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap  
     0.08%  [kernel]            [k] do_blockdev_direct_IO  

If the cache table size is expanded, IO decreases,
but the computation time grows,
so it is worthwhile to optimize the qcow2 cache get and put algorithms.
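
To illustrate why lookup time grows with the table size, here is a minimal
sketch that assumes the lookup is a linear scan over c->entries. The types
CacheEntry and Cache and the function cache_find_linear are hypothetical
simplifications for illustration, not QEMU's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the qcow2 cache structures. */
typedef struct {
    uint64_t offset;   /* image offset of the cached table */
    int ref;           /* reference count held by callers */
} CacheEntry;

typedef struct {
    CacheEntry *entries;
    int size;          /* number of entries in the table */
} Cache;

/*
 * Linear scan: returns the index of the matching entry, or -1 on a miss.
 * *steps reports how many entries were compared, which is what grows
 * when the table is enlarged.
 */
static int cache_find_linear(const Cache *c, uint64_t offset, int *steps)
{
    assert(c != NULL && steps != NULL);
    *steps = 0;
    for (int i = 0; i < c->size; i++) {
        (*steps)++;
        if (c->entries[i].offset == offset) {
            return i;
        }
    }
    return -1;
}
```

In this model a miss always compares all c->size entries, so doubling the
table doubles the worst-case work per lookup.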

My proposal:
get:
use ((offset >> cluster_bits) % c->size) to locate the cache entry.
A rough implementation:
index = (offset >> cluster_bits) % c->size;
if (c->entries[index].offset == offset) {
    goto found;
}

replace:
c->entries[(offset >> cluster_bits) % c->size].offset = offset;
...
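
The get and replace steps above can be sketched as a small direct-mapped
table. This is only an illustration under stated assumptions: DmEntry,
DmCache, dm_index and dm_get are hypothetical names, CACHE_SIZE and
CLUSTER_BITS are arbitrary example values, and dirty write-back and
reference counting are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SIZE   8    /* hypothetical table size */
#define CLUSTER_BITS 16   /* assumes 64 KiB clusters */

typedef struct {
    uint64_t offset;      /* image offset currently held in this slot */
    bool valid;           /* false until the slot is first filled */
} DmEntry;

typedef struct {
    DmEntry entries[CACHE_SIZE];
    int size;
} DmCache;

/* Map an image offset to exactly one slot. */
static int dm_index(const DmCache *c, uint64_t offset)
{
    assert(c->size > 0);
    return (int)((offset >> CLUSTER_BITS) % (uint64_t)c->size);
}

/*
 * Returns true on a hit. On a miss the slot is overwritten, which is the
 * "replace" step. A real cache would first have to flush a dirty victim
 * and must not evict an entry that is still referenced.
 */
static bool dm_get(DmCache *c, uint64_t offset)
{
    DmEntry *e = &c->entries[dm_index(c, offset)];

    if (e->valid && e->offset == offset) {
        return true;
    }
    e->offset = offset;
    e->valid = true;
    return false;
}
```

Lookup cost is constant regardless of the table size, but two offsets whose
cluster numbers differ by a multiple of c->size compete for the same slot.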

put:
use a 64-entry table to cache the most recently fetched
c->entries, i.e., a cache for the cache.
During put, search the 64-entry table first and,
if the entry is not found there, fall back to searching c->entries.
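
A minimal sketch of the put idea, assuming put identifies an entry by the
table pointer that get handed out. PutEntry, PutCache, cache_note_get and
cache_put are hypothetical names, and the ring buffer is just one possible
way to keep the 64 most recent entries.

```c
#include <assert.h>
#include <stddef.h>

#define RECENT_SIZE 64     /* size of the small table, from the proposal */
#define CACHE_SIZE  1024   /* hypothetical size of the main table */

typedef struct {
    void *table;           /* pointer handed out by get */
    int ref;               /* reference count */
} PutEntry;

typedef struct {
    PutEntry entries[CACHE_SIZE];
    int size;
    int recent[RECENT_SIZE];   /* ring buffer of recently fetched indices */
    int recent_count;
    int recent_next;
} PutCache;

/* Called from the get path once the entry index is known. */
static void cache_note_get(PutCache *c, int index)
{
    assert(index >= 0 && index < c->size);
    c->entries[index].ref++;
    c->recent[c->recent_next] = index;
    c->recent_next = (c->recent_next + 1) % RECENT_SIZE;
    if (c->recent_count < RECENT_SIZE) {
        c->recent_count++;
    }
}

/*
 * Returns 1 if the entry was found in the recent table, 0 if the full
 * scan was needed, and -1 if the pointer is not in the cache at all.
 */
static int cache_put(PutCache *c, void *table)
{
    for (int i = 0; i < c->recent_count; i++) {
        PutEntry *e = &c->entries[c->recent[i]];
        if (e->table == table) {
            e->ref--;
            return 1;
        }
    }
    for (int i = 0; i < c->size; i++) {
        if (c->entries[i].table == table) {
            c->entries[i].ref--;
            return 0;
        }
    }
    return -1;
}
```

Because the recent table stores indices and the pointer is compared again on
every put, a stale index only costs a failed comparison and never releases
the wrong entry.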

Any ideas?

Thanks,
Zhang Haoyu

* Re: [Qemu-devel] [RFC] optimization for qcow2 cache get/put
  2015-01-26 13:20 [Qemu-devel] [RFC] optimization for qcow2 cache get/put Zhang Haoyu
@ 2015-01-26 14:11 ` Max Reitz
  2015-01-27  1:23 ` Zhang Haoyu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Max Reitz @ 2015-01-26 14:11 UTC
  To: Zhang Haoyu, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, Stefan Hajnoczi

On 2015-01-26 at 08:20, Zhang Haoyu wrote:
> Hi, all
>
> Regarding too large qcow2 image, e.g., 2TB,
> so long disruption happened when performing snapshot,
> which was caused by cache update and IO wait.
> perf top data shown as below,
>     PerfTop:    2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 34294)
> ------------------------------------------------------------------------------------------------------------------------
>
>      33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get
>      27.59%  qemu-system-x86_64  [.] qcow2_cache_put
>      15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty
>       5.49%  qemu-system-x86_64  [.] update_refcount
>       3.02%  libpthread-2.13.so  [.] pthread_getspecific
>       2.26%  qemu-system-x86_64  [.] get_refcount
>       1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state
>       1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
>       1.20%  qemu-system-x86_64  [.] qemu_coroutine_self
>       1.16%  libz.so.1.2.7       [.] 0x0000000000003018
>       0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount
>       0.91%  qemu-system-x86_64  [.] qcow2_cache_get
>       0.76%  libc-2.13.so        [.] 0x0000000000134e49
>       0.73%  qemu-system-x86_64  [.] bdrv_debug_event
>       0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt
>       0.12%  [kernel]            [k] _raw_spin_unlock_irqrestore
>       0.10%  qemu-system-x86_64  [.] vga_draw_line24_32
>       0.09%  [vdso]              [.] 0x000000000000060c
>       0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap
>       0.08%  [kernel]            [k] do_blockdev_direct_IO
>
> If expand the cache table size, the IO will be decreased,
> but the calculation time will be grown.
> so it's worthy to optimize qcow2 cache get and put algorithm.
>
> My proposal:
> get:
> using ((use offset >> cluster_bits) % c->size) to locate the cache entry,
> raw implementation,
> index = (use offset >> cluster_bits) % c->size;
> if (c->entries[index].offset == offset) {
>      goto found;
> }
>
> replace:
> c->entries[use offset >> cluster_bits) % c->size].offset = offset;

Well, direct-mapped caches do have their benefits, but remember that 
they do have disadvantages, too. Regarding CPU caches, set associative 
caches seem to be largely favored, so that may be a better idea.

CC'ing Kevin, because it's his code.

Max

> ...
>
> put:
> using 64-entries cache table to cache
> the recently got c->entries, i.e., cache for cache,
> then during put process, firstly search the 64-entries cache,
> if not found, then the c->entries.
>
> Any idea?
>
> Thanks,
> Zhang Haoyu
>
>

* Re: [Qemu-devel] [RFC] optimization for qcow2 cache get/put
  2015-01-26 13:20 [Qemu-devel] [RFC] optimization for qcow2 cache get/put Zhang Haoyu
  2015-01-26 14:11 ` Max Reitz
@ 2015-01-27  1:23 ` Zhang Haoyu
  2015-01-27  3:53 ` Zhang Haoyu
  2015-03-26 14:33 ` Stefan Hajnoczi
  3 siblings, 0 replies; 5+ messages in thread
From: Zhang Haoyu @ 2015-01-27  1:23 UTC
  To: Max Reitz, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, Stefan Hajnoczi


On 2015-01-26 22:11:59, Max Reitz wrote:
>On 2015-01-26 at 08:20, Zhang Haoyu wrote:
> > Hi, all
> >
> > Regarding too large qcow2 image, e.g., 2TB,
> > so long disruption happened when performing snapshot,
> > which was caused by cache update and IO wait.
> > perf top data shown as below,
> >     PerfTop:    2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 34294)
> > ------------------------------------------------------------------------------------------------------------------------
> >
> >      33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get
> >      27.59%  qemu-system-x86_64  [.] qcow2_cache_put
> >      15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty
> >       5.49%  qemu-system-x86_64  [.] update_refcount
> >       3.02%  libpthread-2.13.so  [.] pthread_getspecific
> >       2.26%  qemu-system-x86_64  [.] get_refcount
> >       1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state
> >       1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
> >       1.20%  qemu-system-x86_64  [.] qemu_coroutine_self
> >       1.16%  libz.so.1.2.7       [.] 0x0000000000003018
> >       0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount
> >       0.91%  qemu-system-x86_64  [.] qcow2_cache_get
> >       0.76%  libc-2.13.so        [.] 0x0000000000134e49
> >       0.73%  qemu-system-x86_64  [.] bdrv_debug_event
> >       0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt
> >       0.12%  [kernel]            [k] _raw_spin_unlock_irqrestore
> >       0.10%  qemu-system-x86_64  [.] vga_draw_line24_32
> >       0.09%  [vdso]              [.] 0x000000000000060c
> >       0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap
> >       0.08%  [kernel]            [k] do_blockdev_direct_IO
> >
> > If expand the cache table size, the IO will be decreased,
> > but the calculation time will be grown.
> > so it's worthy to optimize qcow2 cache get and put algorithm.
> >
> > My proposal:
> > get:
> > using ((use offset >> cluster_bits) % c->size) to locate the cache entry,
> > raw implementation,
> > index = (use offset >> cluster_bits) % c->size;
> > if (c->entries[index].offset == offset) {
> >      goto found;
> > }
> >
> > replace:
> > c->entries[use offset >> cluster_bits) % c->size].offset = offset;
> 
> Well, direct-mapped caches do have their benefits, but remember that 
> they do have disadvantages, too. Regarding CPU caches, set associative 
> caches seem to be largely favored, so that may be a better idea.
>
Thanks, Max.
I think that if a direct-mapped cache is used, we can expand the cache table
size to decrease IO, and locating a cache entry is not expensive even when a
CPU cache miss happens.
Of course, set-associative caches are preferred for CPU caches,
but the sequential traversal algorithm only provides a higher probability
of association, and after running for some time that probability
may be reduced, I guess.
I will test the direct-mapped cache and post the test results soon.

> CC'ing Kevin, because it's his code.
> 
> Max
>
> > ...
> >
> > put:
> > using 64-entries cache table to cache
> > the recently got c->entries, i.e., cache for cache,
> > then during put process, firstly search the 64-entries cache,
> > if not found, then the c->entries.
> >
> > Any idea?
> >
> > Thanks,
> > Zhang Haoyu

* Re: [Qemu-devel] [RFC] optimization for qcow2 cache get/put
  2015-01-26 13:20 [Qemu-devel] [RFC] optimization for qcow2 cache get/put Zhang Haoyu
  2015-01-26 14:11 ` Max Reitz
  2015-01-27  1:23 ` Zhang Haoyu
@ 2015-01-27  3:53 ` Zhang Haoyu
  2015-03-26 14:33 ` Stefan Hajnoczi
  3 siblings, 0 replies; 5+ messages in thread
From: Zhang Haoyu @ 2015-01-27  3:53 UTC
  To: Max Reitz, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, Stefan Hajnoczi


On 2015-01-27 09:24:13, Zhang Haoyu wrote:
> 
> On 2015-01-26 22:11:59, Max Reitz wrote:
> >On 2015-01-26 at 08:20, Zhang Haoyu wrote:
> > > Hi, all
> > >
> > > Regarding too large qcow2 image, e.g., 2TB,
> > > so long disruption happened when performing snapshot,
> > > which was caused by cache update and IO wait.
> > > perf top data shown as below,
> > >     PerfTop:    2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 34294)
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > >      33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get
> > >      27.59%  qemu-system-x86_64  [.] qcow2_cache_put
> > >      15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty
> > >       5.49%  qemu-system-x86_64  [.] update_refcount
> > >       3.02%  libpthread-2.13.so  [.] pthread_getspecific
> > >       2.26%  qemu-system-x86_64  [.] get_refcount
> > >       1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state
> > >       1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
> > >       1.20%  qemu-system-x86_64  [.] qemu_coroutine_self
> > >       1.16%  libz.so.1.2.7       [.] 0x0000000000003018
> > >       0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount
> > >       0.91%  qemu-system-x86_64  [.] qcow2_cache_get
> > >       0.76%  libc-2.13.so        [.] 0x0000000000134e49
> > >       0.73%  qemu-system-x86_64  [.] bdrv_debug_event
> > >       0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt
> > >       0.12%  [kernel]            [k] _raw_spin_unlock_irqrestore
> > >       0.10%  qemu-system-x86_64  [.] vga_draw_line24_32
> > >       0.09%  [vdso]              [.] 0x000000000000060c
> > >       0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap
> > >       0.08%  [kernel]            [k] do_blockdev_direct_IO
> > >
> > > If expand the cache table size, the IO will be decreased,
> > > but the calculation time will be grown.
> > > so it's worthy to optimize qcow2 cache get and put algorithm.
> > >
> > > My proposal:
> > > get:
> > > using ((use offset >> cluster_bits) % c->size) to locate the cache entry,
> > > raw implementation,
> > > index = (use offset >> cluster_bits) % c->size;
> > > if (c->entries[index].offset == offset) {
> > >      goto found;
> > > }
> > >
> > > replace:
> > > c->entries[use offset >> cluster_bits) % c->size].offset = offset;
> > 
> > Well, direct-mapped caches do have their benefits, but remember that 
> > they do have disadvantages, too. Regarding CPU caches, set associative 
> > caches seem to be largely favored, so that may be a better idea.
> >
> Thanks, Max,
> I think if direct-mapped caches were used, we can expand the cache table size
> to decrease IOs, and cache location is not time-expensive even cpu cache miss
> happened.
> Of course set associative caches is preferred regarding cpu caches,
> but sequential traverse algorithm only provides more probability
> for association, but after running some time, the probability
> of association maybe reduced, I guess.
> I will test the direct-mapped cache, and test result will be posted soon.
> 
I've tested the direct-mapped cache. Conflicts in cache location caused
about 4000 IOs while taking a snapshot of a 2TB thin-provisioned qcow2 image,
but the overhead of qcow2_cache_do_get() decreased significantly, from
33.80% to 10.43%.
I'll try a two-dimensional cache to eliminate most of that IO, possibly all
of it, with 4 as the default size of the second dimension.
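
One way to read the two-dimensional cache is as a set-associative table
where the first dimension is selected by the cluster number and the second
dimension holds 4 candidate slots. The sketch below assumes that reading.
SaEntry, SaCache and sa_get are hypothetical names, NUM_SETS and
CLUSTER_BITS are arbitrary example values, and the least-recently-used
victim choice is an assumption that the message does not specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS     4    /* first dimension, hypothetical */
#define NUM_WAYS     4    /* second dimension, 4 as in the message */
#define CLUSTER_BITS 16   /* assumes 64 KiB clusters */

typedef struct {
    uint64_t offset;      /* image offset held in this slot */
    bool valid;
    uint64_t last_use;    /* logical timestamp used for LRU (assumption) */
} SaEntry;

typedef struct {
    SaEntry entries[NUM_SETS][NUM_WAYS];
    uint64_t clock;       /* incremented on every access */
} SaCache;

/* Returns true on a hit; on a miss, fills an empty or the LRU way. */
static bool sa_get(SaCache *c, uint64_t offset)
{
    SaEntry *set = c->entries[(offset >> CLUSTER_BITS) % NUM_SETS];
    SaEntry *victim = &set[0];

    c->clock++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (set[w].valid && set[w].offset == offset) {
            set[w].last_use = c->clock;
            return true;
        }
        if (!set[w].valid) {
            /* Prefer the first empty way over evicting anything. */
            if (victim->valid) {
                victim = &set[w];
            }
        } else if (victim->valid && set[w].last_use < victim->last_use) {
            victim = &set[w];
        }
    }
    victim->offset = offset;
    victim->valid = true;
    victim->last_use = c->clock;
    return false;
}
```

With 4 ways, up to four conflicting offsets can stay cached at once, at the
cost of at most four comparisons per lookup.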

Any ideas?

> > CC'ing Kevin, because it's his code.
> > 
> > Max
> >
> > > ...
> > >
> > > put:
> > > using 64-entries cache table to cache
> > > the recently got c->entries, i.e., cache for cache,
> > > then during put process, firstly search the 64-entries cache,
> > > if not found, then the c->entries.
I've tried a c->last_used_cache pointer to the most recently fetched
cache entry; the overhead of qcow2_cache_put() decreased significantly,
from 27.59% to 5.38%.
I've also traced the c->last_used_cache miss rate, which was exactly zero.
I'll test again.
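
A minimal sketch of the last_used_cache idea, assuming put identifies an
entry by its table pointer and falls back to a linear scan on a miss. Only
the field name last_used_cache comes from the message; LuEntry, LuCache,
lu_get and lu_put are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    void *table;   /* pointer handed out by get */
    int ref;       /* reference count */
} LuEntry;

typedef struct {
    LuEntry *entries;
    int size;
    LuEntry *last_used_cache;   /* most recently fetched entry, or NULL */
} LuCache;

/* Takes a reference on entry `index` and remembers it for the next put. */
static void *lu_get(LuCache *c, int index)
{
    LuEntry *e;

    assert(index >= 0 && index < c->size);
    e = &c->entries[index];
    e->ref++;
    c->last_used_cache = e;
    return e->table;
}

/*
 * Drops a reference. Returns 0 on success, -1 if the pointer is unknown.
 * *scanned reports how many entries the fallback scan compared; it stays
 * 0 when the remembered entry matches.
 */
static int lu_put(LuCache *c, void *table, int *scanned)
{
    *scanned = 0;
    if (c->last_used_cache && c->last_used_cache->table == table) {
        c->last_used_cache->ref--;
        return 0;
    }
    for (int i = 0; i < c->size; i++) {
        (*scanned)++;
        if (c->entries[i].table == table) {
            c->entries[i].ref--;
            return 0;
        }
    }
    return -1;
}
```

In this model a zero miss rate is what one would see if every get is
followed by its matching put before the next get. That is an assumption
about the access pattern, not something the message states.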
> > >
> > > Any idea?
> > >
> > > Thanks,
> > > Zhang Haoyu

* Re: [Qemu-devel] [RFC] optimization for qcow2 cache get/put
  2015-01-26 13:20 [Qemu-devel] [RFC] optimization for qcow2 cache get/put Zhang Haoyu
                   ` (2 preceding siblings ...)
  2015-01-27  3:53 ` Zhang Haoyu
@ 2015-03-26 14:33 ` Stefan Hajnoczi
  3 siblings, 0 replies; 5+ messages in thread
From: Stefan Hajnoczi @ 2015-03-26 14:33 UTC
  To: Zhang Haoyu; +Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel

On Mon, Jan 26, 2015 at 09:20:00PM +0800, Zhang Haoyu wrote:
> Hi, all
> 
> Regarding too large qcow2 image, e.g., 2TB,
> so long disruption happened when performing snapshot,
> which was caused by cache update and IO wait.

I have CCed Kevin Wolf, the qcow2 maintainer.

> perf top data shown as below,
>    PerfTop:    2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 34294)
> ------------------------------------------------------------------------------------------------------------------------
> 
>     33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get            
>     27.59%  qemu-system-x86_64  [.] qcow2_cache_put               
>     15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty  
>      5.49%  qemu-system-x86_64  [.] update_refcount               
>      3.02%  libpthread-2.13.so  [.] pthread_getspecific           
>      2.26%  qemu-system-x86_64  [.] get_refcount                  
>      1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state    
>      1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
>      1.20%  qemu-system-x86_64  [.] qemu_coroutine_self           
>      1.16%  libz.so.1.2.7       [.] 0x0000000000003018            
>      0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount 
>      0.91%  qemu-system-x86_64  [.] qcow2_cache_get               
>      0.76%  libc-2.13.so        [.] 0x0000000000134e49            
>      0.73%  qemu-system-x86_64  [.] bdrv_debug_event              
>      0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt       
>      0.12%  [kernel]            [k] _raw_spin_unlock_irqrestore   
>      0.10%  qemu-system-x86_64  [.] vga_draw_line24_32            
>      0.09%  [vdso]              [.] 0x000000000000060c            
>      0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap  
>      0.08%  [kernel]            [k] do_blockdev_direct_IO  
> 
> If expand the cache table size, the IO will be decreased, 
> but the calculation time will be grown.
> so it's worthy to optimize qcow2 cache get and put algorithm.
> 
> My proposal:
> get:
> using ((use offset >> cluster_bits) % c->size) to locate the cache entry,
> raw implementation,
> index = (use offset >> cluster_bits) % c->size;
> if (c->entries[index].offset == offset) {
>     goto found;
> }
> 
> replace:
> c->entries[use offset >> cluster_bits) % c->size].offset = offset;
> ...
> 
> put:
> using 64-entries cache table to cache
> the recently got c->entries, i.e., cache for cache,
> then during put process, firstly search the 64-entries cache,
> if not found, then the c->entries.
> 
> Any idea?
> 
> Thanks,
> Zhang Haoyu
> 
