[Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk

* [Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk
@ 2015-05-08  5:56 Eugene Istomin
  2015-05-11  8:48 ` Eugene Istomin
  0 siblings, 1 reply; 6+ messages in thread
From: Eugene Istomin @ 2015-05-08  5:56 UTC (permalink / raw)
  To: ocfs2-devel

Hello,

after deploying reflink-based VM snapshots to production servers we discovered 
a performance degradation:

OS: Opensuse 13.1, 13.2
Hypervisors: Xen 4.4, 4.5
Dom0 kernels: 3.12, 3.16, 3.18
DomU kernels: 3.12, 3.16, 3.18
Tested DomU disk backends: tapdisk2, qdisk


1) on DomU (VM) 
#dd if=/dev/zero of=test2 bs=1M count=6000

2) atop on Dom0:
sdb - busy:92% - read:375 - write:130902
Reads are from other VMs, seems OK

3) DomU dd finished:
6291456000 bytes (6.3 GB) copied, 16.6265 s, 378 MB/s

4) Let's start dd again & take a snapshot:
#dd if=/dev/zero of=test2 bs=1M count=6000
#reflink test.raw ref/

5) atop on Dom0:
sdb - busy:97% - read:112740 - write:28037
So, Read IOPS = 112740, why?

6) DomU dd finished:
6291456000 bytes (6.3 GB) copied, 175.45 s, 35.9 MB/s

7) Second & further reflinks do not change the atop stat & dd time
#dd if=/dev/zero of=test2 bs=1M count=6000
#reflink --backup=t test.raw ref/    \\ * n times
~ 6291456000 bytes (6.3 GB) copied, 162.959 s, 38.6 MB/s

The question is: why does reflinking a running VM disk lead to a read IOPS storm?
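
For scale, the figures reported in steps 2 to 6 can be put side by side. This is only a quick arithmetic sketch: the numbers are copied from the report, the variable names are made up for illustration, and it assumes the two atop samples cover comparable intervals so that their read and write counters can be compared directly.

```python
# Figures copied from steps 2-6 of the report; names are illustrative only.
BYTES_WRITTEN = 6_291_456_000   # dd wrote 6000 blocks of 1 MiB

baseline = {"seconds": 16.6265, "reads": 375, "writes": 130_902}
with_reflink = {"seconds": 175.45, "reads": 112_740, "writes": 28_037}


def throughput_mb_s(seconds: float) -> float:
    """Throughput in MB/s (10^6 bytes per second), the unit dd prints."""
    return BYTES_WRITTEN / seconds / 1e6


# How much longer dd took once a reflink existed.
slowdown = with_reflink["seconds"] / baseline["seconds"]
# How the atop read counter grew (assumes comparable sampling intervals).
read_growth = with_reflink["reads"] / baseline["reads"]
# How the atop write counter shrank over the same comparison.
write_drop = baseline["writes"] / with_reflink["writes"]

print(f"baseline throughput:     {throughput_mb_s(baseline['seconds']):.1f} MB/s")
print(f"throughput with reflink: {throughput_mb_s(with_reflink['seconds']):.1f} MB/s")
print(f"dd slowdown:             {slowdown:.1f}x")
print(f"read counter growth:     {read_growth:.0f}x")
print(f"write counter drop:      {write_drop:.1f}x")
```

The two throughput lines reproduce the 378 MB/s and 35.9 MB/s that dd printed, which confirms that dd reports decimal megabytes.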


Thanks!

-- 
Best regards,
Eugene Istomin

* [Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk
  2015-05-08  5:56 [Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk Eugene Istomin
@ 2015-05-11  8:48 ` Eugene Istomin
  2015-05-18 10:05   ` Eugene Istomin
  0 siblings, 1 reply; 6+ messages in thread
From: Eugene Istomin @ 2015-05-11  8:48 UTC (permalink / raw)
  To: ocfs2-devel

Hello Goldwyn,

Do you know something about such behavior?
The question is why a reflink operation on VM disk leads to plenty of read ops?
Is this related to CoW specific structures? 

We can provide other details & ssh access to the testbed.

-- 
Best regards,
Eugene Istomin


* [Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk
  2015-05-11  8:48 ` Eugene Istomin
@ 2015-05-18 10:05   ` Eugene Istomin
  2015-05-18 17:45     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 6+ messages in thread
From: Eugene Istomin @ 2015-05-18 10:05 UTC (permalink / raw)
  To: ocfs2-devel

Hello, 

ping
-- 
Best regards,
Eugene Istomin

* [Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk
  2015-05-18 10:05   ` Eugene Istomin
@ 2015-05-18 17:45     ` Goldwyn Rodrigues
  2015-05-20 22:33       ` Eugene Istomin
  0 siblings, 1 reply; 6+ messages in thread
From: Goldwyn Rodrigues @ 2015-05-18 17:45 UTC (permalink / raw)
  To: ocfs2-devel


Hi Eugene,

Sorry, I had been busy with other work and this slipped past me on the list.

> > Do you know something about such behavior?
> > The question is why a reflink operation on VM disk leads to plenty of read
> > ops? Is this related to CoW specific structures?

This is in fact related to CoW. An ocfs2 file is an extent tree, with 
the extent headers marking whether the extent is reflinked or not, 
along with the number of reflinks.

If you perform a reflink on a file which is being changed constantly, 
you not only recreate the extent tree, but also decrease the refcount of 
the extents already present. Add to that the extents which need to be 
read for replication.
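
The effect described above can be illustrated with a toy model. This is a hedged sketch, not ocfs2 code: `ToyVolume`, `ToyFile` and their counters are hypothetical names, the model keeps one refcount per cluster instead of a refcount tree, and it assumes every write is smaller than a cluster, so the old cluster has to be read before it can be copied.

```python
class ToyVolume:
    """Toy model of refcounted clusters; not the ocfs2 on-disk format."""

    def __init__(self):
        self.refcount = {}          # physical cluster -> number of owners
        self.next_free = 0
        self.refcount_lookups = 0   # metadata lookups caused by CoW checks
        self.cluster_copies = 0     # old clusters read so they can be copied

    def allocate(self):
        cluster = self.next_free
        self.next_free += 1
        self.refcount[cluster] = 1
        return cluster


class ToyFile:
    def __init__(self, volume, mapping, flagged=()):
        self.volume = volume
        self.mapping = list(mapping)   # logical cluster -> physical cluster
        self.flagged = set(flagged)    # logical clusters marked as reflinked

    @classmethod
    def create(cls, volume, n_clusters):
        return cls(volume, [volume.allocate() for _ in range(n_clusters)])

    def reflink(self):
        """Share every cluster with a new file and flag both copies."""
        for physical in self.mapping:
            self.volume.refcount[physical] += 1
        self.flagged = set(range(len(self.mapping)))
        return ToyFile(self.volume, self.mapping, self.flagged)

    def write(self, logical):
        """Partial write to one logical cluster (assumed smaller than a cluster)."""
        if logical not in self.flagged:
            return                     # fast path: overwrite in place
        vol = self.volume
        old = self.mapping[logical]
        vol.refcount_lookups += 1      # must consult the refcount first
        if vol.refcount[old] > 1:
            vol.cluster_copies += 1    # read old data, write it to a new cluster
            vol.refcount[old] -= 1
            self.mapping[logical] = vol.allocate()
        self.flagged.discard(logical)


vol = ToyVolume()
disk = ToyFile.create(vol, 100)
for i in range(100):
    disk.write(i)
copies_before = vol.cluster_copies         # no reflink yet

snap1 = disk.reflink()
for i in range(100):
    disk.write(i)
copies_after_first = vol.cluster_copies    # every write had to copy

for i in range(100):
    disk.write(i)                          # second pass, no new reflink
copies_second_pass = vol.cluster_copies

snap2 = disk.reflink()
for i in range(100):
    disk.write(i)
copies_after_second = vol.cluster_copies   # a new reflink repeats the cost

print(copies_before, copies_after_first, copies_second_pass, copies_after_second)
# prints: 0 100 100 200
```

In this model a write costs no reads until a reflink exists, and every new reflink makes the next pass of writes pay for one lookup and one cluster copy each. A real refcount tree adds further metadata reads that the sketch leaves out.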


HTH,

-- 
Goldwyn

* [Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk
  2015-05-18 17:45     ` Goldwyn Rodrigues
@ 2015-05-20 22:33       ` Eugene Istomin
  2015-05-21 11:57         ` Goldwyn Rodrigues
  0 siblings, 1 reply; 6+ messages in thread
From: Eugene Istomin @ 2015-05-20 22:33 UTC (permalink / raw)
  To: ocfs2-devel

Goldwyn,

thanks for the answer!

I read 
https://oss.oracle.com/osswiki/OCFS2(2f)DesignDocs(2f)RefcountTrees.html  
carefully to understand the problem.

As I understand it:
 1. There are B-Tree structures for reflink: ocfs2_refcount_tree;
    ocfs2_refcount_block -> ocfs2_refcount_list -> ocfs2_refcount_rec
 2. "The refcount tree root is a refcount block pointed to by
    i_refcount_loc"
 3. Some operations need extra uncached lookups

Also, I dumped frag/stat/refcount from a production hypervisor node using 
debugfs.ocfs2; the files are attached (alternative URL: 
http://public.edss.ee/tmp/debugfs.tar.gz ).

Hypervisor OCFS2 mount options: 
rw,nosuid,noexec,noatime,heartbeat=none,nointr,data=ordered,errors=remount-ro,localalloc=2048,coherency=full,user_xattr,acl

Mkfs string:
mkfs.ocfs2 -b 4KB -C 1MB -N 2 -T vmstore -L "storage" --fs-features=local,backup-super,sparse,unwritten,inline-data,metaecc,refcount,xattr,indexed-dirs,discontig-bg


Can you please explain why there are so many extent blocks (204)? Is it really 
impossible to store plenty of clusters in a single extent (like #25, block 
3874095 -> 20847 clusters)? 
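
To give a sense of how much data the single extent in that example covers, here is a small arithmetic sketch. It assumes that `-C 1MB` in the mkfs string above means a cluster of 1 MiB (1048576 bytes); the variable names are illustrative.

```python
CLUSTER_BYTES = 1024 * 1024   # assumption: "-C 1MB" means 1 MiB per cluster
extent_clusters = 20847       # extent #25 from the debugfs.ocfs2 dump

extent_bytes = extent_clusters * CLUSTER_BYTES
extent_gib = extent_bytes / 1024**3

print(f"extent #25 covers {extent_bytes} bytes, about {extent_gib:.2f} GiB")
```

Under that assumption a single extent record can describe roughly 20 GiB of contiguous data, so the question is why the rest of the file is not mapped in similarly large pieces.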

-- 
Best regards,
Eugene Istomin
IT Architect

-------------- next part --------------
A non-text attachment was scrubbed...
Name: debugfs.tar.gz
Type: application/x-compressed-tar
Size: 729820 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20150521/23cd43e2/attachment-0001.bin 

* [Ocfs2-devel] Read IOPS storm in case of reflinking running VM disk
  2015-05-20 22:33       ` Eugene Istomin
@ 2015-05-21 11:57         ` Goldwyn Rodrigues
  0 siblings, 0 replies; 6+ messages in thread
From: Goldwyn Rodrigues @ 2015-05-21 11:57 UTC (permalink / raw)
  To: ocfs2-devel

On 05/20/2015 05:33 PM, Eugene Istomin wrote:
> Goldwyn,
>
> thanks for the answer!
>
> I read
> https://oss.oracle.com/osswiki/OCFS2(2f)DesignDocs(2f)RefcountTrees.html
> carefully to understand the problem.
>
> As i understand:
>
>  1. There are B-Tree structures for reflink: ocfs2_refcount_tree;
>     ocfs2_refcount_block -> ocfs2_refcount_list -> ocfs2_refcount_rec
>  2. "The refcount tree root is a refcount block pointed to by
>     i_refcount_loc"
>  3. Some operations needs extra uncached lookups
>
> Also i dumped frag/stat/refcount from production hypervisor node using
> debugfs.ocfs2, files are in attach (url as alt way -
> http://public.edss.ee/tmp/debugfs.tar.gz ).
>
> Hypervisor OCFS2 mount options:
> rw,nosuid,noexec,noatime,heartbeat=none,nointr,data=ordered,errors=remount-ro,localalloc=2048,coherency=full,user_xattr,acl
>
> Mkfs string:
>
> mkfs.ocfs2 -b 4KB -C 1MB -N 2 -T vmstore -L "storage"
> --fs-features=local,backup-super,sparse,unwritten,inline-data,metaecc,refcount,xattr,indexed-dirs,discontig-bg
>
> Can you please explain why there are so many extent blocks (204)? Is it
> really impossible to store plenty of clusters in single extent (like
> #25, block 3874095 -> 20847 clusters)?
>

A file's extent tree depends on your usage pattern and on what is already 
present on disk. Creating a new file with large block writes, on a new 
filesystem with no other nodes, may produce a file with a small number of 
extents.

Modifying refcounted files can increase the number of extents. The answer 
lies in the document you mentioned:

<quote>

Refcount records do not map 1:1 with extent records. A large extent may 
be split by a CoW operation. To unchanged inodes, they have one extent 
record covering the entire extent. The changed inode will have an extent 
record for the unchanged portion and a new extent record for the changed 
portion. The refcount tree will have similarly split the single refcount 
record into two. The changed portion will have decremented the reference 
count by one, as the changed inode is no longer using that physical extent.

</quote>
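
The splitting described in the quoted passage can be sketched as follows. This is a simplified illustration under stated assumptions: `Extent` and `cow_write` are hypothetical names, extents are tracked only by logical range and a shared flag, and neighbouring extents are never merged back together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    start: int     # first logical cluster covered
    length: int    # number of clusters covered
    shared: bool   # True while the physical clusters are still refcounted


def cow_write(extents, start, length):
    """Return the extent list after a CoW write to [start, start + length)."""
    end = start + length
    result = []
    for ext in extents:
        ext_end = ext.start + ext.length
        # Unshared or non-overlapping extents are left untouched.
        if not ext.shared or ext_end <= start or ext.start >= end:
            result.append(ext)
            continue
        if ext.start < start:                      # unchanged head, still shared
            result.append(Extent(ext.start, start - ext.start, True))
        lo, hi = max(ext.start, start), min(ext_end, end)
        result.append(Extent(lo, hi - lo, False))  # changed part, newly allocated
        if ext_end > end:                          # unchanged tail, still shared
            result.append(Extent(end, ext_end - end, True))
    return result


# One large shared extent, sized like the 20847-cluster example in this thread.
extents = [Extent(0, 20847, True)]
after_one = cow_write(extents, 100, 1)
after_two = cow_write(after_one, 5000, 2)

print(len(extents), len(after_one), len(after_two))
# prints: 1 3 5
```

Each small write into the middle of a shared extent turns one record into three in this model, so a file that receives scattered writes after a reflink accumulates extent records quickly.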

HTH,

-- 
Goldwyn
