* use object size of 32k rather than 4M
@ 2015-12-23 12:00 hzwulibin
       [not found] ` <567A8CD6.10707-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: hzwulibin @ 2015-12-23 12:00 UTC (permalink / raw)
  To: ceph-devel, ceph-users; +Cc: Sage Weil, Haomai Wang

Hi, cephers, Sage and Haomai

We recently ran into a severe performance drop during recovery. The scenario is simple:
1. Run fio with random writes (bs=4k).
2. Stop one OSD, sleep 10, then start the OSD again.
3. IOPS drops from 6K to about 200.

We now know that the SSD backing that OSD is the bottleneck during recovery. After reading the code, we found that the IO on that SSD comes from two sources:
1. Normal recovery IO.
2. User IO that hits an object in the missing list, so the whole 4M object has to be recovered first.

So our first step was to limit recovery IO to reduce the load on that SSD. That helps in some scenarios, but not this one.


We have 36 OSDs with 3 replicas, so when one OSD goes down, about 1/12 of the objects will be in a degraded state.
When we run fio with 4k randwrite, about 1/12 of the IOs will stall until the 4M object they touch has been recovered.
That greatly amplifies the load on that SSD.

To reduce this amplification, we want to change the default object size from 4M to 32k.
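
A rough sketch of the arithmetic behind this, assuming objects are placed uniformly across OSDs and that every blocked 4k write forces recovery of the whole object it touches. The helper names are illustrative only and are not Ceph APIs:

```python
# Back-of-the-envelope model of the recovery amplification described above.
# Assumptions (not taken from Ceph itself): uniform object placement, and a
# blocked write triggers recovery of exactly one full object.

KIB = 1024
MIB = 1024 * KIB


def degraded_fraction(num_osds: int, replicas: int) -> float:
    """Fraction of objects that have a copy on one given OSD."""
    return replicas / num_osds


def recovery_amplification(object_size: int, write_size: int) -> float:
    """Bytes that must be recovered per byte of blocked user write."""
    return object_size / write_size


frac = degraded_fraction(num_osds=36, replicas=3)
for size in (4 * MIB, 32 * KIB):
    amp = recovery_amplification(size, 4 * KIB)
    print(f"object={size // KIB} KiB  degraded={frac:.3f}  amplification={amp:.0f}x")
```

Under these assumptions a 4k write to a missing 4M object costs 1024 times its own size in recovery traffic, against 8 times for a 32k object.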

We know that this will increase the number of objects per OSD and make removal take longer.

So I would like to ask you: are there any other potential problems with a 32k object size? If there is no obvious problem, we could dive into it and do more testing.

Many thanks!
 				
--------------
hzwulibin
2015-12-23


* Re: use object size of 32k rather than 4M
       [not found] ` <567A8CD6.10707-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2015-12-23 12:57   ` Van Leeuwen, Robert
  2015-12-23 13:14     ` [ceph-users] " hzwulibin
  0 siblings, 1 reply; 4+ messages in thread
From: Van Leeuwen, Robert @ 2015-12-23 12:57 UTC (permalink / raw)
  To: hzwulibin, ceph-devel, ceph-users


>To reduce this amplification, we want to change the default object size from 4M to 32k.
>
>We know that this will increase the number of objects per OSD and make removal take longer.
>
>So I would like to ask you: are there any other potential problems with a 32k object size? If there is no obvious problem, we could dive into
>it and do more testing.


I assume the objects on the OSD's filesystem will become 32k files when you do this.
So if you have 1TB of data on one OSD, you will have about 31 million files, i.e. 31 million inodes.
This excludes the directory structure, which might also be significant.
If you have 10 OSDs on a server, you will easily hit 310 million inodes.
You will need a LOT of memory to make sure the inodes stay cached, and even then looking up an inode might add significant latency.
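
The file-count estimate can be reproduced with a short sketch, assuming one file per object, decimal terabytes, and no directory or metadata overhead. The function name is hypothetical:

```python
# Estimate how many object files an OSD holds for a given object size.
# Assumption: each object maps to exactly one file (and one inode).

def files_for_data(data_bytes: int, object_size: int) -> int:
    """Number of objects needed to hold data_bytes (ceiling division)."""
    return -(-data_bytes // object_size)


TB = 10**12
per_osd_32k = files_for_data(1 * TB, 32 * 1024)
per_osd_4m = files_for_data(1 * TB, 4 * 1024 * 1024)

print(f"32k objects per 1TB OSD: {per_osd_32k / 1e6:.1f} million")
print(f"4M objects per 1TB OSD:  {per_osd_4m / 1e6:.2f} million")
print(f"10 OSDs at 32k:          {10 * per_osd_32k / 1e6:.0f} million inodes")
```

Going from 4M to 32k objects multiplies the file count by 128 for the same amount of data.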

My guess is that it will be fast in the beginning but will grind to a halt as the cluster fills up, because the inodes will no longer fit in memory.

This also does not take into account any other bottlenecks you might hit in Ceph, which other users can probably answer better.


Cheers,
Robert van Leeuwen


* Re: [ceph-users] use object size of 32k rather than 4M
  2015-12-23 12:57   ` Van Leeuwen, Robert
@ 2015-12-23 13:14     ` hzwulibin
  2015-12-23 13:57       ` Van Leeuwen, Robert
  0 siblings, 1 reply; 4+ messages in thread
From: hzwulibin @ 2015-12-23 13:14 UTC (permalink / raw)
  To: Van Leeuwen, Robert, ceph-devel, ceph-users

Hi, Robert

Thanks for your quick reply. Yes, the number of files really could be a problem. But if it is only a memory problem, we could add more memory to our OSD servers.

Also, I tested it on XFS using mdtest; here is the result:


$ sudo ~/wulb/bin/mdtest -I 10000 -z 1 -b 1024 -R -F
--------------------------------------------------------------------------
[[10342,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 10-180-0-34

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
-- started at 12/23/2015 18:59:16 --

mdtest-1.8.3 was launched with 1 total task(s) on 1 nodes
Command line used: /home/ceph/wulb/bin/mdtest -I 10000 -z 1 -b 1024 -R -F
Path: /home/ceph
FS: 824.5 GiB   Used FS: 4.8%   Inodes: 52.4 Mi   Used Inodes: 0.6%
random seed: 1450868356

1 tasks, 10250000 files

SUMMARY: (of 1 iterations)
   Operation                  Max        Min       Mean    Std Dev
   ---------                  ---        ---       ----    -------
   File creation     :  44660.505  44660.505  44660.505      0.000
   File stat         : 693747.783 693747.783 693747.783      0.000
   File read         : 365319.444 365319.444 365319.444      0.000
   File removal      :  62064.560  62064.560  62064.560      0.000
   Tree creation     :  69680.729  69680.729  69680.729      0.000
   Tree removal      :    352.905    352.905    352.905      0.000


From what I tested, the File stat and File read rates do not slow down much. So, could I say that operations such as looking up a file
will not get much slower just because the number of files increases?
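
For scale, the file count of this run can be reconstructed from the mdtest flags. This is a sketch that assumes mdtest places `-I` items in every directory of a tree with depth `-z` and branching factor `-b`; that layout is inferred from the "10250000 files" line in the output above, and the function name is illustrative:

```python
# Reconstruct the number of files created by the mdtest run above.
# Assumption: the tree has sum(branch**level) directories over all levels
# 0..depth, and each directory receives items_per_dir files.

def mdtest_total_files(items_per_dir: int, depth: int, branch: int) -> int:
    dirs = sum(branch**level for level in range(depth + 1))
    return dirs * items_per_dir


total = mdtest_total_files(items_per_dir=10000, depth=1, branch=1024)
print(f"files created by the test: {total}")

# Compare with the roughly 31 million files per 1TB OSD estimated earlier
# in the thread for 32k objects.
print(f"share of one full 1TB OSD: {total / 31_000_000:.0%}")
```

So this run covers about a third of the file count estimated for a single full 1TB OSD with 32k objects.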


------------------				 
hzwulibin
2015-12-23



* Re: [ceph-users] use object size of 32k rather than 4M
  2015-12-23 13:14     ` [ceph-users] " hzwulibin
@ 2015-12-23 13:57       ` Van Leeuwen, Robert
  0 siblings, 0 replies; 4+ messages in thread
From: Van Leeuwen, Robert @ 2015-12-23 13:57 UTC (permalink / raw)
  To: hzwulibin, ceph-devel, ceph-users

>Thanks for your quick reply. Yes, the number of files really could be a problem. But if it is only a memory problem, we could add more memory to our OSD
>servers.

Adding more memory might not be a viable solution:
Ceph does not document how much data it stores in an inode, but the docs say the xattr space of ext4 is not big enough.
Assuming XFS will use 512 bytes per inode is probably very optimistic.
So for e.g. 300 million inodes you are talking about at least 150GB.
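
A minimal sketch of that memory estimate, assuming a flat 512 bytes of cached metadata per inode (the optimistic figure above) and ignoring dentries and other kernel overhead. The function name is hypothetical:

```python
# Lower-bound estimate of the memory needed to keep all inodes cached.
# Assumption: a fixed per-inode footprint; real usage is likely higher.

def inode_cache_bytes(num_inodes: int, bytes_per_inode: int) -> int:
    return num_inodes * bytes_per_inode


total = inode_cache_bytes(num_inodes=300_000_000, bytes_per_inode=512)
print(f"{total / 10**9:.1f} GB")  # prints "153.6 GB"
```

Doubling the per-inode footprint to 1 KiB would push the same estimate past 300GB, which is why the 512-byte figure matters.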

>
>Also, I tested it on XFS using mdtest; here is the result:
>
>
>FS: 824.5 GiB   Used FS: 4.8%   Inodes: 52.4 Mi   Used Inodes: 0.6%

52 million files without extended attributes is probably not a real life scenario for a filled up ceph node with multiple OSDs.

Cheers,
Robert van Leeuwen

