* some performance issue
From: sheng qiu @ 2013-02-01 20:20 UTC
To: ceph-devel

Hi,

I ran an experiment that gave some interesting results.

I created two OSDs (ext4), each on its own SSD attached to the same PC. I also configured one monitor and one MDS on that PC, so the OSDs, monitor and MDS are all located on the same node.

I started the Ceph service and mounted Ceph on a local directory on that PC, so the client, OSDs, monitor and MDS are all on the same node. I assumed this would exclude the network communication cost.

I ran the fio benchmark, which creates one 10 GB file (larger than main memory) on the Ceph mount point. It performs sequential read/write and random read/write on the file and reports the throughput.

Next I unmounted Ceph and stopped the Ceph service. I created ext4 on the same SSD that had been used as an OSD, then ran the same workloads and recorded the throughput.

Here are the results (throughput in KB/s):

        Seq-read  Rand-read  Seq-write  Rand-write
ceph        7378       4740        790        1211
ext4       58260      17334      54697       34257

As you can see, Ceph shows a huge performance drop, even though the monitor, MDS, client and OSDs are all on the same physical machine. Another interesting thing is that sequential write has lower throughput than random write under Ceph. I am not sure why.

Does anyone have an idea why Ceph shows this performance drop?

Thanks,
Sheng

--
Sheng Qiu
Texas A & M University
Room 332B Wisenbaker
email: herbert1984106@gmail.com
College Station, TX 77843-3259
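
To put the table in perspective, the slowdown factors can be computed directly from the posted numbers. This is only a small sketch: the dictionary layout and the function name are illustrative, and the only data used are the figures from the message above.

```python
# Throughput in KB/s, copied from the results table in the message above.
results = {
    "ceph": {"seq-read": 7378, "rand-read": 4740, "seq-write": 790, "rand-write": 1211},
    "ext4": {"seq-read": 58260, "rand-read": 17334, "seq-write": 54697, "rand-write": 34257},
}


def slowdown(table):
    """Return ext4 throughput divided by Ceph throughput for each workload."""
    return {w: table["ext4"][w] / table["ceph"][w] for w in table["ceph"]}


for workload, factor in slowdown(results).items():
    print(f"{workload:10s} ext4 is {factor:5.1f}x faster than ceph")
```

The write workloads stand out: sequential write is roughly 69x slower on Ceph, while the reads are within a single-digit factor.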
* Re: some performance issue
From: Mark Nelson @ 2013-02-01 21:10 UTC
To: sheng qiu; +Cc: ceph-devel@vger.kernel.org

On 02/01/2013 02:20 PM, sheng qiu wrote:
> [...]
> here are the results:
>
> (throughput kb/s)  Seq-read  Rand-read  Seq-write  Rand-write
> ceph                   7378       4740        790        1211
> ext4                  58260      17334      54697       34257
>
> as you see, the ceph has huge performance down, even monitor, mds,
> client and osds locate on the same physical machine.
> another interesting thing is the seq-write has lower throughput
> compared with random-write under ceph. not quite clear....
>
> does anyone have idea about why ceph has that performance down?

Hi Sheng,

Are you using RBD or CephFS (and kernel or userland clients)? How much replication? Also, what fio settings?

In general, it is difficult to make distributed storage systems perform as well as local storage for small read/write workloads. You need a lot of concurrency to hide the latencies, and if the local storage is incredibly fast (like an SSD!) you have a huge uphill battle.

Regarding the network: even though you ran everything on localhost, Ceph is still using TCP sockets to do all of the communication.

Having said that, I think we can do better than 790 KB/s for sequential writes, even if it's 2x replication. The trick is to find where in the stack things are getting held up. You might want to look at tools like iostat and collectl, and look at some of the op latency data in the Ceph admin socket. A basic introduction is given in Sebastien's article here:

http://www.sebastien-han.fr/blog/2012/08/14/ceph-admin-socket/
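
The remark about needing concurrency to hide latency can be made concrete with Little's law: sustained IOPS is roughly the number of outstanding requests divided by the per-request latency. The sketch below is only an illustration. The 0.1 ms and 1 ms latencies and the queue depths are made-up values, not measurements from this thread.

```python
def iops(outstanding_requests, latency_seconds):
    """Little's law: throughput = concurrency / latency."""
    return outstanding_requests / latency_seconds


# Hypothetical per-request latencies, chosen only for illustration.
local_ssd_latency = 0.0001    # 0.1 ms: a fast local device
distributed_latency = 0.001   # 1 ms: extra hops through sockets, replication, journaling

for depth in (1, 4, 16):
    print(
        f"queue depth {depth:2d}: "
        f"local {iops(depth, local_ssd_latency):9.0f} IOPS, "
        f"distributed {iops(depth, distributed_latency):9.0f} IOPS"
    )
```

Under these assumed numbers the distributed path needs ten times the queue depth to match the local device, which is the uphill battle described above.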
* Re: some performance issue
From: sheng qiu @ 2013-02-04 15:36 UTC
To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

Hi Mark,

Thanks a lot for your reply.

On Fri, Feb 1, 2013 at 3:10 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
> Are you using RBD or CephFS (and kernel or userland clients?) How much
> replication? Also, what FIO settings?

I am using CephFS with the kernel client. Replication is at the default (3?). fio is run with the ssd-test script; the I/O request size is 4 KB.

> In general, it is difficult to make distributed storage systems perform as
> well as local storage for small read/write workloads. You need a lot of
> concurrency to hide the latencies, and if the local storage is incredibly
> fast (like an SSD!) you have a huge uphill battle.
>
> Regarding the network, Even though you ran everything on localhost, ceph is
> still using TCP sockets to do all of the communication.

I guess that when the network stack sees that the remote IP is actually a local address, it passes the sent packets directly to the receive buffer. Is that right?

> Having said that, I think we can do better than 790 KB/s for seq writes,
> even if it's 2x replication. The trick is to find where in the stack things
> are getting held up. You might want to look at tools like iostat and
> collectl, and look at some of the op latency data in the ceph admin socket.
> A basic introduction is described in sebastian's article here:
>
> http://www.sebastien-han.fr/blog/2012/08/14/ceph-admin-socket/

I will try your suggestion to find where the bottleneck is. The reason I did this experiment is to look for potential issues with Ceph. I am a Ph.D. student trying to do some research on it, and I would be happy to hear your suggestions.

Thanks,
Sheng
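
With the 4 KB request size stated above, the throughput figures from the first message translate into operations per second as follows. This is a back-of-the-envelope sketch: the function names are illustrative, and the queue depth of 1 used for the latency estimate is an assumption, since the thread does not state the fio queue depth. Because the test used buffered I/O, the result is an effective per-request time as seen by fio, not a device latency.

```python
def kbps_to_iops(throughput_kb_per_s, request_size_kb=4):
    """Convert throughput in KB/s to requests per second for a fixed request size."""
    return throughput_kb_per_s / request_size_kb


def mean_time_per_request_ms(iops_value, queue_depth=1):
    """Average time per request in ms, assuming `queue_depth` requests in flight."""
    return 1000.0 * queue_depth / iops_value


# Ceph figures in KB/s from the first message of the thread.
ceph_kbps = {"seq-read": 7378, "rand-read": 4740, "seq-write": 790, "rand-write": 1211}

for workload, kbps in ceph_kbps.items():
    ops = kbps_to_iops(kbps)
    print(f"{workload:10s} {ops:8.1f} IOPS, ~{mean_time_per_request_ms(ops):.2f} ms per request")
```

Sequential write at 790 KB/s works out to about 197.5 requests per second, or roughly 5 ms per 4 KB write under the assumed queue depth of 1.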
* RE: some performance issue
From: Chen, Xiaoxi @ 2013-02-04 16:52 UTC
To: sheng qiu, Mark Nelson; +Cc: ceph-devel@vger.kernel.org

I doubt your data is correct, even the ext4 data. Did you use O_DIRECT when running the test? It is unusual to see 2x the random-write IOPS compared with random read.

The CephFS kernel client does not seem stable enough yet; think twice before you use it.

From your previous mail I guess you would like to do some caching or dynamic tiering, introducing SSDs into a DFS for better performance. There are many layers where you can do this kind of caching or migration: you can cache on the client side; or do as Sage said, having a disk pool and an SSD pool and migrating data between them; or you can cache inside the OSD.

We are also interested in similar research, but it is still WIP.
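
The O_DIRECT question above corresponds to fio's `direct` option. As a hedged sketch of how the four workloads could be rerun with and without the page cache, the helper below only builds fio command lines and does not execute them. The helper name and the test file path are hypothetical; the 10 GB size and 4 KB block size follow the thread.

```python
import shlex


def fio_command(rw, filename, direct, size="10g", bs="4k"):
    """Build a fio command line for one workload.

    rw is one of fio's patterns: read, write, randread, randwrite.
    direct=True requests O_DIRECT (fio --direct=1), bypassing the page cache.
    """
    return [
        "fio",
        f"--name={rw}",
        f"--filename={filename}",
        f"--rw={rw}",
        f"--bs={bs}",
        f"--size={size}",
        f"--direct={1 if direct else 0}",
    ]


# Hypothetical path on the mounted file system under test.
target = "/mnt/ceph/fio-test.dat"

for pattern in ("read", "randread", "write", "randwrite"):
    print(shlex.join(fio_command(pattern, target, direct=True)))
```

Running each pattern once with `direct=True` and once with `direct=False` would show how much of the measured throughput comes from the page cache.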
* Re: some performance issue
From: sheng qiu @ 2013-02-04 17:15 UTC
To: Chen, Xiaoxi; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

Hi Xiaoxi,

Thanks for your reply.

On Mon, Feb 4, 2013 at 10:52 AM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> I doubt your data is correct, even the ext4 data, have you use O_DIRECT
> when doing the test? It's unusual to have 2X random write IOPS than
> random read.

I did not use O_DIRECT, so the page cache was used during the test. My guess as to why random write is faster than random read: since the I/O request size is 4 KB, each write request that misses in the page cache allocates a new page and writes the complete 4 KB of dirty data there (there are no partial writes, so there is no need to fetch the missing data from the OSDs). Read requests, on the other hand, have to wait until the data is fetched from the OSDs.

> CephFS kernel client seems not stable enough, think twice before you use it.
> From your previous mail I guess you would like to do some caching or
> dynamic tiering, introducing ssd into DFS for better performance. There
> are a lot of layer you can do such kind of caching or migration, you can
> cache on client side, or do as sage said, having a disk pool and a ssd
> pool then migrate data between them, or you can cache inside OSD.
> We are also interested in similar research. But it's still WIP.

I think Ceph's CRUSH already supports that: you can create multiple rulesets and assign them to HDD pools or SSD pools, so there is not much research work left there. A hybrid drive for an individual OSD is also not new; many research papers have proposed hybrid drive management.

Thanks a lot for your suggestions.

Sheng
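
The page-cache argument above (a full-page write miss needs no fetch, a read miss does) can be expressed as a toy model. This is not kernel or Ceph client code. It is a deliberately simplified sketch for requests that touch a single page, and the function name is made up for illustration.

```python
PAGE_SIZE = 4096  # bytes; matches the 4 KB request size mentioned in the thread


def must_fetch_from_backend(op, offset, length, page_cached):
    """Toy model: does a single-page buffered request wait for data from the OSDs?

    op is "read" or "write"; offset and length are in bytes.
    """
    if page_cached:
        return False  # cache hit: served from memory either way
    if op == "read":
        return True   # read miss: the caller waits for the data to arrive
    # Write miss: overwriting an entire aligned page needs none of the old
    # contents, so the page can be allocated and dirtied immediately.
    covers_whole_page = offset % PAGE_SIZE == 0 and length == PAGE_SIZE
    return not covers_whole_page


print(must_fetch_from_backend("read", 8192, 4096, page_cached=False))
print(must_fetch_from_backend("write", 8192, 4096, page_cached=False))
print(must_fetch_from_backend("write", 8192, 512, page_cached=False))
```

In this model only the aligned 4 KB write completes without touching the backend, which is the asymmetry the message describes for buffered I/O.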
* Re: some performance issue
From: Mark Kampe @ 2013-02-04 17:29 UTC
To: sheng qiu; +Cc: Chen, Xiaoxi, ceph-devel@vger.kernel.org

Writes are intrinsically more expensive (in both the file system and the hardware), but it is not uncommon for individual small random writes to substantially outperform reads, even with O_DIRECT.

If the I/O is not massively parallel, reads are going to be processed one at a time (e.g. ~6 ms seek, ~4 ms rotational latency, and 27 us transfer). Writes, however, are commonly accepted by the drive and then queued, enabling the drive to choose among the competing requests and significantly (e.g. 2-3x) reduce both average seek time and rotational latency.

If the I/O is being buffered, the performance advantage for random writes can be even greater (due to a deeper request queue and potential request aggregation). Isolated random reads (with few cache hits) get a much smaller performance boost (if any) from buffered I/O.

With massively parallel requests, however, the write advantage should evaporate.
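
The example figures in the message above can be worked through numerically. The sketch below uses only the quoted values (~6 ms seek, ~4 ms rotational latency, 27 us transfer, and a 2-3x reduction for queued writes). The function names are illustrative, and applying the reduction factor uniformly to both seek and rotational latency is a simplifying assumption.

```python
def service_time_ms(seek_ms, rotational_ms, transfer_us):
    """Time to service one request on a rotating disk, in milliseconds."""
    return seek_ms + rotational_ms + transfer_us / 1000.0


def serial_iops(service_ms):
    """Requests per second when requests are processed one at a time."""
    return 1000.0 / service_ms


# Isolated random read: full seek and rotational latency on every request.
read_ms = service_time_ms(6.0, 4.0, 27.0)
print(f"read:  {read_ms:.3f} ms per request, {serial_iops(read_ms):.1f} IOPS")

# Queued writes: the drive reorders requests, cutting seek and rotational
# latency by the quoted 2-3x (assumed here to apply equally to both terms).
for factor in (2.0, 3.0):
    write_ms = service_time_ms(6.0 / factor, 4.0 / factor, 27.0)
    print(f"write ({factor:.0f}x): {write_ms:.3f} ms per request, {serial_iops(write_ms):.1f} IOPS")
```

With these figures a serial random read takes about 10 ms (roughly 100 IOPS), while queued writes land at about 200 to 300 IOPS, so a 2-3x write advantage is plausible for non-parallel I/O.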