* SSD journal suggestion
@ 2012-11-07 12:13 Gandalf Corvotempesta
  2012-11-07 12:17 ` Sage Weil
  0 siblings, 1 reply; 31+ messages in thread
From: Gandalf Corvotempesta @ 2012-11-07 12:13 UTC
To: ceph-devel

I'm evaluating some SSDs as journal devices.
The Samsung 840 Pro seems to be the fastest in sequential reads and writes.

Which parameters should I consider for a journal? I think none of the
read benchmarks are relevant, because when the journal is flushed to disk
the bottleneck will always be the SAS/SATA write speed (so the SSD will
never reach its best read performance).
So should I evaluate only the write speed when data is written to the
journal? Sequential or random?

* Re: SSD journal suggestion
From: Sage Weil @ 2012-11-07 12:17 UTC
To: Gandalf Corvotempesta; +Cc: ceph-devel

On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:
> I'm evaluating some SSDs as journal devices.
> The Samsung 840 Pro seems to be the fastest in sequential reads and writes.
>
> Which parameters should I consider for a journal? I think none of the
> read benchmarks are relevant, because when the journal is flushed to disk
> the bottleneck will always be the SAS/SATA write speed (so the SSD will
> never reach its best read performance).
> So should I evaluate only the write speed when data is written to the
> journal? Sequential or random?

Sequential write is the only thing that matters for the OSD journal. I'd
look at both large writes and small writes.

sage
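
Sage's advice (sequential writes, at both large and small sizes) can be tried with a very small benchmark. The following is only a sketch, not how Ceph itself measures or writes its journal: the function name, the choice to `fsync` after every write, and the use of random data (so that compressing controllers cannot inflate the result) are all assumptions made for illustration.

```python
import os
import time


def sequential_write_mb_per_s(path, block_size, total_bytes, sync_each_write=True):
    """Write total_bytes to path in block_size chunks, front to back.

    Returns the achieved throughput in MB/s (1 MB = 10**6 bytes).
    Hypothetical helper: it only approximates a journal-like pattern.
    """
    # Random payload, so a compressing SSD controller cannot shrink it.
    block = os.urandom(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    written = 0
    start = time.perf_counter()
    try:
        while written < total_bytes:
            written += os.write(fd, block)
            if sync_each_write:
                # Assumption: each journal entry must be durable before the next.
                os.fsync(fd)
        os.fsync(fd)
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / elapsed / 1e6
```

Calling it once with a small block size (for example 4096) and once with a large one (for example 4 * 1024 * 1024) against a file on the candidate SSD gives the two figures Sage suggests comparing. A dedicated tool such as fio would give more trustworthy numbers; this only shows the access pattern.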

* Re: SSD journal suggestion
From: Gandalf Corvotempesta @ 2012-11-07 12:28 UTC
To: Sage Weil; +Cc: ceph-devel

2012/11/7 Sage Weil <sage@inktank.com>:
> On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:
>> I'm evaluating some SSD drives as journal.
>> Samsung 840 Pro seems to be the fastest in sequential reads and write.

The 840 Pro seems to reach 485 MB/s in sequential write:
http://www.storagereview.com/samsung_ssd_840_pro_review

* Re: SSD journal suggestion
From: Mark Nelson @ 2012-11-07 15:01 UTC
To: Gandalf Corvotempesta; +Cc: Sage Weil, ceph-devel

On 11/07/2012 06:28 AM, Gandalf Corvotempesta wrote:
> The 840 Pro seems to reach 485MB/s in sequential write:
> http://www.storagereview.com/samsung_ssd_840_pro_review

I'm using Intel 510s in a test node and can do about 450MB/s per drive.
Right now I'm doing 3 journals per SSD, but topping out at about
1.2-1.4GB/s from the client perspective for the node with 15+ drives and
5 SSDs. It's possible newer versions of the code and tuning may
increase that.

TV pointed me at the new Intel DC S3700 which looks like a very
interesting option (the 100GB model for $240).

http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed

Mark

* Re: SSD journal suggestion
From: Atchley, Scott @ 2012-11-07 16:12 UTC
To: Mark Nelson; +Cc: Gandalf Corvotempesta, Sage Weil, ceph-devel@vger.kernel.org

On Nov 7, 2012, at 10:01 AM, Mark Nelson <mark.nelson@inktank.com> wrote:

> I'm using Intel 510s in a test node and can do about 450MB/s per drive.

Is that sequential read or write? Intel lists them at 210-315 MB/s for
sequential write. The 520s are rated at 475-520 MB/s seq. write.

> Right now I'm doing 3 journals per SSD, but topping out at about
> 1.2-1.4GB/s from the client perspective for the node with 15+ drives and
> 5 SSDs. It's possible newer versions of the code and tuning may
> increase that.

What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would
expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G
Ethernet?

Scott

> TV pointed me at the new Intel DC S3700 which looks like a very
> interesting option (the 100GB model for $240).
>
> http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed
>
> Mark
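
Scott's line-rate figure is a plain unit conversion, and it shows why 1.2-1.4 GB/s looked suspicious for a single 10G link. A minimal sketch of that arithmetic follows; the helper names are made up for illustration, and protocol overhead is deliberately ignored, so the link count is a lower bound.

```python
import math


def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a raw link rate in Gb/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0


def min_links_needed(throughput_gbyte_per_s, link_gbit_per_s):
    """Smallest number of links whose raw line rate covers the throughput.

    Ignores Ethernet/IP/TCP and Ceph overhead, so real deployments may
    need more than this.
    """
    per_link = gbit_to_gbyte_per_s(link_gbit_per_s)
    return math.ceil(throughput_gbyte_per_s / per_link)


if __name__ == "__main__":
    print("10GbE line rate:", gbit_to_gbyte_per_s(10), "GB/s")
    for observed in (1.2, 1.4):  # range reported in the thread
        print(observed, "GB/s needs at least",
              min_links_needed(observed, 10), "x 10GbE")
```

The lower end of the reported range would just fit under one 10GbE link at raw line rate, while the upper end cannot, which is why Scott asks about dual 10G.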

* Re: SSD journal suggestion
From: Mark Nelson @ 2012-11-07 16:20 UTC
To: Atchley, Scott; +Cc: Gandalf Corvotempesta, Sage Weil, ceph-devel@vger.kernel.org

On 11/07/2012 10:12 AM, Atchley, Scott wrote:
> On Nov 7, 2012, at 10:01 AM, Mark Nelson <mark.nelson@inktank.com> wrote:
>
>> I'm using Intel 510s in a test node and can do about 450MB/s per drive.
>
> Is that sequential read or write? Intel lists them at 210-315 MB/s for
> sequential write. The 520s are rated at 475-520 MB/s seq. write.

Doh, I wrote that too early in the morning after staying up all night
watching the elections. :) You are correct, it's the 520, not the 510.

>> Right now I'm doing 3 journals per SSD, but topping out at about
>> 1.2-1.4GB/s from the client perspective for the node with 15+ drives and
>> 5 SSDs. It's possible newer versions of the code and tuning may
>> increase that.
>
> What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would
> expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G
> Ethernet?
>
> Scott

This is 8 concurrent instances of rados bench running on localhost.
Ceph is configured with 1x replication. 1.2-1.4GB/s is the aggregate
throughput of all of the rados bench instances.

>> TV pointed me at the new Intel DC S3700 which looks like a very
>> interesting option (the 100GB model for $240).
>>
>> http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed
>>
>> Mark

* Re: SSD journal suggestion
From: Atchley, Scott @ 2012-11-07 16:35 UTC
To: Mark Nelson; +Cc: Gandalf Corvotempesta, Sage Weil, ceph-devel@vger.kernel.org

On Nov 7, 2012, at 11:20 AM, Mark Nelson <mark.nelson@inktank.com> wrote:

>>> Right now I'm doing 3 journals per SSD, but topping out at about
>>> 1.2-1.4GB/s from the client perspective for the node with 15+ drives and
>>> 5 SSDs. It's possible newer versions of the code and tuning may
>>> increase that.
>>
>> What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would
>> expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G
>> Ethernet?
>
> This is 8 concurrent instances of rados bench running on localhost.
> Ceph is configured with 1x replication. 1.2-1.4GB/s is the aggregate
> throughput of all of the rados bench instances.

Ok, all local with no communication. Given this level of local performance,
what does that translate into when talking over the network?

Scott

* Re: SSD journal suggestion
From: Mark Nelson @ 2012-11-07 16:41 UTC
To: Atchley, Scott; +Cc: Gandalf Corvotempesta, Sage Weil, ceph-devel@vger.kernel.org

On 11/07/2012 10:35 AM, Atchley, Scott wrote:
> On Nov 7, 2012, at 11:20 AM, Mark Nelson <mark.nelson@inktank.com> wrote:
>
>> This is 8 concurrent instances of rados bench running on localhost.
>> Ceph is configured with 1x replication. 1.2-1.4GB/s is the aggregate
>> throughput of all of the rados bench instances.
>
> Ok, all local with no communication. Given this level of local performance,
> what does that translate into when talking over the network?
>
> Scott

Well, local, but still over TCP. Right now I'm focusing on pushing the
OSDs/filestores as far as I can, and after that I'm going to set up a
bonded 10GbE network to see what kind of messenger bottlenecks I run
into. Sadly the testing is going slower than I would like.

Mark

* Re: SSD journal suggestion
From: Martin Mailand @ 2012-11-07 21:11 UTC
To: Mark Nelson; +Cc: Atchley, Scott, Gandalf Corvotempesta, Sage Weil, ceph-devel@vger.kernel.org

Hi,

I have 16 SAS disks on an LSI 9266-8i and 4 Intel 520 SSDs on an HBA; the
node has dual 10G Ethernet. The clients are 4 nodes with dual 10GbE; as a
test I run rados bench on each client. The aggregate write speed is around
1.6 GB/s with single replication.

In the first configuration I had the SSDs on the RAID controller as well,
but then I saturated the PCIe 2.0 x8 interface of the RAID controller, so
I now use a second controller for the SSDs.

-martin

On 07.11.2012 17:41, Mark Nelson wrote:
> Well, local, but still over tcp. Right now I'm focusing on pushing the
> osds/filestores as far as I can, and after that I'm going to setup a
> bonded 10GbE network to see what kind of messenger bottlenecks I run
> into. Sadly the testing is going slower than I would like.
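
Martin's PCIe observation can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses the standard PCIe 2.0 figures (5 GT/s per lane, 8b/10b encoding); the assumption that every client byte crosses the controller twice (once to the SSD journal, once to the data disk) when both sit on the same controller is mine, not something stated in the thread, and the function names are hypothetical.

```python
def pcie_payload_mb_per_s(gt_per_s, lanes, encoding_efficiency):
    """Theoretical one-direction bandwidth of a PCIe link in MB/s.

    gt_per_s * encoding_efficiency gives usable Gb/s per lane;
    dividing by 8 converts to GB/s, times 1000 to MB/s.
    Packet/protocol overhead is not included, so real limits are lower.
    """
    return gt_per_s * encoding_efficiency * lanes * 1000 / 8


def controller_write_load_mb_per_s(client_mb_per_s, writes_per_byte=2):
    """Traffic through one controller if journal and data share it.

    Assumption: each client byte is written to the journal SSD and then
    to the data disk, i.e. twice through the same controller.
    """
    return client_mb_per_s * writes_per_byte


if __name__ == "__main__":
    limit = pcie_payload_mb_per_s(5.0, 8, 0.8)    # PCIe 2.0 x8
    load = controller_write_load_mb_per_s(1600)   # 1.6 GB/s from the thread
    print("PCIe 2.0 x8 theoretical limit: %.0f MB/s" % limit)
    print("Estimated controller traffic:  %.0f MB/s (%.0f%% of limit)"
          % (load, 100 * load / limit))
```

Under these assumptions the shared controller would already be carrying a large fraction of the theoretical link rate before any protocol overhead is counted, which is consistent with Martin moving the SSDs to a second controller. It does not prove where his bottleneck actually was.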

* Re: SSD journal suggestion
From: Gandalf Corvotempesta @ 2012-11-07 21:14 UTC
To: martin; +Cc: Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

2012/11/7 Martin Mailand <martin@tuxadero.com>:
> I have 16 SAS disk on a LSI 9266-8i and 4 Intel 520 SSD on a HBA, the node
> has dual 10G Ethernet. The clients are 4 nodes with dual 10GeB, as test I
> use rados bench on each client. The aggregated write speed is around 1,6GB/s
> with single replication.

Just out of curiosity, which switches do you have?

* Re: SSD journal suggestion
From: Martin Mailand @ 2012-11-07 21:35 UTC
To: Gandalf Corvotempesta; +Cc: Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

Hi,

I tested an Arista 7150S-24 and an HP5900, and in a few weeks I will get a
Mellanox MSX1016. At the moment the Arista is my favourite.
For the dual 10GbE NICs I tested the Intel X520-DA2 and the Mellanox
ConnectX-3. My favourite is the Intel X520-DA2.

-martin

On 07.11.2012 22:14, Gandalf Corvotempesta wrote:
> Just for curiosity, which switches do you have?

* Re: SSD journal suggestion
From: Stefan Priebe @ 2012-11-07 21:44 UTC
To: martin; +Cc: Gandalf Corvotempesta, Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

On 07.11.2012 22:35, Martin Mailand wrote:
> Hi,
>
> I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
> Mellanox MSX1016. ATM the Arista is may favourite.
> For the dual 10GeB NICs I tested the Intel X520-DA2 and the Mellanox
> ConnectX-3. My favourite is the Intel X520-DA2.

That's pretty interesting; I'll get the HP5900 and HP5920 in a few weeks.
HP told me the deep packet buffers of the HP5920 will boost performance
and that it should be used for storage-related traffic.

Greets,
Stefan

* Re: SSD journal suggestion
From: Martin Mailand @ 2012-11-07 21:55 UTC
To: Stefan Priebe; +Cc: Gandalf Corvotempesta, Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

Hi Stefan,

deep buffers mean latency spikes; you should go for low switching
latency. The HP5900 has a latency of 1ms, the Arista and Mellanox of 250ns.

You should also think about the price: the HP5900 costs 3 times as much
as the Mellanox.

-martin

On 07.11.2012 22:44, Stefan Priebe wrote:
> That's pretty interesting i'll get the HP5900 and HP5920 in a few weeks.
> HP told me the deep packet buffers of the HP5920 will burst the
> performance and should be used for storage related stuff.
>
> Greets,
> Stefan

* Re: SSD journal suggestion
From: Stefan Priebe @ 2012-11-07 21:59 UTC
To: martin; +Cc: Gandalf Corvotempesta, Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

On 07.11.2012 22:55, Martin Mailand wrote:
> Hi Stefan,
>
> deep buffers means latency spikes, you should go for fast switching
> latency. The HP5900 has a latency of 1ms, the Arista and Mellanox of 250ns.

HP told me they all use the same chips, and that Arista measures latency
with only one port in use, while HP guarantees its latency with all ports
in use. Whether this is correct or just something HP told me, I don't
know. They told me the Arista is slower and that the statistics are not
comparable.

> And I you should think at the price the HP5900 cost 3 times of the
> Mellanox.

I don't know what the Mellanox costs. I get the HP for a really good
price, below 10,000 €.

Greets,
Stefan

* Re: SSD journal suggestion
From: Martin Mailand @ 2012-11-07 22:13 UTC
To: Stefan Priebe; +Cc: Gandalf Corvotempesta, Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

Hi,

I *think* the HP is Broadcom based and the Arista is Fulcrum based, and I
don't know which chips Mellanox is using. Our NOC tested both of them,
and the Arista was the clear winner, at least for our workload.

-martin

On 07.11.2012 22:59, Stefan Priebe wrote:
> HP told me they all use the same ships and Arista measures latency while
> only one port is in use. HP guarentees the latency when all ports are in
> use. If this is correct or just somehing hp told me - i don't know. They
> told me the arista is slower and the statistics are not comporable...

* Re: SSD journal suggestion
From: Gandalf Corvotempesta @ 2012-11-07 22:28 UTC
To: martin; +Cc: Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

2012/11/7 Martin Mailand <martin@tuxadero.com>:
> I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
> Mellanox MSX1016. ATM the Arista is may favourite.

Why not InfiniBand?

* Re: SSD journal suggestion
From: Martin Mailand @ 2012-11-07 22:39 UTC
To: Gandalf Corvotempesta; +Cc: Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

Good question; we probably do not have enough experience with IPoIB.
But it looks good on paper, so it's definitely worth a try.

-martin

On 07.11.2012 23:28, Gandalf Corvotempesta wrote:
> Why not infiniband?

* Re: SSD journal suggestion
From: Gandalf Corvotempesta @ 2012-11-07 22:51 UTC
To: martin; +Cc: Mark Nelson, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

2012/11/7 Martin Mailand <martin@tuxadero.com>:
> But it looks good on paper, so it's definitely a try worth.

It is at least 4 times faster than 10GbE and, AFAIK, should have lower
latency. I'm planning to use InfiniBand as the backend storage network,
used for OSD replication. 2 HBAs for each OSD should give me 80Gbps and
full redundancy.

* Re: SSD journal suggestion
From: Mark Nelson @ 2012-11-07 23:12 UTC
To: Gandalf Corvotempesta; +Cc: martin, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

On 11/07/2012 04:51 PM, Gandalf Corvotempesta wrote:
> 2012/11/7 Martin Mailand <martin@tuxadero.com>:
>> But it looks good on paper, so it's definitely a try worth.
>
> is at least 4x times faster than 10gbe and AFAIK should have a lower latency.
> I'm planning to use infiniband as backend storage network, used for
> OSD replication. 2 HBA for each OSD should give me 80Gbps and full
> redundancy

I haven't done much with IPoIB (just RDMA), but my understanding is that
it tends to top out at like 15Gb/s. Some others on this mailing list can
probably speak more authoritatively. Even with RDMA you are going to top
out at around 3.1-3.2GB/s.

This thread may be helpful/interesting:
http://comments.gmane.org/gmane.linux.drivers.rdma/12279

Mark
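
The gap between the advertised 40 Gb/s per port and the figures Mark quotes is easier to see with the encoding overhead written out. This sketch assumes 4x QDR InfiniBand (10 Gb/s signalling per lane, 8b/10b encoding), which is what "80Gbps from 2 HBAs" implies but is not stated outright in the thread; the helper name is hypothetical.

```python
def ib_data_rate_gbit(lane_signal_gbit, lanes, encoding_efficiency=0.8):
    """Usable data rate of an InfiniBand link in Gb/s.

    The advertised rate counts raw signalling; 8b/10b encoding (assumed
    here, as used by SDR/DDR/QDR) leaves 80% of it for data.
    """
    return lane_signal_gbit * lanes * encoding_efficiency


if __name__ == "__main__":
    signalling = 10 * 4                       # 4x QDR, Gb/s
    data = ib_data_rate_gbit(10, 4)           # Gb/s after encoding
    print("Advertised: %d Gb/s, data rate: %.0f Gb/s = %.1f GB/s"
          % (signalling, data, data / 8))
    # Figures reported in the thread, converted to Gb/s for comparison.
    print("RDMA ceiling quoted by Mark: %.1f-%.1f Gb/s" % (3.1 * 8, 3.2 * 8))
    print("IPoIB ceiling quoted by Mark: ~15 Gb/s")
```

So even before host-side limits are considered, a single "40 Gb/s" port carries at most 32 Gb/s of data, and the RDMA and IPoIB figures quoted above sit below that.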

* Re: SSD journal suggestion
From: Gandalf Corvotempesta @ 2012-11-08 8:22 UTC
To: Mark Nelson; +Cc: martin, Atchley, Scott, Sage Weil, ceph-devel@vger.kernel.org

2012/11/8 Mark Nelson <mark.nelson@inktank.com>:
> I haven't done much with IPoIB (just RDMA), but my understanding is that it
> tends to top out at like 15Gb/s. Some others on this mailing list can
> probably speak more authoritatively. Even with RDMA you are going to top
> out at around 3.1-3.2GB/s.

15Gb/s is still faster than 10GbE.
But this speed limit seems to be kernel-related, so shouldn't it be the
same in a 10GbE environment as well?

* Re: SSD journal suggestion
From: Atchley, Scott @ 2012-11-08 13:55 UTC
To: Gandalf Corvotempesta; +Cc: Mark Nelson, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org

On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:

> 2012/11/8 Mark Nelson <mark.nelson@inktank.com>:
>> I haven't done much with IPoIB (just RDMA), but my understanding is that it
>> tends to top out at like 15Gb/s. Some others on this mailing list can
>> probably speak more authoritatively. Even with RDMA you are going to top
>> out at around 3.1-3.2GB/s.
>
> 15Gb/s is still faster than 10Gbe
> But this speed limit seems to be kernel-related and should be the same
> even in a 10Gbe environment, or not?

We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using
Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running
Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on
whether I use interrupt affinity and process binding.

For our Ceph testing, we will set the affinity of two of the mlx4
interrupt handlers to cores 0 and 1 and we will not use process binding.
For single-stream Netperf, we do use process binding and bind it to the
same core (i.e. 0), and we see ~22 Gb/s. For multiple, concurrent Netperf
runs, we do not use process binding but we still see ~22 Gb/s.

We used all of the Mellanox tuning recommendations for IPoIB available in
their tuning PDF:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

We looked at their interrupt affinity setting scripts and then wrote our
own.

Our testing is with IPoIB in "connected" mode, not "datagram" mode.
Connected mode is less scalable, but currently I only get ~3 Gb/s with
datagram mode. Mellanox claims that we should get identical performance
with both modes and we are looking into it.

We are getting a new test cluster with FDR HCAs and I will look into those
as well.

Scott

* Re: SSD journal suggestion
From: Mark Nelson @ 2012-11-08 14:39 UTC
To: Atchley, Scott; +Cc: Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org

On 11/08/2012 07:55 AM, Atchley, Scott wrote:
> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using
> Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running
> Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on
> whether I use interrupt affinity and process binding.
>
> For our Ceph testing, we will set the affinity of two of the mlx4
> interrupt handlers to cores 0 and 1 and we will not using process binding.
> For single stream Netperf, we do use process binding and bind it to the
> same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf
> runs, we do not use process binding but we still see ~22 Gb/s.

Scott, this is very interesting! Does setting the interrupt affinity
make the biggest difference then when you have concurrent netperf
processes going? For some reason I thought that setting interrupt
affinity wasn't even guaranteed in linux any more, but this is just some
half-remembered recollection from a year or two ago.

> We used all of the Mellanox tuning recommendations for IPoIB available in
> their tuning pdf:
>
> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>
> We looked at their interrupt affinity setting scripts and then wrote our
> own.
>
> Our testing is with IPoIB in "connected" mode, not "datagram" mode.
> Connected mode is less scalable, but currently I only get ~3 Gb/s with
> datagram mode. Mellanox claims that we should get identical performance
> with both modes and we are looking into it.
>
> We are getting a new test cluster with FDR HCAs and I will look into those
> as well.

Nice! At some point I'll probably try to justify getting some FDR cards
in house. I'd definitely like to hear how FDR ends up working for you.

> Scott

Mark

* Re: SSD journal suggestion
From: Atchley, Scott @ 2012-11-08 15:00 UTC
To: Mark Nelson; +Cc: Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org

On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@inktank.com> wrote:

> Scott, this is very interesting! Does setting the interrupt affinity
> make the biggest difference then when you have concurrent netperf
> processes going? For some reason I thought that setting interrupt
> affinity wasn't even guaranteed in linux any more, but this is just some
> half-remembered recollection from a year or two ago.

We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf
with and without affinity:

    Default (irqbalance running)    12.8 Gb/s
    IRQ balance off                 13.0 Gb/s
    Set IRQ affinity to socket 0    17.3 Gb/s    # using the Mellanox script

When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get
~22 Gb/s for a single stream.

>> We are getting a new test cluster with FDR HCAs and I will look into
>> those as well.
>
> Nice! At some point I'll probably try to justify getting some FDR cards
> in house. I'd definitely like to hear how FDR ends up working for you.

I'll post the numbers when I get access after they are set up.

Scott
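
For readers who have not set IRQ affinity by hand: on Linux it is expressed as a hexadecimal CPU bitmask written to `/proc/irq/<N>/smp_affinity`. The sketch below shows how the mask for "cores 0 and 1" is built and how the gains in Scott's table compare. It is not Scott's script (which is not shown in the thread); the helper names are hypothetical, and the mask format shown is the simple one for machines with up to 32 CPUs.

```python
def irq_affinity_mask(cores):
    """Return the hex bitmask string selecting the given CPU cores.

    Bit i of the mask corresponds to core i. Systems with more than 32
    CPUs use comma-separated 32-bit groups, which this sketch ignores.
    """
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core
    return format(mask, "x")


def relative_gain(measured_gbit, baseline_gbit):
    """Throughput of a configuration relative to the baseline."""
    return measured_gbit / baseline_gbit


# Single-stream Netperf results as reported in the thread (Gb/s).
RESULTS_GBIT = [
    ("default (irqbalance running)", 12.8),
    ("irqbalance off", 13.0),
    ("IRQ affinity to socket 0", 17.3),
    ("IRQ affinity cores 0-1 + Netperf bound to core 0", 22.0),
]

if __name__ == "__main__":
    print("mask for cores 0 and 1:", irq_affinity_mask([0, 1]))
    baseline = RESULTS_GBIT[0][1]
    for name, gbit in RESULTS_GBIT:
        print("%-50s %5.1f Gb/s  x%.2f" % (name, gbit, relative_gain(gbit, baseline)))
```

The resulting mask would be written, as root, to the `smp_affinity` file of each relevant mlx4 interrupt (the IRQ numbers can be found in `/proc/interrupts`), with irqbalance stopped so that it does not overwrite the setting. Process binding is a separate step, typically done with `taskset` or `numactl`.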
* Re: SSD journal suggestion 2012-11-08 15:00 ` Atchley, Scott @ 2012-11-08 15:02 ` Atchley, Scott 2012-11-08 16:19 ` Andrey Korolyov 2012-11-08 20:12 ` Joseph Glanville 1 sibling, 1 reply; 31+ messages in thread From: Atchley, Scott @ 2012-11-08 15:02 UTC (permalink / raw) To: Mark Nelson Cc: Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org On Nov 8, 2012, at 10:00 AM, Scott Atchley <atchleyes@ornl.gov> wrote: > On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@inktank.com> wrote: > >> On 11/08/2012 07:55 AM, Atchley, Scott wrote: >>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote: >>> >>>> 2012/11/8 Mark Nelson <mark.nelson@inktank.com>: >>>>> I haven't done much with IPoIB (just RDMA), but my understanding is that it >>>>> tends to top out at like 15Gb/s. Some others on this mailing list can >>>>> probably speak more authoritatively. Even with RDMA you are going to top >>>>> out at around 3.1-3.2GB/s. >>>> >>>> 15Gb/s is still faster than 10Gbe >>>> But this speed limit seems to be kernel-related and should be the same >>>> even in a 10Gbe environment, or not? >>> >>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. >>> >>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not using process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s. >> >> Scott, this is very interesting! Does setting the interrupt affinity >> make the biggest difference then when you have concurrent netperf >> processes going? 
For some reason I thought that setting interrupt
>> affinity wasn't even guaranteed in linux any more, but this is just some
>> half-remembered recollection from a year or two ago.
>
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
>
> Default (irqbalance running)    12.8 Gb/s
> IRQ balance off                 13.0 Gb/s
> Set IRQ affinity to socket 0    17.3 Gb/s  # using the Mellanox script
>
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.

Note, I used hwloc to determine which socket was closer to the mlx4 device on our dual socket machines. On these nodes, hwloc reported that both sockets were equally close, but a colleague has machines where one socket is closer than the other. In that case, bind to the closer socket (or to cores within the closer socket).

>
>>> We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf:
>>>
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>
>>> We looked at their interrupt affinity setting scripts and then wrote our own.
>>>
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it.
>>>
>>> We are getting a new test cluster with FDR HCAs and I will look into those as well.
>>
>> Nice! At some point I'll probably try to justify getting some FDR cards
>> in house. I'd definitely like to hear how FDR ends up working for you.
>
> I'll post the numbers when I get access after they are set up.
>
> Scott
>

^ permalink raw reply [flat|nested] 31+ messages in thread
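The affinity and binding steps described in this message can be sketched roughly as follows. This is not Scott's script, which the thread does not show. It only assumes the standard Linux interface in which an IRQ's allowed CPUs are written as a hexadecimal bitmask to /proc/irq/<N>/smp_affinity, plus Python's `os.sched_setaffinity` for process binding. The helper names are hypothetical, and the IRQ numbers of the mlx4 handlers would have to be looked up in /proc/interrupts on the actual machine.

```python
import os


def cpu_mask(cores):
    """Hex bitmask with one bit set per core index, as smp_affinity expects.

    Assumption: core indices stay below 32, so the mask fits in a single
    32-bit group and needs no comma-separated chunks.
    """
    mask = 0
    for core in cores:
        if not 0 <= core < 32:
            raise ValueError("this sketch only handles cores 0-31")
        mask |= 1 << core
    return format(mask, "x")


def irq_affinity_path(irq):
    """Path of the per-IRQ affinity file for a given IRQ number."""
    return f"/proc/irq/{irq}/smp_affinity"


def set_irq_affinity(irq, cores):
    """Write the mask for `cores` to the IRQ's affinity file (needs root)."""
    with open(irq_affinity_path(irq), "w") as f:
        f.write(cpu_mask(cores))


def bind_self_to(cores):
    """Pin the calling process to `cores` (Linux only)."""
    os.sched_setaffinity(0, set(cores))


if __name__ == "__main__":
    # Mask for cores 0 and 1, the cores named in the message.
    print(cpu_mask([0, 1]))  # prints "3"
```

For a benchmark such as Netperf, the same binding is usually applied from the shell, for example with `taskset -c 0`, rather than from inside the program. irqbalance may rewrite the masks if it is left running.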
* Re: SSD journal suggestion 2012-11-08 15:02 ` Atchley, Scott @ 2012-11-08 16:19 ` Andrey Korolyov 2012-11-08 18:03 ` Atchley, Scott 0 siblings, 1 reply; 31+ messages in thread From: Andrey Korolyov @ 2012-11-08 16:19 UTC (permalink / raw) To: Atchley, Scott Cc: Mark Nelson, Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott <atchleyes@ornl.gov> wrote: > On Nov 8, 2012, at 10:00 AM, Scott Atchley <atchleyes@ornl.gov> wrote: > >> On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@inktank.com> wrote: >> >>> On 11/08/2012 07:55 AM, Atchley, Scott wrote: >>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote: >>>> >>>>> 2012/11/8 Mark Nelson <mark.nelson@inktank.com>: >>>>>> I haven't done much with IPoIB (just RDMA), but my understanding is that it >>>>>> tends to top out at like 15Gb/s. Some others on this mailing list can >>>>>> probably speak more authoritatively. Even with RDMA you are going to top >>>>>> out at around 3.1-3.2GB/s. >>>>> >>>>> 15Gb/s is still faster than 10Gbe >>>>> But this speed limit seems to be kernel-related and should be the same >>>>> even in a 10Gbe environment, or not? >>>> >>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. >>>> >>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not using process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s. >>> >>> Scott, this is very interesting! 
Does setting the interrupt affinity
>>> make the biggest difference then when you have concurrent netperf
>>> processes going? For some reason I thought that setting interrupt
>>> affinity wasn't even guaranteed in linux any more, but this is just some
>>> half-remembered recollection from a year or two ago.
>>
>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
>>
>> Default (irqbalance running)    12.8 Gb/s
>> IRQ balance off                 13.0 Gb/s
>> Set IRQ affinity to socket 0    17.3 Gb/s  # using the Mellanox script
>>
>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
>

Did you try Mellanox-baked modules for 2.6.32 before that?

> Note, I used hwloc to determine which socket was closer to the mlx4 device on our dual socket machines. On these nodes, hwloc reported that both sockets were equally close, but a colleague has machines where one socket is closer than the other. In that case, bind to the closer socket (or to cores within the closer socket).
>
>>
>>>> We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf:
>>>>
>>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>>
>>>> We looked at their interrupt affinity setting scripts and then wrote our own.
>>>>
>>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it.
>>>>
>>>> We are getting a new test cluster with FDR HCAs and I will look into those as well.
>>>
>>> Nice! At some point I'll probably try to justify getting some FDR cards
>>> in house. I'd definitely like to hear how FDR ends up working for you.
>>
>> I'll post the numbers when I get access after they are set up.
>>
>> Scott
>>

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SSD journal suggestion 2012-11-08 16:19 ` Andrey Korolyov @ 2012-11-08 18:03 ` Atchley, Scott 0 siblings, 0 replies; 31+ messages in thread From: Atchley, Scott @ 2012-11-08 18:03 UTC (permalink / raw) To: Andrey Korolyov Cc: Mark Nelson, Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org On Nov 8, 2012, at 11:19 AM, Andrey Korolyov <andrey@xdel.ru> wrote: > On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott <atchleyes@ornl.gov> wrote: >> On Nov 8, 2012, at 10:00 AM, Scott Atchley <atchleyes@ornl.gov> wrote: >> >>> On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@inktank.com> wrote: >>> >>>> On 11/08/2012 07:55 AM, Atchley, Scott wrote: >>>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote: >>>>> >>>>>> 2012/11/8 Mark Nelson <mark.nelson@inktank.com>: >>>>>>> I haven't done much with IPoIB (just RDMA), but my understanding is that it >>>>>>> tends to top out at like 15Gb/s. Some others on this mailing list can >>>>>>> probably speak more authoritatively. Even with RDMA you are going to top >>>>>>> out at around 3.1-3.2GB/s. >>>>>> >>>>>> 15Gb/s is still faster than 10Gbe >>>>>> But this speed limit seems to be kernel-related and should be the same >>>>>> even in a 10Gbe environment, or not? >>>>> >>>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. >>>>> >>>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not using process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s. >>>> >>>> Scott, this is very interesting! 
Does setting the interrupt affinity
>>>> make the biggest difference then when you have concurrent netperf
>>>> processes going? For some reason I thought that setting interrupt
>>>> affinity wasn't even guaranteed in linux any more, but this is just some
>>>> half-remembered recollection from a year or two ago.
>>>
>>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
>>>
>>> Default (irqbalance running)    12.8 Gb/s
>>> IRQ balance off                 13.0 Gb/s
>>> Set IRQ affinity to socket 0    17.3 Gb/s  # using the Mellanox script
>>>
>>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
>>
>
> Did you try Mellanox-baked modules for 2.6.32 before that?

That came with RHEL6? No.

Scott

>
>> Note, I used hwloc to determine which socket was closer to the mlx4 device on our dual socket machines. On these nodes, hwloc reported that both sockets were equally close, but a colleague has machines where one socket is closer than the other. In that case, bind to the closer socket (or to cores within the closer socket).
>>
>>>
>>>>> We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf:
>>>>>
>>>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>>>
>>>>> We looked at their interrupt affinity setting scripts and then wrote our own.
>>>>>
>>>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it.
>>>>>
>>>>> We are getting a new test cluster with FDR HCAs and I will look into those as well.
>>>>
>>>> Nice! At some point I'll probably try to justify getting some FDR cards
>>>> in house. I'd definitely like to hear how FDR ends up working for you.
>>>
>>> I'll post the numbers when I get access after they are set up.
>>>
>>> Scott
>>>
>>

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SSD journal suggestion 2012-11-08 15:00 ` Atchley, Scott 2012-11-08 15:02 ` Atchley, Scott @ 2012-11-08 20:12 ` Joseph Glanville 2012-11-08 21:21 ` SSD journal suggestion / rsockets Dieter Kasper 1 sibling, 1 reply; 31+ messages in thread From: Joseph Glanville @ 2012-11-08 20:12 UTC (permalink / raw) To: Atchley, Scott Cc: Mark Nelson, Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org On 9 November 2012 02:00, Atchley, Scott <atchleyes@ornl.gov> wrote: > On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@inktank.com> wrote: > >> On 11/08/2012 07:55 AM, Atchley, Scott wrote: >>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote: >>> >>>> 2012/11/8 Mark Nelson <mark.nelson@inktank.com>: >>>>> I haven't done much with IPoIB (just RDMA), but my understanding is that it >>>>> tends to top out at like 15Gb/s. Some others on this mailing list can >>>>> probably speak more authoritatively. Even with RDMA you are going to top >>>>> out at around 3.1-3.2GB/s. >>>> >>>> 15Gb/s is still faster than 10Gbe >>>> But this speed limit seems to be kernel-related and should be the same >>>> even in a 10Gbe environment, or not? >>> >>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. >>> >>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not using process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s. >> >> Scott, this is very interesting! 
Does setting the interrupt affinity
>> make the biggest difference then when you have concurrent netperf
>> processes going? For some reason I thought that setting interrupt
>> affinity wasn't even guaranteed in linux any more, but this is just some
>> half-remembered recollection from a year or two ago.
>
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
>
> Default (irqbalance running)    12.8 Gb/s
> IRQ balance off                 13.0 Gb/s
> Set IRQ affinity to socket 0    17.3 Gb/s  # using the Mellanox script
>
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
>
>>> We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf:
>>>
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>
>>> We looked at their interrupt affinity setting scripts and then wrote our own.
>>>
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it.
>>>
>>> We are getting a new test cluster with FDR HCAs and I will look into those as well.
>>
>> Nice! At some point I'll probably try to justify getting some FDR cards
>> in house. I'd definitely like to hear how FDR ends up working for you.
>
> I'll post the numbers when I get access after they are set up.
>
> Scott
>

If you are running Ceph purely in userspace you could try using rsockets.
rsockets is a pure userspace implementation of sockets over RDMA. It
has much, much lower latency and close to native throughput.
My guess is rsockets will probably work perfectly and should give you
95% of theoretical max performance.

I would like to see a somewhat native implementation of RDMA in Ceph one day.
I was doing some preliminary work on it 1.5 years ago when Ceph was
first gaining traction, but we didn't end up putting our focus on Ceph
and as such I never got anywhere with it.
In theory one only needs to use RDMA for the fast path to gain a lot of
benefit. This can be done even in the RBD kernel module with the
RDMA-CM, which will interact nicely across kernelspace and userspace
(they actually share the same API, thankfully).

Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SSD journal suggestion / rsockets 2012-11-08 20:12 ` Joseph Glanville @ 2012-11-08 21:21 ` Dieter Kasper 2012-11-08 22:00 ` Joseph Glanville 0 siblings, 1 reply; 31+ messages in thread From: Dieter Kasper @ 2012-11-08 21:21 UTC (permalink / raw) To: Joseph Glanville Cc: Atchley, Scott, Mark Nelson, Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org, Andreas Bluemle, Dieter Kasper (KD) Joseph, I've downloaded and read the presentation from 'Sean Hefty / Intel Corporation' about rsockets, which sounds very promising to me. Can you please teach me how to get access to the rsockets source ? Thanks, -Dieter On Thu, Nov 08, 2012 at 09:12:45PM +0100, Joseph Glanville wrote: > On 9 November 2012 02:00, Atchley, Scott <atchleyes@ornl.gov> wrote: > > On Nov 8, 2012, at 9:39 AM, Mark Nelson <mark.nelson@inktank.com> wrote: > > > >> On 11/08/2012 07:55 AM, Atchley, Scott wrote: > >>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote: > >>> > >>>> 2012/11/8 Mark Nelson <mark.nelson@inktank.com>: > >>>>> I haven't done much with IPoIB (just RDMA), but my understanding is that it > >>>>> tends to top out at like 15Gb/s. Some others on this mailing list can > >>>>> probably speak more authoritatively. Even with RDMA you are going to top > >>>>> out at around 3.1-3.2GB/s. > >>>> > >>>> 15Gb/s is still faster than 10Gbe > >>>> But this speed limit seems to be kernel-related and should be the same > >>>> even in a 10Gbe environment, or not? > >>> > >>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. > >>> > >>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not using process binding. 
For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s.
> >>
> >> Scott, this is very interesting! Does setting the interrupt affinity
> >> make the biggest difference then when you have concurrent netperf
> >> processes going? For some reason I thought that setting interrupt
> >> affinity wasn't even guaranteed in linux any more, but this is just some
> >> half-remembered recollection from a year or two ago.
> >
> > We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:
> >
> > Default (irqbalance running)    12.8 Gb/s
> > IRQ balance off                 13.0 Gb/s
> > Set IRQ affinity to socket 0    17.3 Gb/s  # using the Mellanox script
> >
> > When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream.
> >
> >>> We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf:
> >>>
> >>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
> >>>
> >>> We looked at their interrupt affinity setting scripts and then wrote our own.
> >>>
> >>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it.
> >>>
> >>> We are getting a new test cluster with FDR HCAs and I will look into those as well.
> >>
> >> Nice! At some point I'll probably try to justify getting some FDR cards
> >> in house. I'd definitely like to hear how FDR ends up working for you.
> >
> > I'll post the numbers when I get access after they are set up.
> >
> > Scott
> >
>
> If you are running Ceph purely in userspace you could try using rsockets.
> rsockets is a pure userspace implementation of sockets over RDMA. It
> has much, much lower latency and close to native throughput.
> My guess is rsockets will probably work perfectly and should give you
> 95% of theoretical max performance.
>
> I would like to see a somewhat native implementation of RDMA in Ceph one day.
> I was doing some preliminary work on it 1.5 years ago when Ceph was
> first gaining traction, but we didn't end up putting our focus on Ceph
> and as such I never got anywhere with it.
> In theory one only needs to use RDMA for the fast path to gain a lot of
> benefit. This can be done even in the RBD kernel module with the
> RDMA-CM, which will interact nicely across kernelspace and userspace
> (they actually share the same API, thankfully).
>
> Joseph.
>
> --
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SSD journal suggestion / rsockets 2012-11-08 21:21 ` SSD journal suggestion / rsockets Dieter Kasper @ 2012-11-08 22:00 ` Joseph Glanville 2012-11-09 14:43 ` Atchley, Scott 0 siblings, 1 reply; 31+ messages in thread From: Joseph Glanville @ 2012-11-08 22:00 UTC (permalink / raw) To: Dieter Kasper Cc: Atchley, Scott, Mark Nelson, Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org, Andreas Bluemle On 9 November 2012 08:21, Dieter Kasper <d.kasper@kabelmail.de> wrote: > Joseph, > > I've downloaded and read the presentation from 'Sean Hefty / Intel Corporation' > about rsockets, which sounds very promising to me. > Can you please teach me how to get access to the rsockets source ? > > Thanks, > -Dieter > > rsockets is distributed as part of librdmacm. You can clone the git repository here: git://beany.openfabrics.org/~shefty/librdmacm.git I recommend using the latest master as it features much better support for forking. Joseph. -- CTO | Orion Virtualisation Solutions | www.orionvm.com.au Phone: 1300 56 99 52 | Mobile: 0428 754 846 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SSD journal suggestion / rsockets 2012-11-08 22:00 ` Joseph Glanville @ 2012-11-09 14:43 ` Atchley, Scott 2012-11-09 23:41 ` Joseph Glanville 0 siblings, 1 reply; 31+ messages in thread From: Atchley, Scott @ 2012-11-09 14:43 UTC (permalink / raw) To: Joseph Glanville Cc: Dieter Kasper, Mark Nelson, Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org, Andreas Bluemle On Nov 8, 2012, at 5:00 PM, Joseph Glanville <joseph.glanville@orionvm.com.au> wrote: > On 9 November 2012 08:21, Dieter Kasper <d.kasper@kabelmail.de> wrote: >> Joseph, >> >> I've downloaded and read the presentation from 'Sean Hefty / Intel Corporation' >> about rsockets, which sounds very promising to me. >> Can you please teach me how to get access to the rsockets source ? >> >> Thanks, >> -Dieter >> >> > > rsockets is distributed as part of librdmacm. You can clone the git > repository here: > git://beany.openfabrics.org/~shefty/librdmacm.git > > I recommend using the latest master as it features much better support > for forking. I would be interested in hearing about how it works at scale. I do not know if Sean uses dedicated send and receive buffers per connection or a shared receive queue or shared send queue. Scaling might be an issue or it might not. Scott ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: SSD journal suggestion / rsockets
  2012-11-09 14:43 ` Atchley, Scott
@ 2012-11-09 23:41 ` Joseph Glanville
  0 siblings, 0 replies; 31+ messages in thread
From: Joseph Glanville @ 2012-11-09 23:41 UTC (permalink / raw)
  To: Atchley, Scott
  Cc: Dieter Kasper, Mark Nelson, Gandalf Corvotempesta, martin@tuxadero.com, Sage Weil, ceph-devel@vger.kernel.org, Andreas Bluemle

On 10 November 2012 01:43, Atchley, Scott <atchleyes@ornl.gov> wrote:
> On Nov 8, 2012, at 5:00 PM, Joseph Glanville <joseph.glanville@orionvm.com.au> wrote:
>
>> On 9 November 2012 08:21, Dieter Kasper <d.kasper@kabelmail.de> wrote:
>>> Joseph,
>>>
>>> I've downloaded and read the presentation from 'Sean Hefty / Intel Corporation'
>>> about rsockets, which sounds very promising to me.
>>> Can you please teach me how to get access to the rsockets source ?
>>>
>>> Thanks,
>>> -Dieter
>>>
>>>
>>
>> rsockets is distributed as part of librdmacm. You can clone the git
>> repository here:
>> git://beany.openfabrics.org/~shefty/librdmacm.git
>>
>> I recommend using the latest master as it features much better support
>> for forking.
>
> I would be interested in hearing about how it works at scale. I do not know if Sean uses dedicated send and receive buffers per connection or a shared receive queue or shared send queue. Scaling might be an issue or it might not.
>
> Scott

Hi Scott,

It uses RC QPs, so its scalability is as good as any other large scale app with similar HCA resource requirements.

As an aside: as I understand it, Mellanox is working on a new piece of hardware called Connect-IB (somewhat of a successor to ConnectX-3) that will use a temporary/virtual resource mapping for RC QPs that should make this a non-issue. Press release is here:
http://www.mellanox.com/content/pages.php?pg=press_release_item&rec_id=814

The HCA is just in general much beefier in terms of available resources for QPs and MRs, so it's a big boon for big clusters that need to use all-to-all RC QPs.
That being said, I don't have a cluster with enough nodes for it to make a difference. :(

Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2012-11-09 23:41 UTC | newest]

Thread overview: 31+ messages
2012-11-07 12:13 SSD journal suggestion Gandalf Corvotempesta
2012-11-07 12:17 ` Sage Weil
2012-11-07 12:28 ` Gandalf Corvotempesta
2012-11-07 15:01 ` Mark Nelson
2012-11-07 16:12 ` Atchley, Scott
2012-11-07 16:20 ` Mark Nelson
2012-11-07 16:35 ` Atchley, Scott
2012-11-07 16:41 ` Mark Nelson
2012-11-07 21:11 ` Martin Mailand
2012-11-07 21:14 ` Gandalf Corvotempesta
2012-11-07 21:35 ` Martin Mailand
2012-11-07 21:44 ` Stefan Priebe
2012-11-07 21:55 ` Martin Mailand
2012-11-07 21:59 ` Stefan Priebe
2012-11-07 22:13 ` Martin Mailand
2012-11-07 22:28 ` Gandalf Corvotempesta
2012-11-07 22:39 ` Martin Mailand
2012-11-07 22:51 ` Gandalf Corvotempesta
2012-11-07 23:12 ` Mark Nelson
2012-11-08 8:22 ` Gandalf Corvotempesta
2012-11-08 13:55 ` Atchley, Scott
2012-11-08 14:39 ` Mark Nelson
2012-11-08 15:00 ` Atchley, Scott
2012-11-08 15:02 ` Atchley, Scott
2012-11-08 16:19 ` Andrey Korolyov
2012-11-08 18:03 ` Atchley, Scott
2012-11-08 20:12 ` Joseph Glanville
2012-11-08 21:21 ` SSD journal suggestion / rsockets Dieter Kasper
2012-11-08 22:00 ` Joseph Glanville
2012-11-09 14:43 ` Atchley, Scott
2012-11-09 23:41 ` Joseph Glanville