* Large numbers of OSD per node
From: Andrew Thrift @ 2012-11-05 7:14 UTC
To: ceph-devel@vger.kernel.org

Hi,

We are evaluating CEPH for deployment.

I was wondering if there are any current "best practices" around the number of OSDs per node?

E.g. we are looking at deploying 3 nodes, each with 72x SAS disks and 2x 10 Gigabit Ethernet, bonded.

Would this best be configured as 72 OSDs per node, or would we be better off using RAID 5 to have 18 OSDs per node?

Regards,

Andrew

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Large numbers of OSD per node
From: Wido den Hollander @ 2012-11-05 11:01 UTC
To: Andrew Thrift; +Cc: ceph-devel@vger.kernel.org

Hi,

On 05-11-12 08:14, Andrew Thrift wrote:
> e.g. We are looking at deploying 3 nodes, each with 72x SAS disks, and
> 2x 10gigabit Ethernet bonded.
>
> Would this best be configured as 72 OSD's per node.
>
> Or would we be better to using raid5 to have 18 OSD's per node ?

You should be aware of the large data movement when using 3 nodes.

I myself am a fan of going with a lot of smaller nodes instead of building big nodes.

With 3 such nodes you'd probably be going with 2x replication? Otherwise you can never recover when one of the 3 nodes completely burns down to the ground.

If you have 72 1TB disks in such a node you could in theory be moving 72TB. That would put a lot of stress on the other two nodes, and you would need a lot of memory and CPU power.

You might be better off going for 27 nodes with 8 disks each, or 18 nodes with 12 disks. When a node fails, the recovery will be much easier on your cluster. You can also take out a node for maintenance when needed.

Another thing you should be aware of is process status "D" (uninterruptible sleep). What if a filesystem inside one of your big machines hangs and one of the OSDs hangs in status "D", waiting for I/O which will never come? You'd be forced to reboot that node, and that would again take 72TB of data offline.

I am not aware of anybody using such big nodes in production. It could work, but you will need a lot of memory and a lot of CPU.

The recommendation is 1GB of memory and 1GHz of CPU per OSD, so you'd be looking at at least 72GB of memory and 72GHz of CPU power.

Wido
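As a rough illustration of the rule of thumb quoted in the message above (about 1GB of memory and 1GHz of CPU per OSD), here is a minimal Python sketch. The function name `osd_node_requirements` and its parameters are made up for this example, and the per-OSD figures are only the rule of thumb from this thread, not measured values.

```python
def osd_node_requirements(osds_per_node, gb_ram_per_osd=1.0, ghz_cpu_per_osd=1.0):
    """Return (memory in GB, aggregate CPU in GHz) suggested for one node.

    The defaults are the 1GB / 1GHz per OSD rule of thumb from the thread
    (an assumption, not a benchmark result).
    """
    return osds_per_node * gb_ram_per_osd, osds_per_node * ghz_cpu_per_osd


# Node sizes discussed in the thread: 72 or 18 OSDs per big node,
# or smaller nodes with 12 or 8 disks each.
for osds in (72, 18, 12, 8):
    ram_gb, cpu_ghz = osd_node_requirements(osds)
    print(f"{osds:2d} OSDs/node: at least {ram_gb:.0f} GB RAM, {cpu_ghz:.0f} GHz CPU in total")
```

The CPU figure is an aggregate across all cores in the node, so it is best read as a budget rather than a clock speed.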
* Re: Large numbers of OSD per node
From: Mark Nelson @ 2012-11-05 12:45 UTC
To: Wido den Hollander; +Cc: Andrew Thrift, ceph-devel@vger.kernel.org

On 11/05/2012 05:01 AM, Wido den Hollander wrote:
> I am not aware of anybody using such big nodes in production. It could
> work, but you will need a lot of memory and a lot of CPU.
>
> The recommendation is 1GB/1Ghz per OSD, so you'd be looking at at least
> 72GB of memory and 72Ghz of CPU power.

To echo what Wido is saying here, we've not really extensively tested configurations with nodes that big at Inktank either. The biggest test node we have in-house is a 36-drive SC847a, and that was a pretty recent acquisition. Nodes that large are definitely bigger than what most people are looking at right now.

For a deployment of the size you are talking about, I think you'd probably be better served by nodes with 24 disks or fewer, and picking up more of them. You'll likely have better performance and fewer problems if a node goes down. It is lower density, but I think in this case using up a few extra U will be worth it.

Having said that, my guess is that if you were to use 72-drive nodes, you'd probably be best off doing RAID-5 or RAID-6 and doing something like 12 6-drive OSDs. Be mindful of which drives, expanders, and controllers you pick.

--
Mark Nelson
Performance Engineer
Inktank
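To make the "12 6-drive OSDs" suggestion above concrete, the following Python sketch works out how many OSDs a node would run and how many drives' worth of capacity remain after parity. The helper `raid_layout` is hypothetical; the only facts it relies on are that RAID-5 spends one drive per group on parity and RAID-6 spends two.

```python
def raid_layout(drives_per_node, drives_per_group, parity_drives):
    """Return (OSD count, data drives per OSD) for equal-sized RAID groups."""
    if drives_per_node % drives_per_group != 0:
        raise ValueError("drives do not divide evenly into groups")
    if not 0 <= parity_drives < drives_per_group:
        raise ValueError("parity drives must be fewer than drives per group")
    return drives_per_node // drives_per_group, drives_per_group - parity_drives


# 72 drives per node, grouped six at a time as suggested in the thread.
for name, parity in (("RAID-5", 1), ("RAID-6", 2)):
    osds, data_drives = raid_layout(72, 6, parity)
    print(f"{name}: {osds} OSDs per node, {osds * data_drives} of 72 drives hold data")
```

Fewer, larger OSDs reduce the per-OSD memory and CPU budget on the node, at the cost of some raw capacity and of RAID rebuilds happening underneath Ceph.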
* Re: Large numbers of OSD per node
From: Andrew Thrift @ 2012-11-06 2:05 UTC
To: ceph-devel, mark.nelson, wido

Mark, Wido,

Thank you very much for your informed responses.

What you have mentioned makes a lot of sense.

If we had a single node completely fail, we would have 72TB of data that needed to be replicated to a new OSD. This would take approximately 10.5 hours to complete over 2x bonded 10 Gigabit connections, and would put the other two nodes under significant load while the data is replicated.

We were looking at using CEPH "heads" with SAS enclosures as a lower-cost solution than buying more nodes. I can, however, see the I/O and resiliency benefits of more nodes.

Regards,

Andrew
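The recovery-time estimate in the message above can be checked with a short calculation. This Python sketch is only back-of-the-envelope: `transfer_hours` is a made-up helper, it uses decimal terabytes, and it assumes the bonded pair really delivers 20 Gbit/s, which neither bonding nor the disks are guaranteed to sustain. The `efficiency` parameter is an assumption introduced here; the thread does not say how the 10.5 hours was derived.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes over a link of link_gbps Gbit/s.

    efficiency is the fraction of line rate actually achieved (assumed).
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# 72TB over 2x 10 Gbit/s at full line rate: the theoretical best case.
print(f"line rate:      {transfer_hours(72, 20):.1f} hours")

# About 76% of line rate gives roughly the 10.5 hours quoted in the thread.
print(f"76% efficiency: {transfer_hours(72, 20, efficiency=0.76):.1f} hours")
```

At full line rate the transfer takes 8 hours, so the quoted 10.5 hours corresponds to roughly three quarters of the nominal bandwidth.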
* Re: Large numbers of OSD per node
From: Wido den Hollander @ 2012-11-06 9:10 UTC
To: Andrew Thrift; +Cc: ceph-devel, mark.nelson

On 06-11-12 03:05, Andrew Thrift wrote:
> Mark, Wido,
>
> Thank you very much for your informed responses.

You're welcome!

> What you have mentioned makes a lot of sense.
>
> If we had a single node completely fail, we would have 72TB of data that
> needed to be replicated to a new OSD. This would take approximately
> 10.5 hours to complete over 2x Bonded 10gig connections, and would put
> the other two nodes under significant load while the data is replicated.

You shouldn't only think about a complete-failure scenario. The distributed architecture of Ceph also gives you the freedom to take out a node whenever you want to do maintenance, or when you just don't trust the node and want to investigate.

The scenario is still the same. Use smaller nodes so that taking out one node (for whatever reason) doesn't impact your cluster that much.

Wido
* Re: Large numbers of OSD per node
From: Gandalf Corvotempesta @ 2012-11-06 9:36 UTC
To: Wido den Hollander; +Cc: Andrew Thrift, ceph-devel, mark.nelson

2012/11/6 Wido den Hollander <wido@widodh.nl>:
> You shouldn't only think about a complete failure solution. The distributed
> architecture of Ceph also gives you the freedom to take out a node whenever
> you want to do maintenance or just don't trust the node and you want to
> investigate.
>
> The scenario is still the same. Use smaller nodes so taking out one node
> (for what reason) doesn't impact your cluster that much.

Here:
http://ceph.com/docs/master/install/hardware-recommendations/
it is written that a production cluster has been built with many R515s, each with 12 disks of 3TB. That gives 36TB of storage per node.

Is this configuration considered good? I'm planning to use the same server.
* Re: Large numbers of OSD per node
From: Wido den Hollander @ 2012-11-06 9:46 UTC
To: Gandalf Corvotempesta; +Cc: Andrew Thrift, ceph-devel, mark.nelson

On 06-11-12 10:36, Gandalf Corvotempesta wrote:
> Here:
> http://ceph.com/docs/master/install/hardware-recommendations/
> is wrote that a production cluster has been made with many R515 with
> 12 disks of 3TB each. This give us 36TB of storage.
>
> is this configuration considered good ? I'm planning to have the same server.

It works for them. There is no journaling though, but this setup is only being used for the RADOS Gateway, not for RBD.

You might want to put a couple of SSDs in there to do journaling for you.

But the rule also applies here. If you use 3 of these machines, then when you lose one, you lose 33% of your cluster.

The setup described on that page has 90 nodes, so one failing node is a little over 1% of the cluster.

Wido
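The percentages in the message above follow directly from the node count. A minimal Python sketch, assuming all nodes hold the same amount of data (the function name `failed_fraction` is invented for this example):

```python
def failed_fraction(total_nodes, failed_nodes=1):
    """Fraction of an evenly loaded cluster that goes away with failed_nodes."""
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    return failed_nodes / total_nodes


# Cluster sizes mentioned in the thread.
for nodes in (3, 18, 27, 90):
    print(f"{nodes:2d} nodes: one failure removes {failed_fraction(nodes):.1%} of the cluster")
```

With 3 nodes a single failure removes a third of the cluster; with 90 nodes it removes a little over 1%.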
* Re: Large numbers of OSD per node
From: Gandalf Corvotempesta @ 2012-11-06 10:20 UTC
To: Wido den Hollander; +Cc: Andrew Thrift, ceph-devel, mark.nelson

2012/11/6 Wido den Hollander <wido@widodh.nl>:
> It works for them. There is no journaling though, but this setup is only
> being used for the RADOS Gateway, not for RBD.
>
> You might want to insert a couple of SSDs in there to do journaling for you.

Yes, I'll add 2 SSDs to each server, probably in RAID1, because I'll also install the OS on them.

> But the rule also applies here. When you use 3 of these machines, when
> loosing one, you will lose 33% of your cluster.
>
> The setup described on that page has 90 nodes, so one node failing is a
> little over 1% of the cluster which fails.

Actually, I'll start with 2 or 3 3TB disks per node, with 3 nodes as a testbed and 5 in production.
* Re: Large numbers of OSD per node
From: Gandalf Corvotempesta @ 2012-11-06 10:24 UTC
To: Wido den Hollander; +Cc: Andrew Thrift, ceph-devel, mark.nelson

2012/11/6 Wido den Hollander <wido@widodh.nl>:
> The setup described on that page has 90 nodes, so one node failing is a
> little over 1% of the cluster which fails.

I think I'm missing something. In case of a failure, they will always have to resync 36TB of data, no matter whether they have 90 servers. Each server holds 36TB, so every time they need to resync the whole server.
* Re: Large numbers of OSD per node
From: Stefan Kleijkers @ 2012-11-06 11:05 UTC
To: Gandalf Corvotempesta
Cc: Wido den Hollander, Andrew Thrift, ceph-devel, mark.nelson

On 11/06/2012 11:24 AM, Gandalf Corvotempesta wrote:
> I think i'm missing something.
> In case of a failure, they will always have to resync 36 TB of data,
> no matter if they have 90 servers.
> Each server is 36TB, so every times they need to resync the whole server.

Well, you have to keep in mind that when a node fails, the PGs that resided on that node have to be redistributed over all the other nodes. So you begin moving about 1% of the data between all the remaining nodes/OSDs (from an OSD that holds the remaining replica of the PG to the new OSD that will get a replica). So you move from and to all the remaining OSDs, and that gives you a lot of bandwidth and therefore fast recovery to a consistent state.

Stefan
* Re: Large numbers of OSD per node
From: Gandalf Corvotempesta @ 2012-11-06 11:31 UTC
To: Stefan Kleijkers
Cc: Wido den Hollander, Andrew Thrift, ceph-devel, mark.nelson

2012/11/6 Stefan Kleijkers <stefan@unilogicnetworks.net>:
> Well you have to keep in mind that when a node fails the PG's that resided
> on that node have to be redistributed over all the other nodes. So you begin
> moving about 1% of the data between all the remaining nodes/osds (coming
> from an OSD that has the remaining replica of the pg to the new OSD that
> will get a replica). So you move from and to all the remaining osd's and
> that will give you a lot of bandwidth and therefor fast recorvery to a
> consistent state.

OK, but in this case, 1% is still 36TB of data. There is no difference between 3 nodes with 36TB of data each and 90 nodes with 36TB of data each. In case of a node failure, you always have to move 36TB of data, no matter how many nodes you have.
* Re: Large numbers of OSD per node
From: Stefan Kleijkers @ 2012-11-06 11:51 UTC
To: Gandalf Corvotempesta
Cc: Wido den Hollander, Andrew Thrift, ceph-devel, mark.nelson

On 11/06/2012 12:31 PM, Gandalf Corvotempesta wrote:
> Ok, but in this case, 1% is still 36TB of data.
> There are no difference between 3 nodes with 36TB of data each or 90
> nodes with 36TB of data each.
> In case of a node failure, you always have to move 36TB of data, no
> matter on how many nodes do you have.

True, but it's a huge difference whether you have to redistribute the 36TB between 2 remaining nodes or between 89 remaining nodes. And with so few nodes you will probably hit a couple of other bottlenecks, like CPU power per node, network bandwidth per node, etc. I have learned this the hard way with 3 nodes and 24 disks/OSDs per node.

Stefan
* Re: Large numbers of OSD per node
From: Gandalf Corvotempesta @ 2012-11-06 12:51 UTC
To: Stefan Kleijkers
Cc: Wido den Hollander, Andrew Thrift, ceph-devel, mark.nelson

2012/11/6 Stefan Kleijkers <stefan@unilogicnetworks.net>:
> True, but it's a huge difference if you have to redistribute the 36T between
> 2 remaining nodes or between 89 remaining nodes. And with such a few nodes
> you hit probably a couple of other bottlenecks like CPU power per node,
> networking bandwidth per node, etc... I have noticed this the hard way with
> 3 nodes and 24 disks/osds per node.

OK, now it's clear. In a 90-node cluster, moving 36TB means about 400GB per node, while in a 10-node cluster the same 36TB means about 3.6TB per node.
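The per-node figures in the message above can be reproduced with a small Python sketch. It assumes the failed node's data is spread perfectly evenly over the surviving nodes, which real placement only approximates; `recovery_share_tb` is a name invented for this example. Because it divides by the surviving nodes rather than the total, the results come out slightly higher than the rounded numbers quoted in the thread.

```python
def recovery_share_tb(failed_node_tb, total_nodes):
    """TB each surviving node must absorb when one node fails (even spread assumed)."""
    remaining = total_nodes - 1
    if remaining < 1:
        raise ValueError("need at least one surviving node")
    return failed_node_tb / remaining


# One 36TB node fails in clusters of 3, 10 and 90 nodes.
for nodes in (3, 10, 90):
    share = recovery_share_tb(36, nodes)
    print(f"{nodes:2d} nodes: each of the {nodes - 1} survivors absorbs about {share:.2f} TB")
```

The total amount moved is the same 36TB in every case; what changes is how many disks and network links share the work.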