From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Peterson
Date: Tue, 21 Oct 2014 08:30:22 -0400 (EDT)
Subject: [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps
In-Reply-To: <544627AB.7070203@redhat.com>
References: <544627AB.7070203@redhat.com>
Message-ID: <689783740.8277121.1413894622053.JavaMail.zimbra@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

----- Original Message -----
> I assume that you are trying to get the number of nodes here? I'm not
> sure that this is a good way to do that. I would expect that with older
> userspace, num_slots might indeed be 0 so that is something that needs
> to be checked. Also, I suspect that you want to know how many nodes
> could be in the cluster, rather than how many there are now, otherwise
> there will be odd results when mounting the cluster.
>
> Counting the number of journals would be simpler I think, and less
> likely to give odd results.
(snip)

My original version used the number of journals, which is fairly easy.
The problem is, customers often allocate extra journals to their file
system, anticipating that they will add more nodes in the future.
Case in point: our own performance group has a four-node cluster, but
allocated 5 journals in mkfs. Doing so tends to leave large gaps that
will never be used until space gets low, and then it's chaos, with all
the nodes trying to use those shunned rgrps at the same time. I know
people don't need to do that anymore, and it's a carry-over from the
GFS1 days, but people still do it.

I don't know of a better way to determine the number of nodes. The DLM
would know, but it doesn't share that information in any way other than
through the recovery code that I'm currently using with this patch.
I'm open to suggestions if there's a better way.

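To make the gap problem above concrete, here is a minimal sketch. It
assumes that rgrp preferences are handed out by a simple modulo on the
rgrp index. That rule, and the names preferred_slot() and
count_unclaimed(), are hypothetical illustrations and are not taken
from the patch.

```c
#include <assert.h>

/*
 * Hypothetical rule: rgrp number rgrp_index is "preferred" by the
 * slot (rgrp_index % divisor), where divisor is either the journal
 * count or the node count.
 */
static unsigned int preferred_slot(unsigned int rgrp_index,
				   unsigned int divisor)
{
	assert(divisor > 0);
	return rgrp_index % divisor;
}

/*
 * Count the rgrps whose preferred slot is not held by any mounted
 * node, assuming the mounted nodes occupy slots 0..mounted_nodes-1.
 */
static unsigned int count_unclaimed(unsigned int num_rgrps,
				    unsigned int divisor,
				    unsigned int mounted_nodes)
{
	unsigned int i, unclaimed = 0;

	for (i = 0; i < num_rgrps; i++)
		if (preferred_slot(i, divisor) >= mounted_nodes)
			unclaimed++;
	return unclaimed;
}
```

With 100 rgrps and four mounted nodes, count_unclaimed(100, 5, 4)
returns 20 when dividing by the five journals, and
count_unclaimed(100, 4, 4) returns 0 when dividing by the four nodes.
Those 20 rgrps are the ones every node avoids until space gets low.
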
> This existing gfs2_rgrp_congested() function should be giving the answer
> as to which rgrp should be preferred, so the question is whether that is
> giving the wrong answer for some reason? I think that needs to be looked
> into and fixed if required,

The trouble is this:

The gfs2_rgrp_congested() function tells you if the rgrp is congested at
any given moment in time, and that's highly variable. What tends to
happen is that all the nodes create a bunch of files in a haphazard
fashion, as part of initialization. At the time, each node (accurately)
sees that there is _currently_ no congestion, so they all decide to use
rgrp X. They all make big multi-block reservations in rgrp X. Then they
all proceed to fight over who has the lock for rgrp X.

There are two reasons for this: (a) when the initial files are set up,
there are too few samples to get any degree of accuracy with regard to
congestion, and (b) there really ISN'T any contention during setup
because no one has begun to do any serious writing: there's a
trickle-in effect.

The problem is that once you've chosen a rgrp, you tend to stick with
it, due to reservations and due to the way "goal blocks" work, both of
which preempt searching for a different rgrp.

Ordinarily, you would think the problem would get better (and therefore
faster) with time because there are more samples, and better
information regarding which rgrps really are congested, but in actual
practice, it doesn't work like that. All the nodes continue to fight
over the same rgrps. I suspect this is because in many use cases,
workloads are evenly distributed to the worker nodes, so they all go
through phases of (1) setup, (2) analysis of data, (3) writing, and
they often hit the same phases at roughly the same times (because of
the even distribution of the workload).

Experience has shown (both in GFS1 from prior years and GFS2) that
letting each node pick a unique subset of rgrps results in the least
amount of contention.

Regards,

Bob Peterson
Red Hat File Systems
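
The difference between the two selection strategies described above can
be sketched as follows. This is a simplified model under stated
assumptions, not the GFS2 allocator: the congested[] array stands in
for whatever a congestion check would report at that instant, the
search starts at index 0 rather than at a goal block, and both function
names and the modulo preference rule are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Congestion-only choice: take the first rgrp that does not look
 * congested right now. Returns -1 if every rgrp looks congested.
 */
static int pick_by_congestion(const int *congested, size_t num_rgrps)
{
	size_t i;

	for (i = 0; i < num_rgrps; i++)
		if (!congested[i])
			return (int)i;
	return -1;
}

/*
 * Preference-first choice: try the rgrps this node prefers
 * (hypothetical rule: index % num_nodes == node), and fall back to
 * the congestion-only choice if none of them is usable.
 */
static int pick_with_preference(const int *congested, size_t num_rgrps,
				unsigned int node, unsigned int num_nodes)
{
	size_t i;

	assert(num_nodes > 0);
	for (i = 0; i < num_rgrps; i++)
		if (i % num_nodes == node && !congested[i])
			return (int)i;
	return pick_by_congestion(congested, num_rgrps);
}
```

During setup nothing looks congested, so pick_by_congestion() gives
every node the same answer (rgrp 0 in this model), while
pick_with_preference() gives nodes 0 through 3 of a four-node cluster
rgrps 0, 1, 2 and 3 respectively.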