* Cleancache and shared filesystems
From: Steven Whitehouse @ 2011-05-27 13:51 UTC
To: Dan Magenheimer; +Cc: linux-kernel

Hi,

I'm trying to figure out what I would need to do in order to get GFS2 to
work with cleancache. Looking at the OCFS2 implementation leaves me with
some questions about how this is supposed to work. The docs say that the
cleancache_init_shared_fs() function is supposed to take a 128-bit UUID
plus the sb.

In OCFS2 it is passed a pointer to a 32-bit little-endian quantity as
the UUID:

	__le32 uuid_net_key;

	...

	memcpy(&uuid_net_key, di->id2.i_super.s_uuid, sizeof(uuid_net_key));

	...

	cleancache_init_shared_fs((char *)&uuid_net_key, sb);

and in the Xen backend driver this then appears to be dereferenced as if
it were two 64-bit values, which doesn't look right to me.

Also, since the sb has a UUID field in it anyway, is there some reason
why that cannot be used directly rather than passing the uuid as a
separate variable?

Steve.
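For reference, a minimal sketch of what passing the full on-disk UUID might
look like in ocfs2, assuming the 2.6.39-era prototype
cleancache_init_shared_fs(char *uuid, struct super_block *sb) and the 16-byte
s_uuid field of the ocfs2 on-disk superblock; this is illustrative only, not a
patch from this thread:

	/* Pass all 128 bits of the on-disk UUID rather than a truncated
	 * 32-bit copy of it. */
	cleancache_init_shared_fs((char *)di->id2.i_super.s_uuid, sb);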
* RE: Cleancache and shared filesystems
From: Dan Magenheimer @ 2011-05-27 15:31 UTC
To: Steven Whitehouse; +Cc: linux-kernel, sunil.mushran

> From: Steven Whitehouse [mailto:swhiteho@redhat.com]
> Sent: Friday, May 27, 2011 7:52 AM
> Cc: linux-kernel@vger.kernel.org
> Subject: Cleancache and shared filesystems
>
> Hi,
>
> I'm trying to figure out what I would need to do in order to get GFS2
> to work with cleancache. Looking at the OCFS2 implementation leaves me
> with some questions about how this is supposed to work. The docs say
> that the cleancache_init_shared_fs() function is supposed to take a
> 128-bit UUID plus the sb.
>
> In OCFS2 it is passed a pointer to a 32-bit little-endian quantity as
> the UUID:
>
> 	__le32 uuid_net_key;
>
> 	...
>
> 	memcpy(&uuid_net_key, di->id2.i_super.s_uuid, sizeof(uuid_net_key));
>
> 	...
>
> 	cleancache_init_shared_fs((char *)&uuid_net_key, sb);
>
> and in the Xen backend driver this then appears to be dereferenced as
> if it were two 64-bit values, which doesn't look right to me.

Hi Steve --

Thanks for looking at cleancache!

Hmmm... yes... it looks like you are right and the cleancache interface
code for ocfs2 has bit-rotted over time and a bad value is being used
for the uuid. This would result in pages not being shared between
clustered VMs on the same physical machine, but would only result in a
lower performance advantage, not any other visible effects, which would
explain why it has gone undiscovered for so long.

Thanks for pointing this out, as it is definitely a bug!

> Also, since the sb has a UUID field in it anyway, is there some reason
> why that cannot be used directly rather than passing the uuid as a
> separate variable?

Forgive me, but I am not a clustered FS expert (even for ocfs2): if the
UUID field in the sb is always 128 bits and is always identical for all
cluster nodes, and this fact is consistent across all clustered FSs at
mount time, I agree that there is no need to pass the uuid as a
parameter to cleancache_init_shared_fs, as it can be derived in the body
of cleancache_init_shared_fs and then passed to
__cleancache_init_shared_fs (which only cares that it gets 128 bits and
probably has no business digging through a superblock). OTOH, this call
is made only once per mount, so there's no significant advantage in
changing this... or am I missing something?

Thanks,
Dan

====
Thanks... for the memory!
I really could use more / my throughput's on the floor
The balloon is flat / my swap disk's fat / I've OOMs in store
Overcommitted so much
(with apologies to Bob Hope)
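Sketched out, the suggestion above would amount to something like the
following, assuming struct super_block's 16-byte s_uuid field is populated at
mount time for the filesystems that care; this is a hypothetical variant of
the interface, not the one that was in the tree at the time:

	/* Derive the 128-bit UUID from the superblock itself instead of
	 * taking it as a separate argument. */
	static inline void cleancache_init_shared_fs(struct super_block *sb)
	{
		__cleancache_init_shared_fs((char *)sb->s_uuid, sb);
	}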
* RE: Cleancache and shared filesystems
From: Steven Whitehouse @ 2011-05-27 16:19 UTC
To: Dan Magenheimer; +Cc: linux-kernel, sunil.mushran

Hi,

On Fri, 2011-05-27 at 08:31 -0700, Dan Magenheimer wrote:
> > From: Steven Whitehouse [mailto:swhiteho@redhat.com]
> > Sent: Friday, May 27, 2011 7:52 AM
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Cleancache and shared filesystems
> >
> > Hi,
> >
> > I'm trying to figure out what I would need to do in order to get GFS2
> > to work with cleancache. Looking at the OCFS2 implementation leaves me
> > with some questions about how this is supposed to work. The docs say
> > that the cleancache_init_shared_fs() function is supposed to take a
> > 128-bit UUID plus the sb.
> >
> > In OCFS2 it is passed a pointer to a 32-bit little-endian quantity as
> > the UUID:
> >
> > 	__le32 uuid_net_key;
> >
> > 	...
> >
> > 	memcpy(&uuid_net_key, di->id2.i_super.s_uuid, sizeof(uuid_net_key));
> >
> > 	...
> >
> > 	cleancache_init_shared_fs((char *)&uuid_net_key, sb);
> >
> > and in the Xen backend driver this then appears to be dereferenced as
> > if it were two 64-bit values, which doesn't look right to me.
>
> Hi Steve --
>
> Thanks for looking at cleancache!
>
> Hmmm... yes... it looks like you are right and the cleancache
> interface code for ocfs2 has bit-rotted over time and a bad value
> is being used for the uuid. This would result in pages not being
> shared between clustered VMs on the same physical machine, but
> would only result in a lower performance advantage, not any
> other visible effects, which would explain why it has gone
> undiscovered for so long.
>
> Thanks for pointing this out, as it is definitely a bug!

Ok, thanks for confirming.

> > Also, since the sb has a UUID field in it anyway, is there some reason
> > why that cannot be used directly rather than passing the uuid as a
> > separate variable?
>
> Forgive me, but I am not a clustered FS expert (even for ocfs2): if the
> UUID field in the sb is always 128 bits and is always identical for all
> cluster nodes, and this fact is consistent across all clustered FSs at
> mount time, I agree that there is no need to pass the uuid as a
> parameter to cleancache_init_shared_fs, as it can be derived in the body
> of cleancache_init_shared_fs and then passed to
> __cleancache_init_shared_fs (which only cares that it gets 128 bits and
> probably has no business digging through a superblock). OTOH, this call
> is made only once per mount, so there's no significant advantage in
> changing this... or am I missing something?
>
> Thanks,
> Dan

The point was really just a question to see if I'd understood what was
intended at this point of the code. It might be cleaner though to
introduce an sb flag to say whether a particular fs is shared or not as
a generic feature. Then the same init function could be used for both
shared and non-shared filesystems, presumably?

The way that GFS2 has worked in terms of unique filesystem IDs is to
have a filesystem "name", which is a combination of a cluster name and a
filesystem-specific part separated by a colon. This has been used as the
identifier which gives the unique ID for any particular filesystem, and
it is also the volume label for the filesystem.

In the early GFS2 code, we introduced, in addition, a UUID as well. So
that should also be a unique ID across the cluster. That does mean that
it is possible (though very unlikely) that there will be GFS2
filesystems with a zero UUID in existence. That is easily fixable,
though, with tunegfs2.

So I think that the UUID is ok for this particular purpose, but if it
were possible to use the filesystem "name" instead, that would be more
consistent with the rest of GFS2. I don't think it's a big issue though.

I suspect that for GFS2 we'd need a patch looking something like this
(untested) based on what I think is the case so far:

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 8ac9ae1..e807850 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -18,6 +18,7 @@
 #include <linux/mount.h>
 #include <linux/gfs2_ondisk.h>
 #include <linux/quotaops.h>
+#include <linux/cleancache.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -1187,6 +1188,12 @@ static int fill_super(struct super_block *sb, struct gfs2_args *args, int silent
 
 	gfs2_glock_dq_uninit(&mount_gh);
 	gfs2_online_uevent(sdp);
+	if (ls->ls_ops == &gfs2_dlm_ops) {
+		if (gfs2_uuid_valid(sb->s_uuid))
+			cleancache_init_shared_fs(sb->s_uuid, sb);
+	} else {
+		cleancache_init_fs(sb);
+	}
 	return 0;
 
 fail_threads:

I would also be interested to know if there are any plans for a KVM
backend for cleancache,

Steve.
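For context, the gfs2_uuid_valid() helper referenced in the patch is assumed
here to simply report whether the superblock UUID is non-zero; roughly (a
sketch from the fs/gfs2 code of that era, shown for illustration rather than
as the exact source):

	static int gfs2_uuid_valid(const u8 *uuid)
	{
		int x;

		for (x = 0; x < 16; x++) {
			if (uuid[x])
				return 1;
		}
		return 0;
	}

With that, the snippet above only registers a filesystem as shared with
cleancache when it actually has a usable 128-bit identifier; a zero UUID
falls through with no cleancache registration at all.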
* RE: Cleancache and shared filesystems
From: Dan Magenheimer @ 2011-05-27 16:31 UTC
To: Steven Whitehouse; +Cc: linux-kernel, Sunil Mushran

> > Forgive me, but I am not a clustered FS expert (even for ocfs2): if the
> > UUID field in the sb is always 128 bits and is always identical for all
> > cluster nodes, and this fact is consistent across all clustered FSs at
> > mount time, I agree that there is no need to pass the uuid as a
> > parameter to cleancache_init_shared_fs, as it can be derived in the body
> > of cleancache_init_shared_fs and then passed to
> > __cleancache_init_shared_fs (which only cares that it gets 128 bits and
> > probably has no business digging through a superblock). OTOH, this call
> > is made only once per mount, so there's no significant advantage in
> > changing this... or am I missing something?
>
> The point was really just a question to see if I'd understood what was
> intended at this point of the code. It might be cleaner though to
> introduce an sb flag to say whether a particular fs is shared or not as
> a generic feature. Then the same init function could be used for both
> shared and non-shared filesystems, presumably?

True. I believe I had just one function long ago, but an early reviewer
insisted that:

	func_does_this()
	func_does_this_and_also_X()

was more proper in the kernel than

	func_does_this(parameter_selecting_X_or_notX)

> The way that GFS2 has worked in terms of unique filesystem IDs is to
> have a filesystem "name", which is a combination of a cluster name and a
> filesystem-specific part separated by a colon. This has been used as the
> identifier which gives the unique ID for any particular filesystem, and
> it is also the volume label for the filesystem.
>
> In the early GFS2 code, we introduced, in addition, a UUID as well. So
> that should also be a unique ID across the cluster. That does mean that
> it is possible (though very unlikely) that there will be GFS2
> filesystems with a zero UUID in existence. That is easily fixable,
> though, with tunegfs2.
>
> So I think that the UUID is ok for this particular purpose, but if it
> were possible to use the filesystem "name" instead, that would be more
> consistent with the rest of GFS2. I don't think it's a big issue though.
>
> I suspect that for GFS2 we'd need a patch looking something like this
> (untested) based on what I think is the case so far:
>
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 8ac9ae1..e807850 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -18,6 +18,7 @@
>  #include <linux/mount.h>
>  #include <linux/gfs2_ondisk.h>
>  #include <linux/quotaops.h>
> +#include <linux/cleancache.h>
>
>  #include "gfs2.h"
>  #include "incore.h"
> @@ -1187,6 +1188,12 @@ static int fill_super(struct super_block *sb, struct gfs2_args *args, int silent
>
>  	gfs2_glock_dq_uninit(&mount_gh);
>  	gfs2_online_uevent(sdp);
> +	if (ls->ls_ops == &gfs2_dlm_ops) {
> +		if (gfs2_uuid_valid(sb->s_uuid))
> +			cleancache_init_shared_fs(sb->s_uuid, sb);

Should this be &sb->s_uuid[0]? (Or I guess it is the same thing, since
it is an array.)

> +	} else {
> +		cleancache_init_fs(sb);
> +	}
>  	return 0;
>
>  fail_threads:
>
> I would also be interested to know if there are any plans for a KVM
> backend for cleancache,

I've had a couple of inquiries and have actually done some work (not
released yet) on a multiple-physical-machine user for cleancache (called
RAMster) that extends tmem.c in zcache to support multiple clients,
which would make a KVM implementation fairly easy. But AFAIK, nobody has
started developing a KVM backend... are you interested?

Dan
* Re: Cleancache and shared filesystems
From: Joel Becker @ 2011-05-27 23:33 UTC
To: Steven Whitehouse; +Cc: Dan Magenheimer, linux-kernel, sunil.mushran

On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> +	if (ls->ls_ops == &gfs2_dlm_ops) {
> +		if (gfs2_uuid_valid(sb->s_uuid))
> +			cleancache_init_shared_fs(sb->s_uuid, sb);
> +	} else {
> +		cleancache_init_fs(sb);
> +	}

Hey Dan,
	Steven makes a good point here. ocfs2 could also take advantage
of local filesystem behavior when running in local mode.

Joel

--
Life's Little Instruction Book #69
	"Whistle"

http://www.jlbec.org/
jlbec@evilplan.org
* Re: Cleancache and shared filesystems
From: Steven Whitehouse @ 2011-05-31 8:58 UTC
To: Joel Becker; +Cc: Dan Magenheimer, linux-kernel, sunil.mushran

Hi,

On Fri, 2011-05-27 at 16:33 -0700, Joel Becker wrote:
> On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> > +	if (ls->ls_ops == &gfs2_dlm_ops) {
> > +		if (gfs2_uuid_valid(sb->s_uuid))
> > +			cleancache_init_shared_fs(sb->s_uuid, sb);
> > +	} else {
> > +		cleancache_init_fs(sb);
> > +	}
>
> Hey Dan,
> 	Steven makes a good point here. ocfs2 could also take advantage
> of local filesystem behavior when running in local mode.
>
> Joel

There is a further issue as well: cleancache will only work when all
nodes can see the same shared cache, so we will need a mount option to
disable cleancache in the case where we have (for example) a cluster of
virtual machines split over multiple physical hosts.

In fact, on the principle of least surprise, I think this had better
default to off and be enabled explicitly. Otherwise people will be able
to shoot themselves in the foot very easily, since there is no automatic
way that I can see to verify that all nodes are looking at the same
cache.

Steve.
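One way the opt-in Steven describes might look in GFS2's fill_super(), reusing
the snippet from his earlier patch; ar_cleancache is a hypothetical mount
argument for illustration, not an option that exists in GFS2:

	/* Hypothetical opt-in: only register with cleancache when the
	 * administrator has explicitly asked for it at mount time. */
	if (sdp->sd_args.ar_cleancache) {
		if (ls->ls_ops == &gfs2_dlm_ops) {
			if (gfs2_uuid_valid(sb->s_uuid))
				cleancache_init_shared_fs(sb->s_uuid, sb);
		} else {
			cleancache_init_fs(sb);
		}
	}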
* RE: Cleancache and shared filesystems
From: Dan Magenheimer @ 2011-05-31 15:13 UTC
To: Steven Whitehouse, Joel Becker; +Cc: linux-kernel, Sunil Mushran

> From: Steven Whitehouse [mailto:swhiteho@redhat.com]
> Sent: Tuesday, May 31, 2011 2:58 AM
> To: Joel Becker
> Cc: Dan Magenheimer; linux-kernel@vger.kernel.org; Sunil Mushran
> Subject: Re: Cleancache and shared filesystems
>
> Hi,
>
> On Fri, 2011-05-27 at 16:33 -0700, Joel Becker wrote:
> > On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> > > +	if (ls->ls_ops == &gfs2_dlm_ops) {
> > > +		if (gfs2_uuid_valid(sb->s_uuid))
> > > +			cleancache_init_shared_fs(sb->s_uuid, sb);
> > > +	} else {
> > > +		cleancache_init_fs(sb);
> > > +	}
> >
> > Hey Dan,
> > 	Steven makes a good point here. ocfs2 could also take advantage
> > of local filesystem behavior when running in local mode.
> >
> > Joel
>
> There is a further issue as well: cleancache will only work when all
> nodes can see the same shared cache, so we will need a mount option to
> disable cleancache in the case where we have (for example) a cluster of
> virtual machines split over multiple physical hosts.
>
> In fact, on the principle of least surprise, I think this had better
> default to off and be enabled explicitly. Otherwise people will be able
> to shoot themselves in the foot very easily, since there is no
> automatic way that I can see to verify that all nodes are looking at
> the same cache.

Though it's been nearly two years now since I thought through this, I
remember being concerned about that issue too. But, for ocfs2 at least,
cleancache hooks are embedded in all the right places in the VFS, so the
ocfs2 code that cross-invalidates stale page cache pages on different
nodes also keeps cleancache coherent, and it all just worked, whether
VMs are split across hosts or not. This may or may not be true for
GFS2... for example, btrfs required one cleancache hook outside of the
VFS to function correctly. Again, I am pretty ignorant about shared
filesystems, so please correct me if I am missing anything important.

Also, I checked, and Xen tmem (the only current user of cleancache for
which cluster-sharing makes sense) uses 128-bit -1 as its internal
"don't share" indicator. So you are correct that multiple non-shared VMs
using uuid==0 could potentially cause data corruption if they share a
physical machine, and your code snippet above is needed (assuming
gfs2_uuid_valid returns false for uuid==0?).

Dan

---
Thanks... for the memory!
I really could use more / my throughput's on the floor
The balloon is flat / my swap disk's fat / I've OOMs in store
Overcommitted so much
(with apologies to Bob Hope)
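To make the reserved-values point above concrete, a purely illustrative guard
that a filesystem could apply before electing to share, assuming all-zero
means "no UUID set" and all-ones is reserved by the backend; this is a
hypothetical helper, not code from the thread or from any tree:

	/* Hypothetical check: treat all-zero and all-ones UUIDs as
	 * unsuitable for identifying a shared cleancache pool. */
	static bool uuid_shareable(const u8 uuid[16])
	{
		u8 or_all = 0x00, and_all = 0xff;
		int i;

		for (i = 0; i < 16; i++) {
			or_all |= uuid[i];
			and_all &= uuid[i];
		}
		return or_all != 0x00 && and_all != 0xff;
	}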
* RE: Cleancache and shared filesystems
From: Dan Magenheimer @ 2011-05-31 14:51 UTC
To: Joel Becker, Steven Whitehouse; +Cc: linux-kernel, Sunil Mushran

> From: Joel Becker [mailto:jlbec@evilplan.org]
> Sent: Friday, May 27, 2011 5:34 PM
> To: Steven Whitehouse
> Cc: Dan Magenheimer; linux-kernel@vger.kernel.org; Sunil Mushran
> Subject: Re: Cleancache and shared filesystems
>
> On Fri, May 27, 2011 at 05:19:39PM +0100, Steven Whitehouse wrote:
> > +	if (ls->ls_ops == &gfs2_dlm_ops) {
> > +		if (gfs2_uuid_valid(sb->s_uuid))
> > +			cleancache_init_shared_fs(sb->s_uuid, sb);
> > +	} else {
> > +		cleancache_init_fs(sb);
> > +	}
>
> Hey Dan,
> 	Steven makes a good point here. ocfs2 could also take advantage
> of local filesystem behavior when running in local mode.

Hi Joel --

I guess the semantics need to be more clearly defined (or perhaps
changed if the shared-fs community wants), but if
cleancache_init_shared_fs is called only by a single node, cleancache
still enables all the same functionality** as cleancache_init_fs.

I'm not sure I fully understand the semantics of local mode though, so
please clarify if you think I am wrong or misunderstanding your point.

Thanks,
Dan

---
Thanks... for the memory!
I really could use more / my throughput's on the floor
The balloon is flat / my swap disk's fat / I've OOMs in store
Overcommitted so much
(with apologies to Bob Hope)
* Re: Cleancache and shared filesystems
From: Joel Becker @ 2011-05-31 21:54 UTC
To: Dan Magenheimer; +Cc: Steven Whitehouse, linux-kernel, Sunil Mushran

On Tue, May 31, 2011 at 07:51:36AM -0700, Dan Magenheimer wrote:
> > Hey Dan,
> > 	Steven makes a good point here. ocfs2 could also take advantage
> > of local filesystem behavior when running in local mode.
>
> I guess the semantics need to be more clearly defined (or perhaps
> changed if the shared-fs community wants), but if
> cleancache_init_shared_fs is called only by a single node, cleancache
> still enables all the same functionality** as cleancache_init_fs.

I don't see the ** reference in the footnotes ;-)

You're saying that, for a single caller, you will properly keep the
pagecache bits around in cleancache as you shrink the balloon and give
them back when requested? So an ocfs2 calling cleancache_init_shared_fs()
in only one VM will have the same page lifetimes (including life inside
cleancache but not in the guest pagecache) as a similar ext3? If so,
there are no changes needed at all.

> I'm not sure I fully understand the semantics of local mode though, so
> please clarify if you think I am wrong or misunderstanding your point.

ocfs2 local mode means that it is not a cluster filesystem. The cluster
services are not enabled, and ocfs2 behaves like xfs/extN/btrfs.

Joel

--
"Lately I've been talking in my sleep.
 Can't imagine what I'd have to say.
 Except my world will be right
 When love comes back my way."

http://www.jlbec.org/
jlbec@evilplan.org