Re: [PATCH] ndctl: add a BTT check utility

From: "Verma, Vishal L" <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: "jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org"
	<jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org"
	<linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org>
Subject: Re: [PATCH] ndctl: add a BTT check utility
Date: Tue, 24 Jan 2017 20:59:40 +0000	[thread overview]
Message-ID: <1485291510.4857.8.camel@intel.com> (raw)
In-Reply-To: <x49vat5ljtz.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>

On Mon, 2017-01-23 at 13:20 -0500, Jeff Moyer wrote:
> 
> 
> OK, yeah, let's definitely put tests in the tree for anything that is
> easy to do.  Can't we create test nfits large enough for multiple
> arenas?  One thing we could do is keep around bogus metadata images
> and
> just shove them into the raw device.  Only if that's easier than
> programmatically perturbuing things.
> 
> Also, we may consider a btt metadata dump utility, much like file
> systems have.1

Yep I will work on tests. No, the ARENA_MAX_SIZE is 512G, so the kernel
won't create a second arena unless the namespace is larger than that,
and nfit_test namespaces are only 32M or so..

Dan and I were also talking about a a metadata dump utility similar to
filesystems. I'll work on this next.

> 
> > > Checking should not be an optional feature.  Just always build it
> > > in.
> > 
> > You mean checking metadata when bringing up the namespace? The
> > kernel
> > does essential checks of course, but do you mean the more involved
> > ones
> > like log/map entry bounds, the backup info block etc?
> 
> I mean't the "check" ndctl command shouldn't be selectable via
> configure.  I think Dan had the same comment.

Ah yes I'll make that change.

> 
> > > There seems to be a lot of repeated code sequences around reading
> > > and
> > > verifying the btt info blocks.  Perhaps that can be factored
> > > out?  Maybe
> > > get rid of the first_sb special snowflake and roll that into the
> > > arena
> > > discovery?
> 
> You didn't comment on this one... did you miss it, or will you
> investigate, or do you think it's a bogus comment?

I just missed it. I did notice the read+verify sequence, maybe in the
least, I could wrap those two into a wrapper function that combines the
two operations.

I think the stuff we do in first_sb is required to find (recover) the
info block if one is not readily found. It is a special snowflake just
because it uses some heuristics to try and find a usable info block,
where as discover_arenas just walks through the namespace, and knows
what to expect at each step. I'll stare at it harder and see if I can
consolidate/refactor anything :)

> 
> > > > diff --git a/ndctl/btt-structs.h b/ndctl/btt-structs.h
> > > > +#define IB_FLAG_ERROR 0x00000001
> > > > +#define IB_FLAG_ERROR_MASK 0x00000001
> > > 
> > > IB_FLAG_ERROR is defined but never used/checked.  Of course,
> > > neither
> > > the
> > > kernel nor this utility will set that flag.  We should probably
> > > decide
> > > whether it makes sense to do so.
> > 
> > True, will remove.
> 
> Well, hold on.  This flag exists in the spec, so at the very least we
> should consider where it should be set, and what to do if it /is/ set.
> Let's not just remove it.

I said that because we know currently in the kernel we have no way of
setting it. I think the first thing we'd need to do is figure out what
cases it will be set under.. And then figure out what the checker can do
about it.

> 
> > > I find it odd that the caller has to pass in the offset 4k before
> > > the
> > > start of the info block.  It seems natural for the first info
> > > block,
> > > but
> > > for the backup, it's totally bogus.  ;-)
> > 
> > I agree, and I hated it too :)
> > But I wanted to keep the +4K addition in the lowest level
> > functions..
> > Have a better idea? :)  This does kinda kill the idea of "oh the 4k
> > offsetting will be completely hidden from the higher level logic"
> > 
> > The opposite of this would be sprinkling a number of +4K everywhere
> > else, and I think with this we only have to do the inverse in a
> > couple
> > of places..
> 
> I guess I'd have to see the result of inverting the roles.  Feel free
> to
> hold off on this one, I don't think it's critical.
> 
> > > Again, z and e are unused.  Nuke 'em.
> > 
> > True, I just copied these over from the kernel implementation, and
> > wasn't sure if I'd ever be using z and e. Currently the kernel
> > doesn't
> > even set those, so it is a bit redundant. If the kernel does grow
> > the
> > ability to use them in future, there could be use for these, but I
> > suppose I should remove the dead code until that happens :)
> 
> Every single state of those two bits is valid.  How would you ever do
> something useful with them?

Agreed, I don't see a scenario where the checker would attempt to modify
the bits.
> 
> > > Should we perform a load of the info block after the msync to
> > > ensure
> > > it
> > > was written (to trigger an MCE if one is going to happen)?  That
> > > same
> > > question could be asked for each place where an update is made.
> > 
> > Good question. I guess the probability of those locations having a
> > bad
> > cell is low because we did just read them to find out there is a
> > problem, and subsequently writing them should have a low probability
> > of
> > triggering a CMCI. But re-reading to make sure it was correctly
> > written
> > is not a bad idea. Dan, thoughts?
> 
> This one's not critical.  But it's a checker tool, so it might be nice
> to be paranoid.
> 
> > > This isn't an error.  We expect this to happen if power is
> > > suddenly
> > > lost, and the format is built to recover from it.  If you log it
> > > at
> > > all, I would make it an informational message.
> > > 
> > > I wonder whether we should even fix it.  I worry about having the
> > > same
> > > recovery code in multiple places (the kernel and here).  What do
> > > you
> > > think?
> > 
> > Hm, I agree about the message, in that it is routine. But I added
> > the
> > repairing so that a user may just decide to do a check using this
> > after
> > a power failure, and we can easily fix it when we have the chance. I
> > do
> > agree about the check/correction in multiple places, I wouldn't be
> > too
> > opposed to removing it entirely from here.
> 
> If you remove it, put a big comment in there.  :)  I'll leave the
> ultimate decision up to you.  It's not complex recovery, so we
> *should*
> be okay to have that logic in multiple places...

I can remove it, this makes the entire function (btt_check_log_map)
disappear, and I'll put a comment in the main checker loop saying that
this is a normal condition.

> 
> > > > +	/*
> > > > +	 * Attempt 3: use info2off as-is, and check if we find
> > > > a
> > > > valid info
> > > > +	 * block at that location.
> > > > +	 */
> > > > +	 offset = le32toh(btt_sb[0].info2off);
> > > > +	 if (offset) {
> > > 
> > > Maybe do some bounds checking on offset?
> > 
> > If the offset is bogus either the read will fail or the verify will,
> > but
> > yes I could add checking too.
> 
> Again, not critical.  I'll leave it up to you.
> 
> > > > +		printf("Attempting to recover info-block from
> > > > info2
> > > > offset (as-is) %#llx\n", offset);
> > > > +		rc = btt_read_info(bttc, &btt_sb[1], offset);
> > > > +		if (rc)
> > > > +			goto out;
> > > > +		rc = btt_info_verify(bttc, &btt_sb[1]);
> > > > +		if (rc == 0) {
> > > > +			rc = btt_write_info(bttc, &btt_sb[1],
> > > > 0);
> > > > +			goto out;
> > > > +		}
> > > > +	} else
> > > > +		rc = -ENXIO;
> > > 
> > > You could also look for the primary info blocks for the other
> > > arenas.
> > 
> > That seems kind of difficult as we don't really have any information
> > on
> > where the other arenas might be.
> > We could do a full calculation of "this is where I would lay out the
> > arenas if I had this rawsize", but I was hoping to avoid doing
> > that..
> 
> OK.  If we get to here, things are pretty well screwed.  Feel free to
> ignore that suggestion.
> 
> > > > + out:
> > > > +	free(btt_sb);
> > > > +	return rc;
> > > > +}
> > > > +
> > > > +static int namespace_check(struct ndctl_region *region,
> > > > +		struct ndctl_namespace *ndns)
> > > > +{
> > > 
> > > namespace_check sounds generic, but it really only works for the
> > > btt.
> > > Also, should we verify that the namespace was configured with a
> > > btt
> > > before even attempting to check it?
> > 
> > True. I did think of checking if this is a btt namespace before
> > checking, but I think if the namespace is disabled, my understanding
> > is
> > ndctl and libndctl have no idea whether it is a btt namespace or
> > not.
> > The BTT is discovered by the kernel when doing it's probe, and once
> > it
> > is up, ndctl can see that, but until then, it can really only find
> > out
> > if there is a valid BTT by trying to read the metadata itself.
> 
> Yeah.  At some point I guess we have to trust the admin not to try to
> fix the btt on a non-btt device.  However, if we allowed them to run
> the
> command on a pmemNs device, that would reduce the opportunities for
> failure.

I'm liking the --force option for this (running on a 'live' namespace),
and even if someone does run it on a non-BTT namespace, we fail quickly
enough with a helpful message..

> 
> > > > +	raw_mode = ndctl_namespace_get_raw_mode(ndns);
> > > > +	ndctl_namespace_set_raw_mode(ndns, 1);
> > > > +	rc = ndctl_namespace_enable(ndns);
> > > > +	if (rc != 0)
> > > > +		error("%s: failed to enable raw mode\n",
> > > > devname);
> > > > +	ndctl_namespace_set_raw_mode(ndns, raw_mode);
> > > 
> > > What the heck do the above 6 lines do?  Please explain.
> > 
> > Linda had the same question so I suppose I should at least have a
> > comment there :)
> > But the idea is we get the existing mode (raw or not), and save it
> > in
> > raw_mode. Then try to enable it with raw_mode = 1. If it fails,
> > restore
> > the saved raw_mode.
> 
> I had more basic questions, like what the heck is going on?  :-) I've
> done some more reading, and this is what I think is happening.
> 
> 	/* In typical usage, the current raw_mode should be false. */
> 	raw_mode = ndctl_namespace_get_raw_mode(ndns);
> 	/*
> 	 * Putting the namespace into raw mode will allow us to access
> 	 * the btt metadata.
> 	 */
> 	ndctl_namespace_set_raw_mode(ndns, 1);
> 	/*
> 	 * Now enable the namespace.  This will result in a pmem device
> 	 * node showing up in /dev that is in raw mode.
> 	 */
> 	rc = ndctl_namespace_enable(ndns);
> 	if (rc != 0)
> 		error("%s: failed to enable raw mode\n", devname);
> 	sprintf(path, "/dev/%s",
> ndctl_namespace_get_block_device(ndns));
> 
> And move the resetting of raw_mode down into the out: label.  That
> will
> go a long way towards clearing up the confusion I had when reading the
> code for the first time.  I think you should also mention that a
> pre-condition for this function is that the namespace is disabled, and
> you should check for it and provide a useful error message if it's
> already enabled.

Yes your interpretation is correct, and I will steal some of the above
comments and put them in :)

> 
> > Maybe this is a bit pointless? But we should try and keep things in
> > the
> > same state the user had, right? :)
> 
> I definitely agree we should keep things in the same state.  I really
> didn't have a good grasp of what was going on there, and comments
> would
> have saved me some time digging through both ndctl and kernel code.

> 
> Cheers,
> Jeff
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm