Re: [PATCH] ndctl: add a BTT check utility

From: "Verma, Vishal L" <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: "jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org"
	<jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org"
	<linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org>
Subject: Re: [PATCH] ndctl: add a BTT check utility
Date: Sat, 21 Jan 2017 00:41:33 +0000	[thread overview]
Message-ID: <1484959224.4857.6.camel@intel.com> (raw)
In-Reply-To: <x49lgu52uk0.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>

On Fri, 2017-01-20 at 18:14 -0500, Jeff Moyer wrote:
> Hi, Vishal,
> 
> Vishal Verma <vishal.l.verma@intel.com> writes:
> 
> > Add the check-namespace command to ndctl. This will check the BTT
> > metadata layout for the given namespace, and if requested, correct
> > any
> > errors found. Not all metadata corruption is detectable or fixable.
> 
> First, thanks a lot for tackling this!  Overall, the code looks pretty
> clean.  I've got a few general comments, and then you can scroll down
> for code review.  I didn't get through the whole thing, so I'll try to
> pick up where I left off next week.

Thanks Jeff!

> 
> How was this tested?  Did you perturb the btt in ways that you expect
> can and can't be fixed?  Can you add tests for that?

Yes, so for some things I tested by essentially hand editing the
metadata. But for other things - mainly multi-arena testing, and the
different recovery attempts for first_sb, I modified the kernel btt to
produce small 8MB arenas, and did the same in the checker. For the
former, those can be 'formalized' as ndctl unit tests, and I will start
doing that. But for the latter, I don't know if there's an easy way to
get full coverage on those cases.

> 
> Checking should not be an optional feature.  Just always build it in.

You mean checking metadata when bringing up the namespace? The kernel
does essential checks of course, but do you mean the more involved ones
like log/map entry bounds, the backup info block etc?

> 
> Why should a user have to disable a namespace in order to run the
> checker?  I guess that's the analog of unmounting an fs?  Still, seems
> like we could just do that for the user.

Mostly so that the user can be sure nothing else is using it when the
check starts? We could disable it for them, but I just thought this
would be safer? An option could be providing a --force option like I
think destroy or reconfig do..

> 
> Can this code in any way be shared with libpmemblk?  They have their
> own
> checker (which does different things), and it seems like a lost
> opportunity to not be able to leverage the same code in both places.

We started out with that, in fact with depending on libpmemblk's checker
and calling into their library, but it was odd for ndctl (a core system
library of sorts) to depend on libpmemblk (an 'application' library).
Then we considered moving the libpmemblk checker parts into libndctl,
and making pmemblk depend on us, but they also used their library on
windows, so we couldn't just move it. Finally we just decided to have a
copy of the code in ndctl, so I rewrote it in a ndctl native way.

> 
> Overall, the code could use more comments.

Ok, I'll add some descriptions to the trickier sections.

> 
> Does this really all belong in builtin-xaction-namespace.c?  It seems
> too btt specific for that.  Maybe it's time to create a new file.
> I'll leave that up to Dan for his preference.

Dan had the same feedback, will move to a new file.

> 
> There seems to be a lot of repeated code sequences around reading and
> verifying the btt info blocks.  Perhaps that can be factored
> out?  Maybe
> get rid of the first_sb special snowflake and roll that into the arena
> discovery?
> 
> > diff --git a/ndctl/btt-structs.h b/ndctl/btt-structs.h
> > +#define IB_FLAG_ERROR 0x00000001
> > +#define IB_FLAG_ERROR_MASK 0x00000001
> 
> IB_FLAG_ERROR is defined but never used/checked.  Of course, neither
> the
> kernel nor this utility will set that flag.  We should probably decide
> whether it makes sense to do so.

True, will remove.

> 
> > +static int btt_read_info(struct btt_chk *bttc, struct btt_sb
> > *btt_sb, __u64 off)
> > +{
> > +	int fd, rc = 0;
> > +	ssize_t size;
> > +
> > +	fd = open(bttc->path, O_RDONLY);
> > +	if (fd < 0) {
> > +		error("unable to open %s: %s\n", bttc->path,
> > strerror(errno));
> > +		return -errno;
> > +	}
> > +
> > +	size = pread(fd, btt_sb, sizeof(*btt_sb), off + SZ_4K);
> > +	if (size != sizeof(*btt_sb)) {
> > +		error("unable to read first info block: %s\n",
> > strerror(errno));
> > +		rc = -errno;
> 
> Short reads will not set errno, so you'll return 0 incorrectly, here.

Good catch, I'll fix.

> 
> > +	}
> > +	close(fd);
> > +	return rc;
> > +}
> 
> I find it odd that the caller has to pass in the offset 4k before the
> start of the info block.  It seems natural for the first info block,
> but
> for the backup, it's totally bogus.  ;-)

I agree, and I hated it too :)
But I wanted to keep the +4K addition in the lowest level functions..
Have a better idea? :)  This does kinda kill the idea of "oh the 4k
offsetting will be completely hidden from the higher level logic"

The opposite of this would be sprinkling a number of +4K everywhere
else, and I think with this we only have to do the inverse in a couple
of places..

> 
> > +static int btt_copy_to_info2(struct arena_info *a)
> > +{
> > +	if (dryrun) {
> > +		error("Arena %d: BTT info2 needs to be restored\n",
> > a->num);
> > +		return dryrun_msg();
> > +	}
> > +	printf("Arena %d: Restoring BTT info2\n", a->num);
> > +	memcpy(a->map.info2, a->map.info, BTT_PG_SIZE);
> 
> missing msync();

Good catch.

> 
> > +static int btt_map_read(struct arena_info *a, __u32 lba, __u32
> > *mapping,
> > +			int *trim, int *error)
> > +{
> > +	__u32 raw_mapping, postmap, ze, z_flag, e_flag;
> > +
> > +	raw_mapping = le32toh(a->map.map[lba]);
> > +	z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
> > +	e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT;
> > +	ze = (z_flag << 1) + e_flag;
> > +	postmap = raw_mapping & MAP_LBA_MASK;
> > +
> > +	/* Reuse the {z,e}_flag variables for *trim and *error */
> > +	z_flag = 0;
> > +	e_flag = 0;
> > +
> > +	switch (ze) {
> > +	case 0:
> > +		/* Initial state. Return postmap = premap */
> > +		*mapping = lba;
> > +		break;
> > +	case 1:
> > +		*mapping = postmap;
> > +		e_flag = 1;
> > +		break;
> > +	case 2:
> > +		*mapping = postmap;
> > +		z_flag = 1;
> > +		break;
> > +	case 3:
> > +		*mapping = postmap;
> > +		break;
> > +	default:
> > +		return -EIO;
> > +	}
> > +
> > +	if (trim)
> > +		*trim = z_flag;
> > +	if (error)
> > +		*error = e_flag;
> > +
> > +	return 0;
> > +}
> 
> No caller uses z or e, so get rid of those args.  Also, your default
> entry in the switch statement can never be hit.  Fixing those, the
> function is just:
> 
> /*
>  * btt_map_lookup
>  *
>  * Given a pre-map Arena Block Address, return the post-map ABA.
>  */
> static u32 btt_map_lookup(struct arena_info *a, __u32 lba)
> {
>         return (le32toh(a->map.map[lba]) & MAP_LBA_MASK);
> }
> 
> You can take or leave the function name change.  Please keep the
> comment, and maybe kernel-doc-ify it.
> 
> > +static int btt_map_write(struct arena_info *a, __u32 lba, __u32
> > mapping,
> > +			__u32 z_flag, __u32 e_flag)
> > +{
> 
> Again, z and e are unused.  Nuke 'em.

True, I just copied these over from the kernel implementation, and
wasn't sure if I'd ever be using z and e. Currently the kernel doesn't
even set those, so it is a bit redundant. If the kernel does grow the
ability to use them in future, there could be use for these, but I
suppose I should remove the dead code until that happens :)

> 
> [snip]
> > +	a->map.map[lba] = htole32(mapping);
> > +	if (msync((void *)rounddown((__u64)&a->map.map[lba],
> > BTT_PG_SIZE),
> > +		BTT_PG_SIZE, MS_SYNC) < 0)
> > +		return errno;
> 
> You are aligning msync calls to the BTT_PG_SIZE, which is wrong.  It
> should be aligned to the system page size.
> 
> In fact, you abuse BTT_PG_SIZE all over the place.  You use it to
> mean:
> - system page size
> - size of the btt info block
> - the btt page size ;-)
> 
> It would be nice if you could clean that up.

Yes.. I was just (ab)using the fact that they're going to be the same,
but I should distinguish between the different usages better. I'll fix
it up.

> 
> Should we perform a load of the info block after the msync to ensure
> it
> was written (to trigger an MCE if one is going to happen)?  That same
> question could be asked for each place where an update is made.

Good question. I guess the probability of those locations having a bad
cell is low because we did just read them to find out there is a
problem, and subsequently writing them should have a low probability of
triggering a CMCI. But re-reading to make sure it was correctly written
is not a bad idea. Dan, thoughts?

> 
> > +static void btt_parse_meta(struct arena_info *arena, struct btt_sb
> > *btt_sb,
> > +				__u64 arena_off)
> > +{
> > +	arena->internal_nlba = le32toh(btt_sb->internal_nlba);
> > +	arena->internal_lbasize = le32toh(btt_sb-
> > >internal_lbasize);
> > +	arena->external_nlba = le32toh(btt_sb->external_nlba);
> > +	arena->external_lbasize = le32toh(btt_sb-
> > >external_lbasize);
> > +	arena->nfree = le32toh(btt_sb->nfree);
> > +	arena->version_major = le16toh(btt_sb->version_major);
> > +	arena->version_minor = le16toh(btt_sb->version_minor);
> > +
> > +	arena->nextoff = (btt_sb->nextoff == 0) ? 0 : (arena_off +
> > +			le64toh(btt_sb->nextoff));
> > +	arena->infooff = arena_off;
> > +	arena->dataoff = arena_off + le64toh(btt_sb->dataoff);
> > +	arena->mapoff = arena_off + le64toh(btt_sb->mapoff);
> > +	arena->logoff = arena_off + le64toh(btt_sb->logoff);
> > +	arena->info2off = arena_off + le64toh(btt_sb->info2off);
> > +
> > +	arena->size = (le64toh(btt_sb->nextoff) > 0)
> > +		? (le64toh(btt_sb->nextoff))
> > +		: (arena->info2off - arena->infooff + BTT_PG_SIZE);
> > +
> > +	arena->flags = le32toh(btt_sb->flags);
> > +}
> 
> Maybe it doesn't belong in the parse routine, but there is no sanity
> checking on this data, and there really should be.

Good point, I'll add some.

> 
> > +/* Check that log entries are self consistent */
> > +static int btt_check_map_entries(struct arena_info *a)
> > +{
> > +	int rc = 0, z, e;
> > +	unsigned int i;
> > +	__u32 mapping;
> > +
> > +	for (i = 0; i < a->external_nlba; i++) {
> > +		rc = btt_map_read(a, i, &mapping, &z, &e);
> > +		if (rc)
> > +			return rc;
> > +		if (mapping >= a->internal_nlba)
> > +			return BTT_MAP_OOB;
> > +	}
> > +	return 0;
> > +}
> 
> libpmemblk's version of this keeps a bitmap of allocated blocks and
> free
> blocks (from the flog), and then generates a warning if there are
> duplicate blocks allocated or unreachable blocks.  I think we should
> do
> the same.

Yep I saw that, I'll add it.

> 
> 
> > +/* Check that each flog entry has the correct corresponding map
> > entry */
> > +static int btt_check_log_map(struct arena_info *a)
> > +{
> > +	unsigned int i;
> > +	__u32 mapping;
> > +	int rc = 0;
> > +
> > +	for (i = 0; i < a->nfree; i++) {
> > +		struct log_entry log;
> > +
> > +		rc = btt_log_read(a, i, &log);
> > +		if (rc)
> > +			return rc;
> > +		rc = btt_map_read(a, log.lba, &mapping, NULL,
> > NULL);
> > +		if (rc)
> > +			return rc;
> > +
> > +		/*
> > +		 * Case where the flog was written, but map
> > couldn't be updated.
> > +		 * The kernel should also be able to detect and fix
> > this condition
> > +		 */
> > +		if (log.new_map != mapping && log.old_map ==
> > mapping) {
> > +			error("arena %d: log[%d].new_map (%#x)
> > doesn't match map[%#x] (%#x)",
> > +				a->num, i, log.new_map, log.lba,
> > mapping);
> 
> This isn't an error.  We expect this to happen if power is suddenly
> lost, and the format is built to recover from it.  If you log it at
> all,
> I would make it an informational message.
> 
> I wonder whether we should even fix it.  I worry about having the same
> recovery code in multiple places (the kernel and here).  What do you
> think?

Hm, I agree about the message, in that it is routine. But I added the
repairing so that a user may just decide to do a check using this after
a power failure, and we can easily fix it when we have the chance. I do
agree about the check/correction in multiple places, I wouldn't be too
opposed to removing it entirely from here.

> 
> > +static int btt_recover_first_sb(struct btt_chk *bttc)
> > +{
> 
> [snip]
> > +	/*
> > +	 * Attempt 2: From the very end of 'rawsize', try to copy
> > the fields
> > +	 * that are constant in every arena (only valid when
> > multiple arenas
> > +	 * are present)
> > +	 */
> > +	if (est_arenas > 1) {
> > +		offset = rounddown(bttc->rawsize - remaining,
> > BTT_PG_SIZE) -
> > +			2 * BTT_PG_SIZE;
> > +		printf("Attempting to recover info-block from end
> > offset %#llx\n", offset);
> > +		rc = btt_read_info(bttc, &btt_sb[1], offset);
> > +		if (rc)
> > +			goto out;
> 
> How about we verify the info block before blindlly copying in the
> data?
> I know you verify the result, but it's weird.

Fair point, and should be easy to do.

> 
> > +		/* copy over the arena0 specific fields from
> > btt_sb[0] */
> > +		btt_sb[1].flags = btt_sb[0].flags;
> > +		btt_sb[1].external_nlba = btt_sb[0].external_nlba;
> > +		btt_sb[1].internal_nlba = btt_sb[0].internal_nlba;
> > +		btt_sb[1].nextoff = btt_sb[0].nextoff;
> > +		btt_sb[1].dataoff = btt_sb[0].dataoff;
> > +		btt_sb[1].mapoff = btt_sb[0].mapoff;
> > +		btt_sb[1].logoff = btt_sb[0].logoff;
> > +		btt_sb[1].info2off = btt_sb[0].info2off;
> > +		btt_sb[1].checksum = btt_sb[0].checksum;
> > +		rc = btt_info_verify(bttc, &btt_sb[1]);
> > +		if (rc == 0) {
> > +			rc = btt_write_info(bttc, &btt_sb[1], 0);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Attempt 3: use info2off as-is, and check if we find a
> > valid info
> > +	 * block at that location.
> > +	 */
> > +	 offset = le32toh(btt_sb[0].info2off);
> > +	 if (offset) {
> 
> Maybe do some bounds checking on offset?

If the offset is bogus either the read will fail or the verify will, but
yes I could add checking too.

> 
> > +		printf("Attempting to recover info-block from info2
> > offset (as-is) %#llx\n", offset);
> > +		rc = btt_read_info(bttc, &btt_sb[1], offset);
> > +		if (rc)
> > +			goto out;
> > +		rc = btt_info_verify(bttc, &btt_sb[1]);
> > +		if (rc == 0) {
> > +			rc = btt_write_info(bttc, &btt_sb[1], 0);
> > +			goto out;
> > +		}
> > +	} else
> > +		rc = -ENXIO;
> 
> You could also look for the primary info blocks for the other arenas.

That seems kind of difficult as we don't really have any information on
where the other arenas might be.
We could do a full calculation of "this is where I would lay out the
arenas if I had this rawsize", but I was hoping to avoid doing that..

> 
> > + out:
> > +	free(btt_sb);
> > +	return rc;
> > +}
> > +
> > +static int namespace_check(struct ndctl_region *region,
> > +		struct ndctl_namespace *ndns)
> > +{
> 
> namespace_check sounds generic, but it really only works for the btt.
> Also, should we verify that the namespace was configured with a btt
> before even attempting to check it?

True. I did think of checking if this is a btt namespace before
checking, but I think if the namespace is disabled, my understanding is
ndctl and libndctl have no idea whether it is a btt namespace or not.
The BTT is discovered by the kernel when doing it's probe, and once it
is up, ndctl can see that, but until then, it can really only find out
if there is a valid BTT by trying to read the metadata itself.

If there is a future, different metadata format, I'd imagine its
checking would also begin here, but branch off based on which format was
found.

> 
> > +	raw_mode = ndctl_namespace_get_raw_mode(ndns);
> > +	ndctl_namespace_set_raw_mode(ndns, 1);
> > +	rc = ndctl_namespace_enable(ndns);
> > +	if (rc != 0)
> > +		error("%s: failed to enable raw mode\n", devname);
> > +	ndctl_namespace_set_raw_mode(ndns, raw_mode);
> 
> What the heck do the above 6 lines do?  Please explain.

Linda had the same question so I suppose I should at least have a
comment there :)
But the idea is we get the existing mode (raw or not), and save it in
raw_mode. Then try to enable it with raw_mode = 1. If it fails, restore
the saved raw_mode.

Maybe this is a bit pointless? But we should try and keep things in the
same state the user had, right? :)

> 
> Thanks!
> Jeff

Thanks for the review, I'll start working on the changes and send out a
v2 next week!
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm