From: "Verma, Vishal L" <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: "jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org"
<jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org"
<linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org>
Subject: Re: [PATCH] ndctl: add a BTT check utility
Date: Tue, 24 Jan 2017 20:59:40 +0000 [thread overview]
Message-ID: <1485291510.4857.8.camel@intel.com> (raw)
In-Reply-To: <x49vat5ljtz.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
On Mon, 2017-01-23 at 13:20 -0500, Jeff Moyer wrote:
>
>
> OK, yeah, let's definitely put tests in the tree for anything that is
> easy to do. Can't we create test nfits large enough for multiple
> arenas? One thing we could do is keep around bogus metadata images
> and
> just shove them into the raw device. Only if that's easier than
> programmatically perturbuing things.
>
> Also, we may consider a btt metadata dump utility, much like file
> systems have.1
Yep I will work on tests. No, the ARENA_MAX_SIZE is 512G, so the kernel
won't create a second arena unless the namespace is larger than that,
and nfit_test namespaces are only 32M or so..
Dan and I were also talking about a a metadata dump utility similar to
filesystems. I'll work on this next.
>
> > > Checking should not be an optional feature. Just always build it
> > > in.
> >
> > You mean checking metadata when bringing up the namespace? The
> > kernel
> > does essential checks of course, but do you mean the more involved
> > ones
> > like log/map entry bounds, the backup info block etc?
>
> I mean't the "check" ndctl command shouldn't be selectable via
> configure. I think Dan had the same comment.
Ah yes I'll make that change.
>
> > > There seems to be a lot of repeated code sequences around reading
> > > and
> > > verifying the btt info blocks. Perhaps that can be factored
> > > out? Maybe
> > > get rid of the first_sb special snowflake and roll that into the
> > > arena
> > > discovery?
>
> You didn't comment on this one... did you miss it, or will you
> investigate, or do you think it's a bogus comment?
I just missed it. I did notice the read+verify sequence, maybe in the
least, I could wrap those two into a wrapper function that combines the
two operations.
I think the stuff we do in first_sb is required to find (recover) the
info block if one is not readily found. It is a special snowflake just
because it uses some heuristics to try and find a usable info block,
where as discover_arenas just walks through the namespace, and knows
what to expect at each step. I'll stare at it harder and see if I can
consolidate/refactor anything :)
>
> > > > diff --git a/ndctl/btt-structs.h b/ndctl/btt-structs.h
> > > > +#define IB_FLAG_ERROR 0x00000001
> > > > +#define IB_FLAG_ERROR_MASK 0x00000001
> > >
> > > IB_FLAG_ERROR is defined but never used/checked. Of course,
> > > neither
> > > the
> > > kernel nor this utility will set that flag. We should probably
> > > decide
> > > whether it makes sense to do so.
> >
> > True, will remove.
>
> Well, hold on. This flag exists in the spec, so at the very least we
> should consider where it should be set, and what to do if it /is/ set.
> Let's not just remove it.
I said that because we know currently in the kernel we have no way of
setting it. I think the first thing we'd need to do is figure out what
cases it will be set under.. And then figure out what the checker can do
about it.
>
> > > I find it odd that the caller has to pass in the offset 4k before
> > > the
> > > start of the info block. It seems natural for the first info
> > > block,
> > > but
> > > for the backup, it's totally bogus. ;-)
> >
> > I agree, and I hated it too :)
> > But I wanted to keep the +4K addition in the lowest level
> > functions..
> > Have a better idea? :) This does kinda kill the idea of "oh the 4k
> > offsetting will be completely hidden from the higher level logic"
> >
> > The opposite of this would be sprinkling a number of +4K everywhere
> > else, and I think with this we only have to do the inverse in a
> > couple
> > of places..
>
> I guess I'd have to see the result of inverting the roles. Feel free
> to
> hold off on this one, I don't think it's critical.
>
> > > Again, z and e are unused. Nuke 'em.
> >
> > True, I just copied these over from the kernel implementation, and
> > wasn't sure if I'd ever be using z and e. Currently the kernel
> > doesn't
> > even set those, so it is a bit redundant. If the kernel does grow
> > the
> > ability to use them in future, there could be use for these, but I
> > suppose I should remove the dead code until that happens :)
>
> Every single state of those two bits is valid. How would you ever do
> something useful with them?
Agreed, I don't see a scenario where the checker would attempt to modify
the bits.
>
> > > Should we perform a load of the info block after the msync to
> > > ensure
> > > it
> > > was written (to trigger an MCE if one is going to happen)? That
> > > same
> > > question could be asked for each place where an update is made.
> >
> > Good question. I guess the probability of those locations having a
> > bad
> > cell is low because we did just read them to find out there is a
> > problem, and subsequently writing them should have a low probability
> > of
> > triggering a CMCI. But re-reading to make sure it was correctly
> > written
> > is not a bad idea. Dan, thoughts?
>
> This one's not critical. But it's a checker tool, so it might be nice
> to be paranoid.
>
> > > This isn't an error. We expect this to happen if power is
> > > suddenly
> > > lost, and the format is built to recover from it. If you log it
> > > at
> > > all, I would make it an informational message.
> > >
> > > I wonder whether we should even fix it. I worry about having the
> > > same
> > > recovery code in multiple places (the kernel and here). What do
> > > you
> > > think?
> >
> > Hm, I agree about the message, in that it is routine. But I added
> > the
> > repairing so that a user may just decide to do a check using this
> > after
> > a power failure, and we can easily fix it when we have the chance. I
> > do
> > agree about the check/correction in multiple places, I wouldn't be
> > too
> > opposed to removing it entirely from here.
>
> If you remove it, put a big comment in there. :) I'll leave the
> ultimate decision up to you. It's not complex recovery, so we
> *should*
> be okay to have that logic in multiple places...
I can remove it, this makes the entire function (btt_check_log_map)
disappear, and I'll put a comment in the main checker loop saying that
this is a normal condition.
>
> > > > + /*
> > > > + * Attempt 3: use info2off as-is, and check if we find
> > > > a
> > > > valid info
> > > > + * block at that location.
> > > > + */
> > > > + offset = le32toh(btt_sb[0].info2off);
> > > > + if (offset) {
> > >
> > > Maybe do some bounds checking on offset?
> >
> > If the offset is bogus either the read will fail or the verify will,
> > but
> > yes I could add checking too.
>
> Again, not critical. I'll leave it up to you.
>
> > > > + printf("Attempting to recover info-block from
> > > > info2
> > > > offset (as-is) %#llx\n", offset);
> > > > + rc = btt_read_info(bttc, &btt_sb[1], offset);
> > > > + if (rc)
> > > > + goto out;
> > > > + rc = btt_info_verify(bttc, &btt_sb[1]);
> > > > + if (rc == 0) {
> > > > + rc = btt_write_info(bttc, &btt_sb[1],
> > > > 0);
> > > > + goto out;
> > > > + }
> > > > + } else
> > > > + rc = -ENXIO;
> > >
> > > You could also look for the primary info blocks for the other
> > > arenas.
> >
> > That seems kind of difficult as we don't really have any information
> > on
> > where the other arenas might be.
> > We could do a full calculation of "this is where I would lay out the
> > arenas if I had this rawsize", but I was hoping to avoid doing
> > that..
>
> OK. If we get to here, things are pretty well screwed. Feel free to
> ignore that suggestion.
>
> > > > + out:
> > > > + free(btt_sb);
> > > > + return rc;
> > > > +}
> > > > +
> > > > +static int namespace_check(struct ndctl_region *region,
> > > > + struct ndctl_namespace *ndns)
> > > > +{
> > >
> > > namespace_check sounds generic, but it really only works for the
> > > btt.
> > > Also, should we verify that the namespace was configured with a
> > > btt
> > > before even attempting to check it?
> >
> > True. I did think of checking if this is a btt namespace before
> > checking, but I think if the namespace is disabled, my understanding
> > is
> > ndctl and libndctl have no idea whether it is a btt namespace or
> > not.
> > The BTT is discovered by the kernel when doing it's probe, and once
> > it
> > is up, ndctl can see that, but until then, it can really only find
> > out
> > if there is a valid BTT by trying to read the metadata itself.
>
> Yeah. At some point I guess we have to trust the admin not to try to
> fix the btt on a non-btt device. However, if we allowed them to run
> the
> command on a pmemNs device, that would reduce the opportunities for
> failure.
I'm liking the --force option for this (running on a 'live' namespace),
and even if someone does run it on a non-BTT namespace, we fail quickly
enough with a helpful message..
>
> > > > + raw_mode = ndctl_namespace_get_raw_mode(ndns);
> > > > + ndctl_namespace_set_raw_mode(ndns, 1);
> > > > + rc = ndctl_namespace_enable(ndns);
> > > > + if (rc != 0)
> > > > + error("%s: failed to enable raw mode\n",
> > > > devname);
> > > > + ndctl_namespace_set_raw_mode(ndns, raw_mode);
> > >
> > > What the heck do the above 6 lines do? Please explain.
> >
> > Linda had the same question so I suppose I should at least have a
> > comment there :)
> > But the idea is we get the existing mode (raw or not), and save it
> > in
> > raw_mode. Then try to enable it with raw_mode = 1. If it fails,
> > restore
> > the saved raw_mode.
>
> I had more basic questions, like what the heck is going on? :-) I've
> done some more reading, and this is what I think is happening.
>
> /* In typical usage, the current raw_mode should be false. */
> raw_mode = ndctl_namespace_get_raw_mode(ndns);
> /*
> * Putting the namespace into raw mode will allow us to access
> * the btt metadata.
> */
> ndctl_namespace_set_raw_mode(ndns, 1);
> /*
> * Now enable the namespace. This will result in a pmem device
> * node showing up in /dev that is in raw mode.
> */
> rc = ndctl_namespace_enable(ndns);
> if (rc != 0)
> error("%s: failed to enable raw mode\n", devname);
> sprintf(path, "/dev/%s",
> ndctl_namespace_get_block_device(ndns));
>
> And move the resetting of raw_mode down into the out: label. That
> will
> go a long way towards clearing up the confusion I had when reading the
> code for the first time. I think you should also mention that a
> pre-condition for this function is that the namespace is disabled, and
> you should check for it and provide a useful error message if it's
> already enabled.
Yes your interpretation is correct, and I will steal some of the above
comments and put them in :)
>
> > Maybe this is a bit pointless? But we should try and keep things in
> > the
> > same state the user had, right? :)
>
> I definitely agree we should keep things in the same state. I really
> didn't have a good grasp of what was going on there, and comments
> would
> have saved me some time digging through both ndctl and kernel code.
>
> Cheers,
> Jeff
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
next prev parent reply other threads:[~2017-01-24 20:59 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-01-20 3:12 [PATCH] ndctl: add a BTT check utility Vishal Verma
2017-01-20 20:04 ` Dan Williams
2017-01-23 17:42 ` Ross Zwisler
2017-01-23 17:49 ` Dan Williams
2017-01-20 21:00 ` Linda Knippers
2017-01-24 22:58 ` Verma, Vishal L
[not found] ` <20170120031234.24226-1-vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-01-20 23:14 ` Jeff Moyer
[not found] ` <x49lgu52uk0.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2017-01-21 0:41 ` Verma, Vishal L
[not found] ` <1484959224.4857.6.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-01-23 18:20 ` Jeff Moyer
[not found] ` <x49vat5ljtz.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2017-01-24 20:59 ` Verma, Vishal L [this message]
2017-01-31 0:12 ` Verma, Vishal L
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1485291510.4857.8.camel@intel.com \
--to=vishal.l.verma-ral2jqcrhueavxtiumwx3w@public.gmane.org \
--cc=jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.