From: "Verma, Vishal L" <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: "jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org"
<jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org"
<linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org>
Subject: Re: [ndctl PATCH v2 2/3] ndctl: add a BTT check utility
Date: Tue, 28 Feb 2017 22:31:17 +0000 [thread overview]
Message-ID: <1488321002.4873.17.camel@intel.com> (raw)
In-Reply-To: <x49a89610mr.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
On Tue, 2017-02-28 at 16:11 -0500, Jeff Moyer wrote:
> Hi, Vishal,
>
> Vishal Verma <vishal.l.verma@intel.com> writes:
>
> > Add the check-namespace command to ndctl. This will check the BTT
> > metadata layout for the given namespace, and if requested, correct
> > any
> > errors found. Not all metadata corruption is detectable or fixable.
>
> Thanks for the nice changelog in the 0th message, but it would be nice
> to include relevant entries in the individual patches.
Ok, will do.
>
> - Differentiate the use off BTT_PG_SIZE, sysconf(_SC_PAGESIZE), and
> SZ_4K
> (for the fixed start offset) in the different places they're used
> (Jeff)
>
> I think you could go even further with this. For example, you coud
> use
> BTT_START_OFFSET for the initial offset from the beginning of the raw
> namespace to the first btt info block. Using the page size for the
> size
> of the info block is also less informative than it could be. Instead,
> consider adding BTT_INFO_SIZE. After that, all of your math starts to
> look a bit more readable to my eye.
I thought of something along these lines, but ended up not doing this to
stay consistent with the 'BTT_PG_SIZE' usage in the kernel. But maybe
the kernel could use some cleaning up in this regard :)
>
> There's no support for parsing the badblocks list or handling
> SIGBUS/SIGSEGV. I think that's something that should be added, but it
> could be done in a follow-up patch.
Yep, I'll send follow on patches for this.
>
> You go over 80 columns all over the place. Would you mind fixing that
> up?
I think I only do that for strings, which I thought was acceptable, but
I'll do a sweep and make sure it is only for strings. I think checkpatch
will in fact produce a warning for split strings..
>
> More comments inlined below.
>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> > Cc: Linda Knippers <linda.knippers@hpe.com>
> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> > ---
> > Documentation/Makefile.am | 1 +
> > Documentation/ndctl-check-namespace.txt | 64 +++
> > Documentation/ndctl.txt | 1 +
> > Makefile.am | 4 +-
> > builtin.h | 1 +
> > ccan/bitmap/LICENSE | 1 +
> > ccan/bitmap/bitmap.c | 125 +++++
> > ccan/bitmap/bitmap.h | 243 +++++++++
>
> I agree with Dan, please split out the bitmap code to a separate
> patch.
Yep, already have that queued.
>
> > +/**
> > + * btt_write_info - write an info block to the given offset
> > + * @bttc: the main btt_chk structure for this btt
> > + * @btt_sb: struct btt_sb where the info block will be
> > copied from
> > + * @offset: offset in the raw namespace to write the info
> > block to
> > + *
> > + * The initial 4K offset will get added in by this routine. This
> > will
>
> Still not a fan of this. :)
I don't know if I'm seeing a cleaner way out... :)
>
> > + * also use 'pwrite' to write the info block, and not mmap+stores
> > as this
> > + * is used before the mappings are set up.
> > + */
> > +static int btt_write_info(struct btt_chk *bttc, struct btt_sb
> > *btt_sb, u64 off)
> > +{
> > + ssize_t size;
> > + int rc = 0;
> > +
> > + if (!bttc->opts->repair) {
> > + error("BTT info block at offset %#lx needs to be
> > restored\n", off);
> > + repair_msg();
> > + return -1;
> > + }
> > + printf("Restoring BTT info block at offset %#lx\n", off);
>
> The offsets you print are off by 4k. You do that in just about every
> place you print out the offset of the btt info.
Off by 4k due to the 4k initial offset right? I think we should print
the raw namespace offsets, including the initial 4k every time. Let em
make sure that is the case wherever we print them, and if it isn't then
I'll fix those.
>
> > +
> > + size = pwrite(bttc->fd, btt_sb, sizeof(*btt_sb), off +
> > SZ_4K);
> > + if (size != sizeof(*btt_sb)) {
> > + error("unable to write the info block: %s\n",
> > strerror(errno));
> > + rc = -errno;
> > + }
>
> If you get a short write, errno will not be valid. You could very
> well
> return success from this routine when it should have failed.
Ah, I think I fixed it for the read case, but neglected the write.. Will
fix.
>
> I think you're missing an fsync() here.
Good catch, I'll add it.
>
> > + return rc;
> > +}
> > +
> > +/**
> > + * btt_copy_to_info2 - restore the backup info block using the main
> > one
> > + * @a: the arena_info handle for this arena
> > + *
> > + * Called when a corrupted backup info block is detected. Copies
> > the
> > + * main info block over to the backup location. This is done using
> > + * mmap + stores, and thus needs a msync.
> > + */
> > +static int btt_copy_to_info2(struct arena_info *a)
> > +{
> > + void *ms_align;
> > + size_t ms_size;
> > +
> > + if (!a->bttc->opts->repair) {
> > + error("Arena %d: BTT info2 needs to be restored\n",
> > a->num);
> > + return repair_msg();
> > + }
> > + printf("Arena %d: Restoring BTT info2\n", a->num);
> > + memcpy(a->map.info2, a->map.info, BTT_PG_SIZE);
> > +
> > + ms_align = (void *)rounddown((u64)a->map.info2,
> > + sysconf(_SC_PAGESIZE));
> > + ms_size = max(BTT_PG_SIZE, sysconf(_SC_PAGESIZE));
>
> You call sysconf 4 times. You might consider storing the result in a
> global.
Yep, will do.
>
> [snip]
>
> > +/* Check that each flog entry has the correct corresponding map
> > entry */
> > +static int btt_check_log_map(struct arena_info *a)
> > +{
> > + unsigned int i;
> > + u32 mapping;
> > + int rc = 0;
> > +
> > + for (i = 0; i < a->nfree; i++) {
> > + struct log_entry log;
> > +
> > + rc = btt_log_read(a, i, &log);
> > + if (rc)
> > + return rc;
> > + mapping = btt_map_lookup(a, log.lba);
> > +
> > + /*
> > + * Case where the flog was written, but map
> > couldn't be updated.
> > + * The kernel should also be able to detect and fix
> > this condition
> > + */
> > + if (log.new_map != mapping && log.old_map ==
> > mapping) {
> > + inform("arena %d: log[%d].new_map (%#x)
> > doesn't match map[%#x] (%#x)",
> > + a->num, i, log.new_map, log.lba,
> > mapping);
> > + rc = btt_map_write(a, log.lba,
> > log.new_map);
> > + if (rc)
> > + return BTT_LOG_MAP_ERR;
>
> What do you think about storing the return code but continuing to loop
> through the rest of the log?
Yep can do. I've been defaulting to exit on first error, but I can see
where it makes sense to keep going and report multiple errors, and do
that.
>
> [snip]
>
> > +static int btt_check_bitmap(struct arena_info *a)
> > +{
> > + bitmap_t *bm;
> > + u32 i, mapping;
> > + int rc;
> > +
> > + bm = bitmap_alloc0(a->internal_nlba);
> > + if (bm == NULL)
> > + return -ENOMEM;
> > +
> > + /* map 'external_nlba' number of map entries */
> > + for (i = 0; i < a->external_nlba; i++) {
> > + mapping = btt_map_lookup(a, i);
> > + if (bitmap_test_bit(bm, mapping)) {
> > + inform("arena %d: internal block %#x is
> > referenced by two map entries",
> > + a->num, mapping);
> > + rc = BTT_BITMAP_ERROR;
> > + goto out;
> > + }
> > + bitmap_set_bit(bm, mapping);
> > + }
> > +
> > + /* map 'nfree' number of flog entries */
> > + for (i = 0; i < a->nfree; i++) {
> > + struct log_entry log;
> > +
> > + rc = btt_log_read(a, i, &log);
> > + if (rc)
> > + goto out;
> > + if (bitmap_test_bit(bm, log.old_map)) {
> > + inform("arena %d: internal block %#x is
> > referenced by two map/log entries",
> > + a->num, log.old_map);
> > + rc = BTT_BITMAP_ERROR;
> > + goto out;
> > + }
> > + bitmap_set_bit(bm, log.old_map);
> > + }
> > +
> > + /* check that the bitmap is full */
> > + if (!bitmap_full(bm, a->internal_nlba))
> > + rc = BTT_BITMAP_ERROR;
>
> This seems like a lost opportunity. If a block isn't referenced,
> shouldn't we link it back in?
I don't know if we have enough information to do that?
If a block is referenced by two map entries, and another block is not
referenced at all, we can guess that one of the map entries should
actually point to that unreferenced block (and even that is
speculation), but we still don't know which of the two entries
referencing the same block to 'move'..
>
> [snip]
>
> > +static int btt_create_mappings(struct btt_chk *bttc)
> > +{
> > + struct arena_info *a;
> > + int mmap_flags;
> > + int i;
> > +
> > + if (!bttc->opts->repair)
> > + mmap_flags = PROT_READ;
> > + else
> > + mmap_flags = PROT_READ|PROT_WRITE;
> > +
> > + for (i = 0; i < bttc->num_arenas; i++) {
> > + a = &bttc->arena[i];
> > + a->map.info_len = BTT_PG_SIZE;
> > + a->map.info = mmap(NULL, a->map.info_len,
> > mmap_flags,
> > + MAP_SHARED, bttc->fd, a->infooff + SZ_4K);
>
> Is the offset in the mmap call correct for arenas other than the
> first?
Yes.. sb->infooff is relative to that arena, but when we populate
arena->infooff in btt_parse_meta(), we do:
arena->infooff = arena_off;
Where arena_off is passed in as the raw offset for the start of the
arena.
This makes all arena->*off 'absolute', i.e. based on the raw namespace,
and the mmap calls can just use that offset directly. It is a bit
confusing, and I'll add a comment explaining this, but I believe this is
also what we do in the kernel :)
>
> Cheers,
> Jeff
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
next prev parent reply other threads:[~2017-02-28 22:31 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-22 22:47 [ndctl PATCH v2 0/3] Add ndctl check-namespace Vishal Verma
2017-02-22 22:47 ` [ndctl PATCH v2 1/3] ndctl: move the fletcher64 routine to util/ Vishal Verma
2017-02-22 22:47 ` [ndctl PATCH v2 2/3] ndctl: add a BTT check utility Vishal Verma
2017-02-24 1:14 ` Dan Williams
2017-02-28 21:11 ` Jeff Moyer
[not found] ` <x49a89610mr.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2017-02-28 22:31 ` Verma, Vishal L [this message]
[not found] ` <1488321002.4873.17.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-03-01 15:02 ` Jeff Moyer
[not found] ` <x49inntt4z6.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2017-03-01 15:54 ` Dan Williams
[not found] ` <CAPcyv4g+UfCR2x3KwWB+yu6fRh5UcVRvdAdrx3ri8bpJx+r8YQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-01 20:27 ` Verma, Vishal L
2017-03-01 20:26 ` Verma, Vishal L
2017-02-22 22:47 ` [ndctl PATCH v2 3/3] ndctl, test: Add a unit test for the BTT checker Vishal Verma
[not found] ` <20170222224724.7696-4-vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-02-28 21:19 ` Jeff Moyer
[not found] ` <x494lze10ae.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2017-02-28 21:49 ` Verma, Vishal L
[not found] ` <1488318514.4873.12.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-02-28 21:58 ` Jeff Moyer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1488321002.4873.17.camel@intel.com \
--to=vishal.l.verma-ral2jqcrhueavxtiumwx3w@public.gmane.org \
--cc=jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.