All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nick Dokos <nicholas.dokos@hp.com>
To: Andreas Dilger <adilger@sun.com>
Cc: Valerie Aurora <vaurora@redhat.com>, linux-ext4@vger.kernel.org
Cc: nicholas.dokos@hp.com
Subject: Re: ll_ver_fs data verification failure - 96TB fs
Date: Thu, 06 Aug 2009 18:01:04 -0400	[thread overview]
Message-ID: <18945.1249596064@alphaville.usa.hp.com> (raw)
In-Reply-To: Message from Nick Dokos <nicholas.dokos@hp.com> of "Thu, 06 Aug 2009 17:43:00 EDT." <18822.1249594980@alphaville.usa.hp.com>

Nick Dokos <nicholas.dokos@hp.com> wrote:

> Nick Dokos <nicholas.dokos@hp.com> wrote:
> 
> > > On Aug 06, 2009  16:37 -0400, Nick Dokos wrote:
> > > > I did that to begin with but the problem turns out to be much more
> > > > mundane: there was an IO error on one of the volumes. It wasn't quite
> > > > obvious (no red lights going off) but there *was* a message in
> > > > /var/log/messages - unfortunately I missed it. I eventually recreated
> > > > the error by trying to read the file with ``od -c'' and then went back
> > > > and found the original error. I don't know why/how ll_ver_fs managed to
> > > > read the offset and come up with a 1M difference[1] -- ``od -c'' failed with
> > > > a big thud.
> > > 
> > > Can you have a look at the error handling in ll_ver_fs at that point?
> > > It seems that it might just have re-used the previous 1MB buffer, but
> > > didn't detect/report the error from the read, which would itself be bad.
> > > 
> > 
> > It looks right to me:
> > 
> > ,----
> > |               ...
> > | 		if (read(fd, chunk_buf, chunksize) < 0) {
> > | 			fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
> > | 				progname, file, offset, strerror(errno));
> > | 			return 1;
> > | 		}
> > | 		if (verify_chunk(chunk_buf, chunksize, offset, time_st,
> > | 				 inode_st, file) != 0)
> > | 			return 1;
> > |               ...
> > `----
> > 
> > The read() should have failed (and I should have gotten a different error
> > message) but somehow it didn't - instead, verify_chunk() was called and
> > *that* detected the mismatch.
> > 
> 
> The offset starts at 0 and is marched along at a 1MB stride, so it's
> conceivable that the read() returned 0 (avoiding the error return) and
> did not touch chunk_buf (thereby leading to the verify_chunk() failure).
> 
> That would be consistent with what I got, but if so, it would be a kernel
> bug somewhere in the read() calling sequence, right?
> 
> Thanks,
> Nick
> 
> PS. For the record, this is the error from /var/log/messages:
> 
> ,----
> | ...
> | Aug  5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Unhandled sense code
> | Aug  5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> | Aug  5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Sense Key : Medium Error [current] 
> | Aug  5 18:14:11 shifter kernel: sd 10:0:7:3: [sdn] Add. Sense: No additional sense information
> | Aug  5 18:14:11 shifter kernel: end_request: I/O error, dev sdn, sector 2238819328
> | Aug  5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Unhandled sense code
> | Aug  5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> | Aug  5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Sense Key : Medium Error [current] 
> | Aug  5 18:16:03 shifter kernel: sd 10:0:7:3: [sdn] Add. Sense: No additional sense information
> | Aug  5 18:16:03 shifter kernel: end_request: I/O error, dev sdn, sector 2238819592
> `----
> 

Jim Owens pointed out to me that the code doesn't deal correctly with
short reads.  The verification information is at the beginning of every
4kB block, so the sector error causes a short read with exactly the
symptoms I see.  The next read() would get the error.

The code should  perhaps read something like this:

 		if ((nread = read(fd, chunk_buf, chunksize)) < 0) {
 			fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
 				progname, file, offset, strerror(errno));
 			return 1;
 		}
                if (nread < chunksize) {
                        fprintf(stderr, "short read etc");
                        /* or force the next read() to check for errors? */
                        return 1;
                }
 		if (verify_chunk(chunk_buf, chunksize, offset, time_st,
 				 inode_st, file) != 0)
 			return 1;

Thanks,
Nick

  reply	other threads:[~2009-08-06 22:01 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-08-03 13:54 ll_ver_fs data verification failure - 96TB fs Nick Dokos
2009-08-03 14:22 ` Nick Dokos
2009-08-06 20:04 ` Valerie Aurora
2009-08-06 20:37   ` Nick Dokos
2009-08-06 20:50     ` Andreas Dilger
2009-08-06 21:28       ` Nick Dokos
2009-08-06 21:43         ` Nick Dokos
2009-08-06 22:01           ` Nick Dokos [this message]
2009-08-06 22:54             ` Nick Dokos
2009-08-06 22:19         ` Andreas Dilger
2009-08-06 20:36 ` Valerie Aurora

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=18945.1249596064@alphaville.usa.hp.com \
    --to=nicholas.dokos@hp.com \
    --cc=adilger@sun.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=vaurora@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.