From: Jeff Garzik <jeff@garzik.org>
To: Pete Zaitcev <zaitcev@redhat.com>
Cc: Project Hail List <hail-devel@vger.kernel.org>
Subject: Re: Design challenges in chunkd self-checking
Date: Tue, 22 Dec 2009 17:43:58 -0500
Message-ID: <4B314BAE.4010805@garzik.org>
In-Reply-To: <20091222144111.789a5b91@redhat.com>

On 12/22/2009 04:41 PM, Pete Zaitcev wrote:
> I'm looking into adding self-checking to chunkd. This involves basically
> a process that re-reads everything stored in the chunkserver and verifies
> that it's still ok. Nothing can be simpler, right?
>
> So, current problems for which I'd like input are:
>
>   - Scheduling and deconflicting with normal operation.
>
>     Run "genisofs" in your Fedora desktop and your Firefox is DEAD.
>     It is also the reason why everyone does rpm -e mlocate the first thing
>     after the installation. The effect of massive data access blowing
>     away caches is very drastic in a regular Linux.
>     So, I have to have a good way to keep self-checking from interfering
>     with normal service of a chunkserver.
>     Also, need to save power instead of burning it on re-reading data.

The problem seems to revolve around two variables:

* last-checked time.  You wouldn't want to check a single individual 
object more than once every N hours|days|weeks.

* maximum bytes-per-second.  You wouldn't want to exceed a useful bound 
for throughput.

Perhaps the latter variable could be calculated by observing disk 
throughput over time, in conjunction with the number of objects and 
their sizes, resulting in an idea of the total time required to check 
the entire dataset.
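
To make the two variables concrete, here is a minimal sketch in Python 
(not chunkd code).  The names estimate_full_scan_seconds, due_for_check 
and pace are made up for illustration, and the (key, size, last_checked) 
tuple layout is an assumption, not anything chunkd stores today.

```python
def estimate_full_scan_seconds(object_sizes, max_bytes_per_sec):
    """Rough time to re-read every object once at the throughput bound."""
    return sum(object_sizes) / max_bytes_per_sec


def due_for_check(objects, now, min_interval_sec):
    """Pick objects not checked within the last min_interval_sec seconds.

    objects: iterable of (key, size, last_checked) tuples (hypothetical layout).
    Returns the due objects, least recently checked first.
    """
    due = [o for o in objects if now - o[2] >= min_interval_sec]
    due.sort(key=lambda o: o[2])
    return due


def pace(bytes_read, started, now, max_bytes_per_sec):
    """Seconds to sleep so average throughput since 'started' stays in bound."""
    earliest_allowed = started + bytes_read / max_bytes_per_sec
    return max(0.0, earliest_allowed - now)
```

A checker loop would call due_for_check() to pick work, read each 
object, and sleep for pace() seconds between reads.  If 
estimate_full_scan_seconds() comes out longer than the N hours|days|weeks 
interval, the bound is too low for the dataset.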

And if we start keeping data like this, we might want to move metadata 
from the beginning of each object to a TC database.  That might speed up 
fs_list_objs and a couple of other operations, too.


>   - Consistency.
>
>     Returning wrong checksums for an object that is being updated may
>     lead to us deciding to drop a perfectly good object, which is
>     unacceptable (especially when redundancy is impaired already).
>     So, I need some kind of locking, or logging, or invalidation...

It is normal and reasonable to maintain global information about all 
in-progress operations.  Caching systems do that, for example, to ensure 
multiple cache requests for object A do not initiate multiple 
simultaneous back-end requests for object A.

For the purposes of verification, I would just skip objects that are 
actively being written to.  Those are, by definition, probably too new 
to need verification anyway.
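
A rough sketch of that idea, in Python rather than chunkd's C: a global 
table of in-progress writes that the checker consults before touching an 
object.  The InFlight class and the verifiable() helper are hypothetical 
names for illustration only.

```python
import threading


class InFlight:
    """Global record of objects that currently have a write in progress."""

    def __init__(self):
        self._lock = threading.Lock()
        self._writers = {}  # key -> number of active writers

    def begin_write(self, key):
        with self._lock:
            self._writers[key] = self._writers.get(key, 0) + 1

    def end_write(self, key):
        with self._lock:
            remaining = self._writers.get(key, 0) - 1
            if remaining > 0:
                self._writers[key] = remaining
            else:
                self._writers.pop(key, None)

    def is_writing(self, key):
        with self._lock:
            return key in self._writers


def verifiable(keys, inflight):
    """Keep only the keys that are not being written right now."""
    return [k for k in keys if not inflight.is_writing(k)]
```

Note this only filters at selection time.  A write that starts while the 
checker is mid-read would still need to be noticed, for example by 
re-checking is_writing() before acting on a checksum mismatch.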

BTW, in case this is helpful, chunkd's backend writes a zeroed metadata 
header to the beginning of each object.  The metadata header is only 
updated with "real" values after the final data byte is written.

	Jeff

