Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Phil Turmel @ 2016-11-14 16:01 UTC (permalink / raw)
  To: Bruce Merry; +Cc: Wols Lists, linux-raid
In-Reply-To: <CAHy4j_6q2mmQ_y-xkh1Zx6CrP1vByQMadCXLJwibeeO85T3JgQ@mail.gmail.com>

Hi Bruce,

On 11/14/2016 01:50 AM, Bruce Merry wrote:

> Okay, I'll give that script a go to increase my kernel timeout. If I
> understand correctly, it's not the end of the world if the drive
> doesn't support SCTERC, provided I have a long kernel timeout (and
> when things go wrong it might take much longer to recover, but I can
> live with that). Is that correct?

Yes.

>> 2) Trim.  Well-behaved drive firmware guarantees zeros for trimmed
>> sectors, but many drives return random data instead.  Zing, mismatches.
>> It's often unhelpful with encrypted volumes, as even well-behaved
>> firmware can't deliver zeroed sectors *inside* the encryption.
> 
> Weee, sounds like fun. I hope it's that. Is there any way to tell
> which blocks mismatch, so that I can tell which files are in trouble
> (assuming I can figure out how to map through LVM, LUKS and
> debuge2fs).

The check operation doesn't log the sector addresses, unfortunately.  At
least I don't see any such operation in the code that increments
mismatch count.  Not even a tracepoint.  Hmmm.

In the meantime, run a "repair" scrub instead of a "check" scrub to
affirmatively force no mismatches.  (Writes first member of mirrors to
the others.)

Phil

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-14 16:03 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid
In-Reply-To: <5829DF1F.7030109@youngman.org.uk>

On 14 November 2016 at 17:58, Wols Lists <antlists@youngman.org.uk> wrote:
> On 14/11/16 15:52, Bruce Merry wrote:
>> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>>> > Sounds like that drive could need replacing. I'd get a new drive and do
>>> > that as soon as possible - use the --replace option of mdadm - don't
>>> > fail the old drive and add the new.
>> Would you mind explaining why I should use --replace instead of taking
>> out the suspect drive? I guess I lose redundancy for any writes that
>> occur while the rebuild is happening, but I'd plan to do this with the
>> filesystem unmounted so there wouldn't be any writes.
>
> Because a replace will copy from the old drive to the new, recovering
> any failures from the rest of the array. A fail-and-add will have to
> rebuild the entire new array from what's left of the old, stressing the
> old array much more.

Okay, I can see how for RAID5 that might be a bad thing.

In my case however, it sounds like --replace will copy everything from
the failing drive, whereas I'd rather it copied everything from the
good drive. Same stress on the array, less chance of copying dodgy
data.

Bruce
-- 
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Wols Lists @ 2016-11-14 16:09 UTC (permalink / raw)
  To: Bruce Merry; +Cc: linux-raid
In-Reply-To: <CAHy4j_5gROfjf42VwLxDfmdc6rKYdrWoBy6u2kDefx_Yzzn9kA@mail.gmail.com>

On 14/11/16 16:03, Bruce Merry wrote:
> On 14 November 2016 at 17:58, Wols Lists <antlists@youngman.org.uk> wrote:
>> On 14/11/16 15:52, Bruce Merry wrote:
>>> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>>>>> Sounds like that drive could need replacing. I'd get a new drive and do
>>>>> that as soon as possible - use the --replace option of mdadm - don't
>>>>> fail the old drive and add the new.
>>> Would you mind explaining why I should use --replace instead of taking
>>> out the suspect drive? I guess I lose redundancy for any writes that
>>> occur while the rebuild is happening, but I'd plan to do this with the
>>> filesystem unmounted so there wouldn't be any writes.
>>
>> Because a replace will copy from the old drive to the new, recovering
>> any failures from the rest of the array. A fail-and-add will have to
>> rebuild the entire new array from what's left of the old, stressing the
>> old array much more.
> 
> Okay, I can see how for RAID5 that might be a bad thing.
> 
> In my case however, it sounds like --replace will copy everything from
> the failing drive, whereas I'd rather it copied everything from the
> good drive. Same stress on the array, less chance of copying dodgy
> data.
> 
So long as the data on the drive is correct (it should be) and the drive
reports a fault where it can't read it, it'll only copy good data off
the bad drive. It'll copy it from the other drive if it's dud.

Cheers,
Wol


^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-14 16:09 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, linux-raid
In-Reply-To: <1c3c5ce1-a26e-878d-9863-399efe6825e1@turmel.org>

On 14 November 2016 at 18:01, Phil Turmel <philip@turmel.org> wrote:
> In the meantime, run a "repair" scrub instead of a "check" scrub to
> affirmatively force no mismatches.  (Writes first member of mirrors to
> the others.)

I think that's good news. I wasn't sure which direction "repair" would
copy, but the good drive is the first member ("Device Role : Active
device 0"). I'll give that a go once the current scrub is done.

Thanks
Bruce
-- 
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Phil Turmel @ 2016-11-14 16:10 UTC (permalink / raw)
  To: Bruce Merry, Wols Lists; +Cc: linux-raid
In-Reply-To: <CAHy4j_5gROfjf42VwLxDfmdc6rKYdrWoBy6u2kDefx_Yzzn9kA@mail.gmail.com>

On 11/14/2016 11:03 AM, Bruce Merry wrote:
> On 14 November 2016 at 17:58, Wols Lists <antlists@youngman.org.uk> wrote:
>> On 14/11/16 15:52, Bruce Merry wrote:
>>> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>>>>> Sounds like that drive could need replacing. I'd get a new drive and do
>>>>> that as soon as possible - use the --replace option of mdadm - don't
>>>>> fail the old drive and add the new.
>>> Would you mind explaining why I should use --replace instead of taking
>>> out the suspect drive? I guess I lose redundancy for any writes that
>>> occur while the rebuild is happening, but I'd plan to do this with the
>>> filesystem unmounted so there wouldn't be any writes.
>>
>> Because a replace will copy from the old drive to the new, recovering
>> any failures from the rest of the array. A fail-and-add will have to
>> rebuild the entire new array from what's left of the old, stressing the
>> old array much more.

I entirely endorse Anthony's advice on this one.  You are at great risk
of not completing a fail/add resync with the new drive.

> Okay, I can see how for RAID5 that might be a bad thing.
> 
> In my case however, it sounds like --replace will copy everything from
> the failing drive, whereas I'd rather it copied everything from the
> good drive. Same stress on the array, less chance of copying dodgy
> data.

You simply don't have that choice, sorry.  And drives returning dodgy
data is ungodly rare.  The sector checksum algorithms are that good.
You have a URE crisis in your array that is far more significant.

Phil

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Phil Turmel @ 2016-11-14 16:14 UTC (permalink / raw)
  To: Bruce Merry; +Cc: Wols Lists, linux-raid
In-Reply-To: <CAHy4j_7gwMmeRhQo2qvgMpOu2-m=BN_=VKbBXZc413PCyWSJuw@mail.gmail.com>

On 11/14/2016 11:09 AM, Bruce Merry wrote:
> On 14 November 2016 at 18:01, Phil Turmel <philip@turmel.org> wrote:
>> In the meantime, run a "repair" scrub instead of a "check" scrub to
>> affirmatively force no mismatches.  (Writes first member of mirrors to
>> the others.)
> 
> I think that's good news. I wasn't sure which direction "repair" would
> copy, but the good drive is the first member ("Device Role : Active
> device 0"). I'll give that a go once the current scrub is done.

Direction is truly immaterial, as any bad sectors on device 0 will be
reconstructed from the other.  Neil Brown has a blog entry on this topic
that applies here:

http://neil.brown.name/blog/20100211050355

Phil


^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: Shaohua Li @ 2016-11-15  0:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, linux-block, neilb
In-Reply-To: <20161112174238.GA11518@infradead.org>

On Sat, Nov 12, 2016 at 09:42:38AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
> > > It's mostly about the RAID1 and RAID10 code which does a lot of funny
> > > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
> > > drivers don't touch.  One example is the r1buf_pool_alloc code,
> > > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
> > > case, which would also take care of r1buf_pool_free.  I'm not sure
> > > about all the others cases, as some bits don't fully make sense to me,
> > 
> > The problem is we use the iov_vec to track the pages allocated. We will read
> > data to the pages and write out later for resync. If we add new fields to track
> > the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
> > avoid the tricky parts. This should work for both the resync and writebehind
> > cases.
> 
> I don't think we need to track the pages specificly - if we clone
> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.

Sure, for r1buf_pool_alloc, what you suggested should work well. There are a
lot of other places we are using bi_vcnt/bi_io_vec. I'm not sure if it's easy
to replace them with bio iterator. But having a separate data structue to track
the memory we read/rewite/sync and so on definitively will make things easier.
I'm not saying to add the extra data structure in bio but instead in r1bio.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH 07/12] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Ming Lei @ 2016-11-15  1:01 UTC (permalink / raw)
  To: Jens Axboe, Linux Kernel Mailing List
  Cc: linux-block, Linux FS Devel, Christoph Hellwig, Ming Lei,
	Alasdair Kergon, Mike Snitzer, maintainer:DEVICE-MAPPER (LVM),
	Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1478865957-25252-8-git-send-email-tom.leiming@gmail.com>

On Fri, Nov 11, 2016 at 8:05 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> Firstly we have mature bvec/bio iterator helper for iterate each
> page in one bio, not necessary to reinvent a wheel to do that.
>
> Secondly the coming multipage bvecs requires this patch.
>
> Also add comments about the direct access to bvec table.
>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
>  drivers/md/dm-io.c | 34 ++++++++++++++++++++++++----------
>  1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
> index 0bf1a12e35fe..2ef573c220fc 100644
> --- a/drivers/md/dm-io.c
> +++ b/drivers/md/dm-io.c
> @@ -162,7 +162,10 @@ struct dpages {
>                          struct page **p, unsigned long *len, unsigned *offset);
>         void (*next_page)(struct dpages *dp);
>
> -       unsigned context_u;
> +       union {
> +               unsigned context_u;
> +               struct bvec_iter context_bi;
> +       };
>         void *context_ptr;
>
>         void *vma_invalidate_address;
> @@ -204,25 +207,36 @@ static void list_dp_init(struct dpages *dp, struct page_list *pl, unsigned offse
>  static void bio_get_page(struct dpages *dp, struct page **p,
>                          unsigned long *len, unsigned *offset)
>  {
> -       struct bio_vec *bvec = dp->context_ptr;
> -       *p = bvec->bv_page;
> -       *len = bvec->bv_len - dp->context_u;
> -       *offset = bvec->bv_offset + dp->context_u;
> +       struct bio_vec bv = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
> +                       dp->context_bi);
> +
> +       *p = bv.bv_page;
> +       *len = bv.bv_len;
> +       *offset = bv.bv_offset;
> +
> +       /* avoid to figure out it in bio_next_page() again */
> +       dp->context_bi.bi_sector = (sector_t)bv.bv_len;
>  }
>
>  static void bio_next_page(struct dpages *dp)
>  {
> -       struct bio_vec *bvec = dp->context_ptr;
> -       dp->context_ptr = bvec + 1;
> -       dp->context_u = 0;
> +       unsigned int len = (unsigned int)dp->context_bi.bi_sector;
> +
> +       bvec_iter_advance((struct bio_vec *)dp->context_ptr,
> +                       &dp->context_bi, len);
>  }
>
>  static void bio_dp_init(struct dpages *dp, struct bio *bio)
>  {
>         dp->get_page = bio_get_page;
>         dp->next_page = bio_next_page;
> -       dp->context_ptr = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> -       dp->context_u = bio->bi_iter.bi_bvec_done;
> +
> +       /*
> +        * We just use bvec iterator to retrieve pages, so it is ok to
> +        * access the bvec table directly here
> +        */
> +       dp->context_ptr = bio->bi_io_vec;
> +       dp->context_bi = bio->bi_iter;
>  }

Hi Alasdair, Mike, Christoph and anyone,

Could you give this one a review?

From my test, it just works fine, and I believe it is a good cleanup.

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH 08/12] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
From: Ming Lei @ 2016-11-15  1:03 UTC (permalink / raw)
  To: Jens Axboe, Linux Kernel Mailing List
  Cc: linux-block, Linux FS Devel, Christoph Hellwig, Ming Lei,
	Alasdair Kergon, Mike Snitzer, maintainer:DEVICE-MAPPER (LVM),
	Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1478865957-25252-9-git-send-email-tom.leiming@gmail.com>

On Fri, Nov 11, 2016 at 8:05 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> Avoid to access .bi_vcnt directly, because the bio can be
> splitted from block layer, and .bi_vcnt should never have
> been used here.
>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
>  drivers/md/dm-rq.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index b2a9e2d161e4..d9cc8324e597 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -797,8 +797,13 @@ static void dm_old_request_fn(struct request_queue *q)
>                 if (req_op(rq) != REQ_OP_FLUSH)
>                         pos = blk_rq_pos(rq);
>
> +               /*
> +                * bio can be splitted from block layer, so use
> +                * !bio_multiple_segments instead of 'bio->bi_vcnt == 1'
> +                */
>                 if ((dm_old_request_peeked_before_merge_deadline(md) &&
> -                    md_in_flight(md) && rq->bio && rq->bio->bi_vcnt == 1 &&
> +                    md_in_flight(md) && rq->bio &&
> +                    !bio_multiple_segments(rq->bio) &&
>                      md->last_rq_pos == pos && md->last_rq_rw == rq_data_dir(rq)) ||
>                     (ti->type->busy && ti->type->busy(ti))) {
>                         blk_delay_queue(q, 10);

Hi Alasdair, Mike, Christoph and anyone,

Could you give this one a review?

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: Shaohua Li @ 2016-11-15  1:22 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161110204623.3484694-4-songliubraving@fb.com>

On Thu, Nov 10, 2016 at 12:46:15PM -0800, Song Liu wrote:
> This patch adds state machine for raid5-cache. With log device, the
> raid456 array could operate in two different modes (r5c_journal_mode):
>   - write-back (R5C_MODE_WRITE_BACK)
>   - write-through (R5C_MODE_WRITE_THROUGH)
> 
> Existing code of raid5-cache only has write-through mode. For write-back
> cache, it is necessary to extend the state machine.
> 
> With write-back cache, every stripe could operate in two different
> modes:
>   - caching
>   - writing-out
> 
> In caching mode, the stripe handles writes as:
>   - write to journal
>   - return IO
> 
> In writing-out mode, the stripe behaviors as a stripe in write through
> mode R5C_MODE_WRITE_THROUGH.
> 
> STRIPE_R5C_WRITE_OUT is added to sh->state to differentiate caching and
> writing-out mode.
> 
> When the array is write-through, stripes also go between caching mode
> and writing-out mode.
> 
> Please note: this is a "no-op" patch for raid5-cache write-through
> mode.
> 
> The following detailed explanation is copied from the raid5-cache.c:
> 
> /*
>  * raid5 cache state machine
>  *
>  * With rhe RAID cache, each stripe works in two modes:
>  *      - caching mode
>  *      - writing-out mode

'mode' is quite confusing here. it always remainders me r5c_journal_mode. can
you use state?

> + * Note: when the array is in write-through, each stripe still goes through
> + * caching mode and writing-out mode. In such cases, this function is called
> + * in r5c_handle_stripe_dirtying().
> + */
> +static void r5c_make_stripe_write_out(struct stripe_head *sh)
> +{
> +	struct r5conf *conf = sh->raid_conf;
> +	struct r5l_log *log = conf->log;
> +
> +	if (!log)
> +		return;
> +	WARN_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
> +	set_bit(STRIPE_R5C_WRITE_OUT, &sh->state);
> +}
> +
> +/*
> + * Setting proper flags after writing (or flushing) data and/or parity to the
> + * log device. This is called from r5l_log_endio() or r5l_log_flush_endio().
> + */
> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
> +{
> +	struct r5l_log *log = sh->raid_conf->log;
> +
> +	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
> +		BUG_ON(!test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
> +		/*
> +		 * Set R5_InJournal for parity dev[pd_idx]. This means parity
> +		 * is in the journal. For RAID 6, it is NOT necessary to set
> +		 * the flag for dev[qd_idx], as the two parities are written
> +		 * out together.
> +		 */
> +		set_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);

if this flag is only for pd_idx disk and you are using it to determine the
stripe data is in journal, why not make it a stripe flag?

Thanks,
Shaohua

^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: Ming Lei @ 2016-11-15  1:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Christoph Hellwig,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, linux-block,
	NeilBrown
In-Reply-To: <20161115001357.sb5242j5t7bsurjd@kernel.org>

On Tue, Nov 15, 2016 at 8:13 AM, Shaohua Li <shli@kernel.org> wrote:
> On Sat, Nov 12, 2016 at 09:42:38AM -0800, Christoph Hellwig wrote:
>> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > > It's mostly about the RAID1 and RAID10 code which does a lot of funny
>> > > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
>> > > drivers don't touch.  One example is the r1buf_pool_alloc code,
>> > > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
>> > > case, which would also take care of r1buf_pool_free.  I'm not sure
>> > > about all the others cases, as some bits don't fully make sense to me,
>> >
>> > The problem is we use the iov_vec to track the pages allocated. We will read
>> > data to the pages and write out later for resync. If we add new fields to track
>> > the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
>> > avoid the tricky parts. This should work for both the resync and writebehind
>> > cases.
>>
>> I don't think we need to track the pages specificly - if we clone
>> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
>> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
>> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
>> bio_alloc_pages for each.
>
> Sure, for r1buf_pool_alloc, what you suggested should work well. There are a
> lot of other places we are using bi_vcnt/bi_io_vec. I'm not sure if it's easy
> to replace them with bio iterator. But having a separate data structue to track
> the memory we read/rewite/sync and so on definitively will make things easier.
> I'm not saying to add the extra data structure in bio but instead in r1bio.

From view of multipage bvec, r1buf_pool_alloc() is fine because
the direct access to bi_vcnt/bi_io_vec just happens on a new allocated
bio. For other cases, if pages aren't added to one bio via bio_add_page(),
and the bio isn't cloned from somewhere,  it should be safe to keep current
usage about accessing to bi_vcnt/bi_io_vec.

But it is cleaner to use bio iterator helpers than direct access.

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: Song Liu @ 2016-11-15  1:36 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid@vger.kernel.org, NeilBrown, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <20161115012221.5holqbeqi542k7pa@kernel.org>


> On Nov 14, 2016, at 5:22 PM, Shaohua Li <shli@kernel.org> wrote:
> 
>> The following detailed explanation is copied from the raid5-cache.c:
>> 
>> /*
>> * raid5 cache state machine
>> *
>> * With rhe RAID cache, each stripe works in two modes:
>> *      - caching mode
>> *      - writing-out mode
> 
> 'mode' is quite confusing here. it always remainders me r5c_journal_mode. can
> you use state?

I am not very sure whether state is the best name here. Maybe we can use "phase"?

>> + * Note: when the array is in write-through, each stripe still goes through
>> + * caching mode and writing-out mode. In such cases, this function is called
>> + * in r5c_handle_stripe_dirtying().
>> + */
>> +static void r5c_make_stripe_write_out(struct stripe_head *sh)
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +	struct r5l_log *log = conf->log;
>> +
>> +	if (!log)
>> +		return;
>> +	WARN_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
>> +	set_bit(STRIPE_R5C_WRITE_OUT, &sh->state);
>> +}
>> +
>> +/*
>> + * Setting proper flags after writing (or flushing) data and/or parity to the
>> + * log device. This is called from r5l_log_endio() or r5l_log_flush_endio().
>> + */
>> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
>> +{
>> +	struct r5l_log *log = sh->raid_conf->log;
>> +
>> +	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
>> +		BUG_ON(!test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
>> +		/*
>> +		 * Set R5_InJournal for parity dev[pd_idx]. This means parity
>> +		 * is in the journal. For RAID 6, it is NOT necessary to set
>> +		 * the flag for dev[qd_idx], as the two parities are written
>> +		 * out together.
>> +		 */
>> +		set_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);
> 
> if this flag is only for pd_idx disk and you are using it to determine the
> stripe data is in journal, why not make it a stripe flag?
R5_InJournal is used in multiple place in the next patch. Using it here will 
save us a stripe flag. 

Thanks,
Song



^ permalink raw reply

* Re: [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: Shaohua Li @ 2016-11-15  1:38 UTC (permalink / raw)
  To: Song Liu
  Cc: Shaohua Li, linux-raid@vger.kernel.org, NeilBrown, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <E9BA3F38-D363-42CA-9360-B5CFF0F30671@fb.com>

On Tue, Nov 15, 2016 at 01:36:45AM +0000, Song Liu wrote:
> 
> > On Nov 14, 2016, at 5:22 PM, Shaohua Li <shli@kernel.org> wrote:
> > 
> >> The following detailed explanation is copied from the raid5-cache.c:
> >> 
> >> /*
> >> * raid5 cache state machine
> >> *
> >> * With rhe RAID cache, each stripe works in two modes:
> >> *      - caching mode
> >> *      - writing-out mode
> > 
> > 'mode' is quite confusing here. it always remainders me r5c_journal_mode. can
> > you use state?
> 
> I am not very sure whether state is the best name here. Maybe we can use "phase"?

sounds good

^ permalink raw reply

* Aluminum die casting parts
From: Bob @ 2016-11-15  5:35 UTC (permalink / raw)
  To: linux-raid

Hello,

Here is the experienced manufactuer of aluminum castings.

Our casting parts in gear box, heat transfers and automations  have supplied clients in USA, Canada, Germany and so on for more then 10 years. and have got perfect reputation from them. We are able to offer from casting, sand blasting, heat treatment and machining the whole process.

Shall you have any interesting, pls don't hesitate and let us know.

b.rgds 
Bob Hu 
Ningbo Yinzhou Xusheng Machinery Factory 
Skype: bobhu1
Cell: 0086-1395832 7774 
Tel: 0086-574-88128603 
bob@xs-aluminumcasting.com 
www.xs-aluminumcasting.com

^ permalink raw reply

* [PATCH] RAID1: Avoid unnecessary loop to decrease conf->nr_queued in raid1d()
From: Coly Li @ 2016-11-15 15:27 UTC (permalink / raw)
  To: linux-raid; +Cc: Coly Li, Shaohua Li

commit ccfc7bf1f0 points out bios from conf->bio_end_io_list also
contribute to conf->nr_queued counter, that's right. But the fix
could be improved. The original fix replaces list_add() by a while()
loop to iterate every bio on conf->bio_end_io_list, and decrease
conf->nr_queued. If we look at the code several lines after, we may
find there is another while() loop to iterate every node from tmp
list which contains the original content of conf->bio_end_io_list.
Yes, we can decrease conf->nr_queued here, then we can avoid the
previous extra while() loop, which consumes more CPU cycles and hold
a spin lock longer time when conf->bio_end_io_list is not tiny. 

This patch changes to decrease conf->nr_queued in loop of
while(!list_empty(&tmp)){}, it avoids an extra loop execution, and
avoids holding conf->device_lock for too long time.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 29e2df5..ea98a73 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2388,16 +2388,17 @@ static void raid1d(struct md_thread *thread)
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
 		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
+			list_add(&tmp, &conf->bio_end_io_list);
+			list_del_init(&conf->bio_end_io_list);
 		}
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
-- 
2.6.6

^ permalink raw reply related

* Re: [PATCH v6 04/11] md/r5cache: caching mode of r5cache
From: Shaohua Li @ 2016-11-15 17:03 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161110204623.3484694-5-songliubraving@fb.com>

On Thu, Nov 10, 2016 at 12:46:16PM -0800, Song Liu wrote:
> As described in previous patch, write back cache operates in two
> modes: caching and writing-out. The caching mode works as:
> 1. write data to journal
>    (r5c_handle_stripe_dirtying, r5c_cache_data)
> 2. call bio_endio
>    (r5c_handle_data_cached, r5c_return_dev_pending_writes).
> 
> Then the writing-out path is as:
> 1. Mark the stripe as write-out (r5c_make_stripe_write_out)
> 2. Calcualte parity (reconstruct or RMW)
> 3. Write parity (and maybe some other data) to journal device
> 4. Write data and parity to RAID disks
> 
> This patch implements caching mode. The cache is integrated with
> stripe cache of raid456. It leverages code of r5l_log to write
> data to journal device.
> 
> Writing-out mode of the cache is implemented in the next patch.
> 
> With r5cache, write operation does not wait for parity calculation
> and write out, so the write latency is lower (1 write to journal
> device vs. read and then write to raid disks). Also, r5cache will
> reduce RAID overhead (multipile IO due to read-modify-write of
> parity) and provide more opportunities of full stripe writes.
> 
> This patch adds 2 flags to stripe_head.state:
>  - STRIPE_R5C_PARTIAL_STRIPE,
>  - STRIPE_R5C_FULL_STRIPE,
> 
> Instead of inactive_list, stripes with cached data are tracked in
> r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
> STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
> stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
> are not considered as "active".
> 
> For RMW, the code allocates an extra page for each data block
> being updated.  This is stored in r5dev->page and the old data
> is read into it.  Then the prexor calculation subtracts ->page
> from the parity block, and the reconstruct calculation adds the
> ->orig_page data back into the parity block.
> 
> r5cache naturally excludes SkipCopy. When the array has write back
> cache, async_copy_data() will not skip copy.
> 
> There are some known limitations of the cache implementation:
> 
> 1. Write cache only covers full page writes (R5_OVERWRITE). Writes
>    of smaller granularity are write through.
> 2. Only one log io (sh->log_io) for each stripe at anytime. Later
>    writes for the same stripe have to wait. This can be improved by
>    moving log_io to r5dev.
> 3. With writeback cache, read path must enter state machine, which
>    is a significant bottleneck for some workloads.
> 4. There is no per stripe checkpoint (with r5l_payload_flush) in
>    the log, so recovery code has to replay more than necessary data
>    (sometimes all the log from last_checkpoint). This reduces
>    availability of the array.
> 
> This patch includes a fix proposed by ZhengYuan Liu
> <liuzhengyuan@kylinos.cn>
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 192 +++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/md/raid5.c       | 158 +++++++++++++++++++++++++++++++++-----
>  drivers/md/raid5.h       |  17 +++++
>  3 files changed, 342 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index abb2c58..ad97103 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -218,8 +218,21 @@ static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
>  static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
>  			      struct list_head *temp_inactive_list)
>  {
> +	int i;
> +	int uptodate = 0;	/* number of data pages with R5_UPTODATE */
> +	int injournal = 0;	/* number of date pages with R5_InJournal */
> +
>  	BUG_ON(!list_empty(&sh->lru));
>  	BUG_ON(atomic_read(&conf->active_stripes)==0);
> +
> +	if (r5c_is_writeback(conf->log))
> +		for (i = sh->disks; i--; ) {
> +			if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
> +				uptodate++;
> +			if (test_bit(R5_InJournal, &sh->dev[i].flags))
> +				injournal++;
> +		}
> +
>  	if (test_bit(STRIPE_HANDLE, &sh->state)) {
>  		if (test_bit(STRIPE_DELAYED, &sh->state) &&
>  		    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> @@ -245,8 +258,29 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
>  			    < IO_THRESHOLD)
>  				md_wakeup_thread(conf->mddev->thread);
>  		atomic_dec(&conf->active_stripes);
> -		if (!test_bit(STRIPE_EXPANDING, &sh->state))
> -			list_add_tail(&sh->lru, temp_inactive_list);
> +		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> +			if (!r5c_is_writeback(conf->log))
> +				list_add_tail(&sh->lru, temp_inactive_list);
> +			else {
> +				WARN_ON(test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags));
> +				if (injournal == 0)
> +					list_add_tail(&sh->lru, temp_inactive_list);
> +				else if (uptodate == conf->raid_disks - conf->max_degraded) {
> +					/* full stripe */
> +					if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
> +						atomic_inc(&conf->r5c_cached_full_stripes);
> +					if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
> +						atomic_dec(&conf->r5c_cached_partial_stripes);
> +					list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);

Using the R5_UPTODATE bit to determine full stripe is skeptical. Read enters
into the state machine with writeback. After data is is read into stripe cache,
the R5_UPTODATE bit is set. So here we could put stripe which is never written
into r5c_cached_full_stripes. why not just use injournal to do the determination?

> +				} else {
> +					/* partial stripe */
> +					if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
> +							      &sh->state))
> +						atomic_inc(&conf->r5c_cached_partial_stripes);
> +					list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
> +				}
> +			}
> +		}
>  	}
>  }
>  
> @@ -830,8 +864,18 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  
>  	might_sleep();
>  
> -	if (r5l_write_stripe(conf->log, sh) == 0)
> -		return;
> +	if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
> +		/* writing out mode */
> +		if (r5l_write_stripe(conf->log, sh) == 0)
> +			return;
> +	} else {
> +		/* caching mode */
> +		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {

The STRIPE_LOG_TRAPPED bit isn't designed for this, maybe it just works, I have
to double check. but this one is a little confusing.

>  static struct dma_async_tx_descriptor *
> @@ -1496,7 +1558,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
>  	for (i = disks; i--; ) {
>  		struct r5dev *dev = &sh->dev[i];
>  		/* Only process blocks that are known to be uptodate */
> -		if (test_bit(R5_Wantdrain, &dev->flags))
> +		if (test_bit(R5_Wantdrain, &dev->flags) ||
> +		    test_bit(R5_InJournal, &dev->flags))
>  			xor_srcs[count++] = dev->page;
>  	}
>  
> @@ -1530,6 +1593,7 @@ ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
>  static struct dma_async_tx_descriptor *
>  ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>  {
> +	struct r5conf *conf = sh->raid_conf;
>  	int disks = sh->disks;
>  	int i;
>  	struct stripe_head *head_sh = sh;
> @@ -1547,6 +1611,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>  
>  again:
>  			dev = &sh->dev[i];
> +			clear_bit(R5_InJournal, &dev->flags);

why clear the bit here? isn't this bit cleared when the stripe is initialized?

Thanks,
Shaohua

^ permalink raw reply

* RE: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Peter Sangas @ 2016-11-15 18:14 UTC (permalink / raw)
  To: 'Wols Lists', 'Bruce Merry'; +Cc: linux-raid
In-Reply-To: <5829DF1F.7030109@youngman.org.uk>

Hi Wol,


-----Original Message-----
From: Wols Lists [mailto:antlists@youngman.org.uk] 
Sent: Monday, November 14, 2016 7:58 AM
To: Bruce Merry
Cc: linux-raid@vger.kernel.org
Subject: Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1

On 14/11/16 15:52, Bruce Merry wrote:
> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>> > Sounds like that drive could need replacing. I'd get a new drive 
>> > and do that as soon as possible - use the --replace option of mdadm 
>> > - don't fail the old drive and add the new.
> Would you mind explaining why I should use --replace instead of taking 
> out the suspect drive? I guess I lose redundancy for any writes that 
> occur while the rebuild is happening, but I'd plan to do this with the 
> filesystem unmounted so there wouldn't be any writes.

>Because a replace will copy from the old drive to the new, recovering any failures from the rest of the array. A fail-and-add will have to rebuild the entire new array >from what's left of the old, stressing the old array much more.

>Okay, in your case, it probably won't make an awful lot of difference, but it does make you vulnerable to problems on the "good" drive. To alter your wording >slightly, you lose redundancy for writes AND READS that occur while the array is rebuilding. It's just good practice (and I point it out because --replace is new and >not well known at present).

>Cheers,
>Wol

With respect to the --replace switch and "replacing a failed drive" documented on the wiki here:
https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive  Can you clear a few things up for me ?

1. If I just want to replace a working drive in a RAID1 and the array is still redundant I can 
issue the following command as in your example:

mdadm /dev/mdN [--fail /dev/sdx1] --remove /dev/sdx1 --add /dev/sdy1

which fails and removes sdx1 and replaces it with sdy1.

Question1. How is this different from first doing a fail/remove on sdx1, physically replacing sdx1 with sdy1 and doing an add on sdy1?


2. If one of the drives as an error in a RAID1 and gets kicked out of the array and the array loses redundancy the wiki has the following example:

mdmad /dev/mdN --re-add /dev/sdX1
mdadm /dev/mdN --add /dev/sdY1 --replace /dev/sdX1 --with /dev/sdY1

Question2.   Is this point here to first try and re-add sdX1 with the "--re-add" (first line above) and if that fails do a replace (second line above)?


Thanks,
Peter


To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Anthony Youngman @ 2016-11-15 18:49 UTC (permalink / raw)
  To: Peter Sangas; +Cc: linux-raid
In-Reply-To: <008001d23f6c$298b5260$7ca1f720$@wnsdev.com>



On 15/11/16 18:14, Peter Sangas wrote:
> Hi Wol,
>
>
> -----Original Message-----
> From: Wols Lists [mailto:antlists@youngman.org.uk]
> Sent: Monday, November 14, 2016 7:58 AM
> To: Bruce Merry
> Cc: linux-raid@vger.kernel.org
> Subject: Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
>
> On 14/11/16 15:52, Bruce Merry wrote:
>> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>>>> Sounds like that drive could need replacing. I'd get a new drive
>>>> and do that as soon as possible - use the --replace option of mdadm
>>>> - don't fail the old drive and add the new.
>> Would you mind explaining why I should use --replace instead of taking
>> out the suspect drive? I guess I lose redundancy for any writes that
>> occur while the rebuild is happening, but I'd plan to do this with the
>> filesystem unmounted so there wouldn't be any writes.
>
>> Because a replace will copy from the old drive to the new, recovering any failures from the rest of the array. A fail-and-add will have to rebuild the entire new array >from what's left of the old, stressing the old array much more.
>
>> Okay, in your case, it probably won't make an awful lot of difference, but it does make you vulnerable to problems on the "good" drive. To alter your wording >slightly, you lose redundancy for writes AND READS that occur while the array is rebuilding. It's just good practice (and I point it out because --replace is new and >not well known at present).
>
>> Cheers,
>> Wol
>
> With respect to the --replace switch and "replacing a failed drive" documented on the wiki here:
> https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive  Can you clear a few things up for me ?
>
> 1. If I just want to replace a working drive in a RAID1 and the array is still redundant I can
> issue the following command as in your example:
>
> mdadm /dev/mdN [--fail /dev/sdx1] --remove /dev/sdx1 --add /dev/sdy1
>
> which fails and removes sdx1 and replaces it with sdy1.
>
> Question1. How is this different from first doing a fail/remove on sdx1, physically replacing sdx1 with sdy1 and doing an add on sdy1?
>
Not really different at all. It's just that you (obviously) can't do the 
remove and add in the same command if you physically swap the drive in 
the middle.

But I bang on a bit about having access to spare port to stick a drive 
on, so I've assumed you can have both the old and the new drive 
physically (and logically) in the system at the same time.
>
> 2. If one of the drives as an error in a RAID1 and gets kicked out of the array and the array loses redundancy the wiki has the following example:
>
> mdmad /dev/mdN --re-add /dev/sdX1
> mdadm /dev/mdN --add /dev/sdY1 --replace /dev/sdX1 --with /dev/sdY1
>
> Question2.   Is this point here to first try and re-add sdX1 with the "--re-add" (first line above) and if that fails do a replace (second line above)?
>
Correct. You've lost redundancy, and (you NEED a bitmap here) the idea 
is to get sdX1 back in to the array to restore redundancy before you 
copy its contents to sdY1.

You need the bitmap because, without it, a re-add becomes a normal add, 
and it's not only a waste of time, it adds stress to the array and 
increases your chances of a total failure.
>
> Thanks,
> Peter
>
Cheers,
Wol

^ permalink raw reply

* Re: [PATCH 07/12] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Christoph Hellwig @ 2016-11-15 18:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block,
	Linux FS Devel, Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <CACVXFVPbG6Dz2F=tc9o=jPdNCcXWDmtReaWGWQeTPa9HaGjh0A@mail.gmail.com>

> Hi Alasdair, Mike, Christoph and anyone,
> 
> Could you give this one a review?

It looks nice, but I don't understand the code anywhere near well
enough to review it.  We'll need someone from the DM to look over it.

^ permalink raw reply

* Re: [PATCH 07/12] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Mike Snitzer @ 2016-11-15 18:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, Linux Kernel Mailing List, linux-block,
	Linux FS Devel, Alasdair Kergon, maintainer:DEVICE-MAPPER (LVM),
	Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <20161115185551.GA14816@infradead.org>

On Tue, Nov 15 2016 at  1:55pm -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> > Hi Alasdair, Mike, Christoph and anyone,
> > 
> > Could you give this one a review?
> 
> It looks nice, but I don't understand the code anywhere near well
> enough to review it.  We'll need someone from the DM to look over it.

I'll try to get to it this week.

^ permalink raw reply

* Re: [PATCH v6 04/11] md/r5cache: caching mode of r5cache
From: Song Liu @ 2016-11-15 19:08 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid@vger.kernel.org, NeilBrown, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <20161115170330.kdjxussdau54ekus@kernel.org>


> On Nov 15, 2016, at 9:03 AM, Shaohua Li <shli@kernel.org> wrote:
> 
> On Thu, Nov 10, 2016 at 12:46:16PM -0800, Song Liu wrote:
>> As described in previous patch, write back cache operates in two
>> modes: caching and writing-out. The caching mode works as:
>> 1. write data to journal
>>   (r5c_handle_stripe_dirtying, r5c_cache_data)
>> 2. call bio_endio
>>   (r5c_handle_data_cached, r5c_return_dev_pending_writes).
>> 
>> Then the writing-out path is as:
>> 1. Mark the stripe as write-out (r5c_make_stripe_write_out)
>> 2. Calcualte parity (reconstruct or RMW)
>> 3. Write parity (and maybe some other data) to journal device
>> 4. Write data and parity to RAID disks
>> 
>> This patch implements caching mode. The cache is integrated with
>> stripe cache of raid456. It leverages code of r5l_log to write
>> data to journal device.
>> 
>> Writing-out mode of the cache is implemented in the next patch.
>> 
>> With r5cache, write operation does not wait for parity calculation
>> and write out, so the write latency is lower (1 write to journal
>> device vs. read and then write to raid disks). Also, r5cache will
>> reduce RAID overhead (multipile IO due to read-modify-write of
>> parity) and provide more opportunities of full stripe writes.
>> 
>> This patch adds 2 flags to stripe_head.state:
>> - STRIPE_R5C_PARTIAL_STRIPE,
>> - STRIPE_R5C_FULL_STRIPE,
>> 
>> Instead of inactive_list, stripes with cached data are tracked in
>> r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
>> STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
>> stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
>> are not considered as "active".
>> 
>> For RMW, the code allocates an extra page for each data block
>> being updated.  This is stored in r5dev->page and the old data
>> is read into it.  Then the prexor calculation subtracts ->page
>> from the parity block, and the reconstruct calculation adds the
>> ->orig_page data back into the parity block.
>> 
>> r5cache naturally excludes SkipCopy. When the array has write back
>> cache, async_copy_data() will not skip copy.
>> 
>> There are some known limitations of the cache implementation:
>> 
>> 1. Write cache only covers full page writes (R5_OVERWRITE). Writes
>>   of smaller granularity are write through.
>> 2. Only one log io (sh->log_io) for each stripe at anytime. Later
>>   writes for the same stripe have to wait. This can be improved by
>>   moving log_io to r5dev.
>> 3. With writeback cache, read path must enter state machine, which
>>   is a significant bottleneck for some workloads.
>> 4. There is no per stripe checkpoint (with r5l_payload_flush) in
>>   the log, so recovery code has to replay more than necessary data
>>   (sometimes all the log from last_checkpoint). This reduces
>>   availability of the array.
>> 
>> This patch includes a fix proposed by ZhengYuan Liu
>> <liuzhengyuan@kylinos.cn>
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> 
>> +				if (injournal == 0)
>> +					list_add_tail(&sh->lru, temp_inactive_list);
>> +				else if (uptodate == conf->raid_disks - conf->max_degraded) {
>> +					/* full stripe */
>> +					if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
>> +						atomic_inc(&conf->r5c_cached_full_stripes);
>> +					if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
>> +						atomic_dec(&conf->r5c_cached_partial_stripes);
>> +					list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
> 
> Using the R5_UPTODATE bit to determine full stripe is skeptical. Read enters
> into the state machine with writeback. After data is is read into stripe cache,
> the R5_UPTODATE bit is set. So here we could put stripe which is never written
> into r5c_cached_full_stripes. why not just use injournal to do the determination?

Here, we first test if (injournal == 0). If so, this stripe is released to temp_inactive_list. So I think
it is not problem?
>> 
>> 	might_sleep();
>> 
>> -	if (r5l_write_stripe(conf->log, sh) == 0)
>> -		return;
>> +	if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
>> +		/* writing out mode */
>> +		if (r5l_write_stripe(conf->log, sh) == 0)
>> +			return;
>> +	} else {
>> +		/* caching mode */
>> +		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
> 
> The STRIPE_LOG_TRAPPED bit isn't designed for this, maybe it just works, I have
> to double check. but this one is a little confusing.

This is somehow reuse the flag in caching mode/phase. In handle_stripe(), a large part of the 
logic is skipped with STRIPE_LOG_TRAPPED. This behavior is the same in caching mode/phase
and write_out mode/phase. 

>> static struct dma_async_tx_descriptor *
>> @@ -1496,7 +1558,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
>> 	for (i = disks; i--; ) {
>> 		struct r5dev *dev = &sh->dev[i];
>> 		/* Only process blocks that are known to be uptodate */
>> -		if (test_bit(R5_Wantdrain, &dev->flags))
>> +		if (test_bit(R5_Wantdrain, &dev->flags) ||
>> +		    test_bit(R5_InJournal, &dev->flags))
>> 			xor_srcs[count++] = dev->page;
>> 	}
>> 
>> @@ -1530,6 +1593,7 @@ ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
>> static struct dma_async_tx_descriptor *
>> ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>> {
>> +	struct r5conf *conf = sh->raid_conf;
>> 	int disks = sh->disks;
>> 	int i;
>> 	struct stripe_head *head_sh = sh;
>> @@ -1547,6 +1611,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>> 
>> again:
>> 			dev = &sh->dev[i];
>> +			clear_bit(R5_InJournal, &dev->flags);
> 
> why clear the bit here? isn't this bit cleared when the stripe is initialized?

This is necessary when we rewrite a page that is already in journal. This means there is new data coming to 
this stripe and dev, so we should not consider the page is secured in journal. 

> 
> Thanks,
> Shaohua


^ permalink raw reply

* RE: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Peter Sangas @ 2016-11-15 19:22 UTC (permalink / raw)
  To: 'Anthony Youngman'; +Cc: linux-raid
In-Reply-To: <f5c66009-956a-6751-2164-3249b8454208@youngman.org.uk>

Got it. Thanks again.


-----Original Message-----
From: Anthony Youngman [mailto:antlists@youngman.org.uk] 
Sent: Tuesday, November 15, 2016 10:50 AM
To: Peter Sangas
Cc: linux-raid@vger.kernel.org
Subject: Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1



On 15/11/16 18:14, Peter Sangas wrote:
> Hi Wol,
>
>
> -----Original Message-----
> From: Wols Lists [mailto:antlists@youngman.org.uk]
> Sent: Monday, November 14, 2016 7:58 AM
> To: Bruce Merry
> Cc: linux-raid@vger.kernel.org
> Subject: Re: What to do about Offline_Uncorrectable and Pending_Sector 
> in RAID1
>
> On 14/11/16 15:52, Bruce Merry wrote:
>> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>>>> Sounds like that drive could need replacing. I'd get a new drive 
>>>> and do that as soon as possible - use the --replace option of mdadm
>>>> - don't fail the old drive and add the new.
>> Would you mind explaining why I should use --replace instead of 
>> taking out the suspect drive? I guess I lose redundancy for any 
>> writes that occur while the rebuild is happening, but I'd plan to do 
>> this with the filesystem unmounted so there wouldn't be any writes.
>
>> Because a replace will copy from the old drive to the new, recovering any failures from the rest of the array. A fail-and-add will have to rebuild the entire new array >from what's left of the old, stressing the old array much more.
>
>> Okay, in your case, it probably won't make an awful lot of difference, but it does make you vulnerable to problems on the "good" drive. To alter your wording >slightly, you lose redundancy for writes AND READS that occur while the array is rebuilding. It's just good practice (and I point it out because --replace is new and >not well known at present).
>
>> Cheers,
>> Wol
>
> With respect to the --replace switch and "replacing a failed drive" documented on the wiki here:
> https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive  Can you clear a few things up for me ?
>
> 1. If I just want to replace a working drive in a RAID1 and the array 
> is still redundant I can issue the following command as in your example:
>
> mdadm /dev/mdN [--fail /dev/sdx1] --remove /dev/sdx1 --add /dev/sdy1
>
> which fails and removes sdx1 and replaces it with sdy1.
>
> Question1. How is this different from first doing a fail/remove on sdx1, physically replacing sdx1 with sdy1 and doing an add on sdy1?
>
Not really different at all. It's just that you (obviously) can't do the remove and add in the same command if you physically swap the drive in the middle.

But I bang on a bit about having access to spare port to stick a drive on, so I've assumed you can have both the old and the new drive physically (and logically) in the system at the same time.
>
> 2. If one of the drives as an error in a RAID1 and gets kicked out of the array and the array loses redundancy the wiki has the following example:
>
> mdmad /dev/mdN --re-add /dev/sdX1
> mdadm /dev/mdN --add /dev/sdY1 --replace /dev/sdX1 --with /dev/sdY1
>
> Question2.   Is this point here to first try and re-add sdX1 with the "--re-add" (first line above) and if that fails do a replace (second line above)?
>
Correct. You've lost redundancy, and (you NEED a bitmap here) the idea is to get sdX1 back in to the array to restore redundancy before you copy its contents to sdY1.

You need the bitmap because, without it, a re-add becomes a normal add, and it's not only a waste of time, it adds stress to the array and increases your chances of a total failure.
>
> Thanks,
> Peter
>
Cheers,
Wol
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH v6 04/11] md/r5cache: caching mode of r5cache
From: Shaohua Li @ 2016-11-15 21:49 UTC (permalink / raw)
  To: Song Liu
  Cc: Shaohua Li, linux-raid@vger.kernel.org, NeilBrown, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <0A3E2AC9-966D-4EC3-A16D-CE8AC7B85DB7@fb.com>

On Tue, Nov 15, 2016 at 07:08:29PM +0000, Song Liu wrote:
> 
> > On Nov 15, 2016, at 9:03 AM, Shaohua Li <shli@kernel.org> wrote:
> > 
> > On Thu, Nov 10, 2016 at 12:46:16PM -0800, Song Liu wrote:
> >> As described in previous patch, write back cache operates in two
> >> modes: caching and writing-out. The caching mode works as:
> >> 1. write data to journal
> >>   (r5c_handle_stripe_dirtying, r5c_cache_data)
> >> 2. call bio_endio
> >>   (r5c_handle_data_cached, r5c_return_dev_pending_writes).
> >> 
> >> Then the writing-out path is as:
> >> 1. Mark the stripe as write-out (r5c_make_stripe_write_out)
> >> 2. Calcualte parity (reconstruct or RMW)
> >> 3. Write parity (and maybe some other data) to journal device
> >> 4. Write data and parity to RAID disks
> >> 
> >> This patch implements caching mode. The cache is integrated with
> >> stripe cache of raid456. It leverages code of r5l_log to write
> >> data to journal device.
> >> 
> >> Writing-out mode of the cache is implemented in the next patch.
> >> 
> >> With r5cache, write operation does not wait for parity calculation
> >> and write out, so the write latency is lower (1 write to journal
> >> device vs. read and then write to raid disks). Also, r5cache will
> >> reduce RAID overhead (multipile IO due to read-modify-write of
> >> parity) and provide more opportunities of full stripe writes.
> >> 
> >> This patch adds 2 flags to stripe_head.state:
> >> - STRIPE_R5C_PARTIAL_STRIPE,
> >> - STRIPE_R5C_FULL_STRIPE,
> >> 
> >> Instead of inactive_list, stripes with cached data are tracked in
> >> r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
> >> STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
> >> stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
> >> are not considered as "active".
> >> 
> >> For RMW, the code allocates an extra page for each data block
> >> being updated.  This is stored in r5dev->page and the old data
> >> is read into it.  Then the prexor calculation subtracts ->page
> >> from the parity block, and the reconstruct calculation adds the
> >> ->orig_page data back into the parity block.
> >> 
> >> r5cache naturally excludes SkipCopy. When the array has write back
> >> cache, async_copy_data() will not skip copy.
> >> 
> >> There are some known limitations of the cache implementation:
> >> 
> >> 1. Write cache only covers full page writes (R5_OVERWRITE). Writes
> >>   of smaller granularity are write through.
> >> 2. Only one log io (sh->log_io) for each stripe at anytime. Later
> >>   writes for the same stripe have to wait. This can be improved by
> >>   moving log_io to r5dev.
> >> 3. With writeback cache, read path must enter state machine, which
> >>   is a significant bottleneck for some workloads.
> >> 4. There is no per stripe checkpoint (with r5l_payload_flush) in
> >>   the log, so recovery code has to replay more than necessary data
> >>   (sometimes all the log from last_checkpoint). This reduces
> >>   availability of the array.
> >> 
> >> This patch includes a fix proposed by ZhengYuan Liu
> >> <liuzhengyuan@kylinos.cn>
> >> 
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >> 
> >> +				if (injournal == 0)
> >> +					list_add_tail(&sh->lru, temp_inactive_list);
> >> +				else if (uptodate == conf->raid_disks - conf->max_degraded) {
> >> +					/* full stripe */
> >> +					if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
> >> +						atomic_inc(&conf->r5c_cached_full_stripes);
> >> +					if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
> >> +						atomic_dec(&conf->r5c_cached_partial_stripes);
> >> +					list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
> > 
> > Using the R5_UPTODATE bit to determine full stripe is skeptical. Read enters
> > into the state machine with writeback. After data is is read into stripe cache,
> > the R5_UPTODATE bit is set. So here we could put stripe which is never written
> > into r5c_cached_full_stripes. why not just use injournal to do the determination?
> 
> Here, we first test if (injournal == 0). If so, this stripe is released to temp_inactive_list. So I think
> it is not problem?

The read case could still happen. We could read data from one disk, so the disk
has the UPTODATE set. Then we can write another disk of the stripe, at that
time the 'uptodate' isn't correct. The UPTODATE bit doesn't mean the stripe
data is in journal, so this is an abuse of the bit.

> >> 
> >> 	might_sleep();
> >> 
> >> -	if (r5l_write_stripe(conf->log, sh) == 0)
> >> -		return;
> >> +	if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
> >> +		/* writing out mode */
> >> +		if (r5l_write_stripe(conf->log, sh) == 0)
> >> +			return;
> >> +	} else {
> >> +		/* caching mode */
> >> +		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
> > 
> > The STRIPE_LOG_TRAPPED bit isn't designed for this, maybe it just works, I have
> > to double check. but this one is a little confusing.
> 
> This is somehow reuse the flag in caching mode/phase. In handle_stripe(), a large part of the 
> logic is skipped with STRIPE_LOG_TRAPPED. This behavior is the same in caching mode/phase
> and write_out mode/phase. 

I'd think this is another abuse of the bit. Both writeout mode and the caching
mode could set the bit. Here you are using the bit to determine if entering
caching mode. This isn't clear at all to me. I'd add another bit to indicate
the caching mode if necessary.

> >> static struct dma_async_tx_descriptor *
> >> @@ -1496,7 +1558,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
> >> 	for (i = disks; i--; ) {
> >> 		struct r5dev *dev = &sh->dev[i];
> >> 		/* Only process blocks that are known to be uptodate */
> >> -		if (test_bit(R5_Wantdrain, &dev->flags))
> >> +		if (test_bit(R5_Wantdrain, &dev->flags) ||
> >> +		    test_bit(R5_InJournal, &dev->flags))
> >> 			xor_srcs[count++] = dev->page;
> >> 	}
> >> 
> >> @@ -1530,6 +1593,7 @@ ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
> >> static struct dma_async_tx_descriptor *
> >> ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
> >> {
> >> +	struct r5conf *conf = sh->raid_conf;
> >> 	int disks = sh->disks;
> >> 	int i;
> >> 	struct stripe_head *head_sh = sh;
> >> @@ -1547,6 +1611,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
> >> 
> >> again:
> >> 			dev = &sh->dev[i];
> >> +			clear_bit(R5_InJournal, &dev->flags);
> > 
> > why clear the bit here? isn't this bit cleared when the stripe is initialized?
> 
> This is necessary when we rewrite a page that is already in journal. This means there is new data coming to 
> this stripe and dev, so we should not consider the page is secured in journal. 

This sounds suggest we should clear the bit after writeout is done.

^ permalink raw reply

* Re: [PATCH v6 06/11] md/r5cache: sysfs entry r5c_journal_mode
From: Shaohua Li @ 2016-11-15 23:35 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161110204623.3484694-7-songliubraving@fb.com>

On Thu, Nov 10, 2016 at 12:46:18PM -0800, Song Liu wrote:
> With write cache, r5c_journal_mode is the knob to switch between
> write-back and write-through.

I'd prefer the name is journal_mode
> Below is an example:
> 
> root@virt-test:~/# cat /sys/block/md0/md/r5c_state
> [write-through] write-back
> root@virt-test:~/# echo write-back > /sys/block/md0/md/r5c_state
> root@virt-test:~/# cat /sys/block/md0/md/r5c_state
this doesn't match the code, please update

> write-through [write-back]
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c       |  1 +
>  drivers/md/raid5.h       |  1 +
>  3 files changed, 62 insertions(+)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 8330053..d2acb69 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -60,6 +60,8 @@ enum r5c_journal_mode {
>  	R5C_JOURNAL_MODE_WRITE_BACK = 1,
>  };
>  
> +static char *r5c_journal_mode_str[] = {"write-through",
> +				       "write-back"};
>  /*
>   * raid5 cache state machine
>   *
> @@ -1602,6 +1604,64 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
>  	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  }
>  
> +static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
> +{
> +	struct r5conf *conf = mddev->private;
> +	int ret;
> +
> +	if (!conf->log)
> +		return 0;
> +
> +	switch (conf->log->r5c_journal_mode) {
> +	case R5C_JOURNAL_MODE_WRITE_THROUGH:
> +		ret = snprintf(page, PAGE_SIZE, "[%s] %s\n",
> +			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_THROUGH],
> +			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_BACK]);
Limiting these lines within 80 character width doesn't impact readability

> +		break;
> +	case R5C_JOURNAL_MODE_WRITE_BACK:
> +		ret = snprintf(page, PAGE_SIZE, "%s [%s]\n",
> +			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_THROUGH],
> +			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_BACK]);
> +		break;
> +	default:
> +		ret = 0;
> +	}
> +	return ret;
> +}
> +
> +static ssize_t r5c_journal_mode_store(struct mddev *mddev,
> +				      const char *page, size_t len)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct r5l_log *log = conf->log;
> +	int val = -1, i;
> +
> +	if (!log)
> +		return -ENODEV;
> +
> +	for (i = 0; i < sizeof(r5c_journal_mode_str) / sizeof(r5c_journal_mode_str[0]); i++)
use ARRAY_SIZE

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: NeilBrown @ 2016-11-16  0:17 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-4-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 3505 bytes --]

On Fri, Nov 11 2016, Song Liu wrote:

> This patch adds state machine for raid5-cache. With log device, the
> raid456 array could operate in two different modes (r5c_journal_mode):
>   - write-back (R5C_MODE_WRITE_BACK)
>   - write-through (R5C_MODE_WRITE_THROUGH)
>
> Existing code of raid5-cache only has write-through mode. For write-back
> cache, it is necessary to extend the state machine.
>
> With write-back cache, every stripe could operate in two different
> modes:
>   - caching
>   - writing-out
>
> In caching mode, the stripe handles writes as:
>   - write to journal
>   - return IO
>
> In writing-out mode, the stripe behaviors as a stripe in write through
> mode R5C_MODE_WRITE_THROUGH.
>
> STRIPE_R5C_WRITE_OUT is added to sh->state to differentiate caching and
> writing-out mode.
>
> When the array is write-through, stripes also go between caching mode
> and writing-out mode.

This bothers me.  Why would a stripe *ever* be in "caching mode" (or
"caching phase") when the array is in write-through?  It doesn't seem to
make sense.

A stripe_head goes through several states/stages/phases/things.

 1/ write data blocks to journal
 2/ reading missing data blocks and compute parity
 3/ write data and parity to journal
 4/ write data and parity to RAID

When there is no log, we only perform 2 and 4
When there is a log in WRITE_THOUGH we perform 2,3,4
When there is a log in WRITE_BACK we perform 1 (maybe several times) 2 3 4.

Step 2 is initiated by calling handle_stripe_dirtying().
A new function is introduced "r5c_handle_stripe_dirtying()" which
determines if handle_stripe_dirtying() should do anything.
(It returns '0' if it shouldn't proceed, and -EAGAIN if it should,
 which seems a little strange).

If there is no log, or if STRIPE_R5C_WRITE_OUT is set, -EAGAIN is
returned.
Otherwise if in MODE_WRITE_THROUGH, STRIPE_R5C_WRITE_OUT is set and
-EAGAIN is returned.
else (in next patch) are more complex determination is made, but
-EAGAIN is only ever returns with STRIPE_R5C_WRITE_OUT set, or if log == NULL.

So STRIPE_R5C_WRITE_OUT is (sometimes) set to enter step 2, and cleared
when step 4 completes.

STRIPE_R5C_WRITE_OUT means that the 2(3)4 writeout sequence has
commenced and not yet completed.  I would like to see it *always* set
when that happens, including when log==NULL.
I would probably even rename r5c_handle_stripe_dirtying() to something
like  r5c_check_need_writeout() which would check to see if writeout was
needed, and would set STRIPE_R5C_WRITE_OUT if it was.
Then handle_stripe_dirtying() would be called iff STRIPE_R5C_WRITE_OUT
were set. (it could even be named to handle_stripe_writeout()??)

There are two actions that can be taken when where are ->towrite blocks
on a stripe.  We can enter WRITE_OUT, or they can be cached in the
journal.  Also we can enter WRITE_OUT when a stripe needs to be removed
From memory or from the journal.
This makes "writeout" and "cache" seem more like "actions" than states,
modes, or phases.  Naming is hard.

Trying to understand R5_InJournal:
 It is set (on pd_idx) when step 3 completes.
 It is only used (in this patch) in r5c_finish_stripe_write_out()
   to make sure WRITE_OUT isn't cleared until after InJournal is
 cleared.

So setting InJournal[pd_idx] marks the end of step 3 and clearing WRITE_OUT
marks the end of step 4.

I think that observation helps me understand the code a bit better.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox