Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH 3/7] imsm: PPL support
From: Artur Paszkiewicz @ 2016-11-30  8:07 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid
In-Reply-To: <wrfjzikigvj8.fsf@redhat.com>

On 11/29/2016 04:23 PM, Jes Sorensen wrote:
> Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
>> On 11/29/2016 12:51 AM, Jes Sorensen wrote:
>>>> @@ -3177,6 +3195,9 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
>>>>  
>>>>  		disk = &super->disks->disk;
>>>>  		info->data_offset = total_blocks(&super->disks->disk) - reserved;
>>>> +		/* mpb anchor sector - see store_imsm_mpb() */
>>>> +		info->sb_start = total_blocks(&super->disks->disk) -
>>>> +				 ((2 * super->sector_size) >> 9);
>>>>  		info->component_size = reserved;
>>>>  		info->disk.state  = is_configured(disk) ? (1 << MD_DISK_ACTIVE) : 0;
>>>>  		/* we don't change info->disk.raid_disk here because
>>>
>>> Hi Artur,
>>>
>>> I have been sitting staring at the above for a while, and looking at
>>> store_imsm_mpb() it is not clear to me what is meant to happen here.
>>>
>>> For 4k sector drives, aren't you pushing back sb_start way further than
>>> you are for 512 byte sector drives? Ie. you are subtracting 16 sectors
>>> for the 4k drive but only two sectors for the 512 byte sector drive?
>>>
>>> Maybe it's because it's Monday or I lost the last of my marbles, but
>>> could you possibly enlighten me here please?
>>
>> Jes,
>>
>> You read it correctly. The reason for this is that the IMSM anchor is
>> located in the second _logical_ sector from the end of the drive. So for
>> 4k drives this will be 16 512-byte sectors from the end.
> 
> I see, so the IMSM implementation uses 512 byte logical sectors on top
> of 4k drives? Could I ask you to add a note explaining this in the code?

IMSM uses logical (4k or 512b) sector sizes in the metadata, but mdadm
implementation uses just 512 byte sectors. This is how it works since
Pawel's 4k support patches - it converts 4k metadata internally to 512b
sector values. Plus here the sb_start value is passed to the kernel, so
it must be in 512 byte sectors. Sure, I can add a comment about this.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH 1/7] imsm: metadata changes for PPL
From: Artur Paszkiewicz @ 2016-11-30  7:34 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid
In-Reply-To: <wrfj4m2qia5r.fsf@redhat.com>

On 11/29/2016 04:21 PM, Jes Sorensen wrote:
> Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
>> On 11/29/2016 12:27 AM, Jes Sorensen wrote:
>>> Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
>>>> @@ -189,7 +195,30 @@ struct imsm_dev {
>>>>  	__u16 cache_policy;
>>>>  	__u8  cng_state;
>>>>  	__u8  cng_sub_state;
>>>> -#define IMSM_DEV_FILLERS 10
>>>> +	__u16 my_vol_raid_dev_num; /* Used in Unique volume Id for this RaidDev */
>>>> +
>>>> +	/* NVM_EN */
>>>> +	__u8 nv_cache_mode;
>>>> +	union {
>>>> +		__u8 nv_cache_flags;
>>>> +		struct {
>>>> +		    __u8 dirty:1; /* 1 - cache is dirty, 0 - clean */
>>>> +		    __u8 nvc_health:2;
>>>> +		    __u8 expansion_bytes:5;
>>>> +		} nvCache;
>>>> +	};
>>>
>>> This sets off alarm bells here - you declare a union of a u8 and a
>>> matching bitfield, but do not handle bit endian bitfields. If someone
>>> tried to use this on a big endian system this could get messy.
>>
>> Those fields are not used in the code at all, the only thing added to
>> this structure that is used is 'rwh_policy'. The rest is to fill the
>> gaps between IMSM format in UEFI/Windows.
> 
> Hi Artur,
> 
> I did notice the code wasn't actually used, sorry I didn't mention
> that. However I would still prefer to see at least a comment in the code
> indicating that this would fail on BE systems.

Since I wasn't going to use those fields I had not given this much
thought, but you are right - this is definitely not portable. I think
I'll remove those bitfields in the next version.

Thanks,
Artur

^ permalink raw reply

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2016-11-30  7:19 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid; +Cc: Shaohua Li, Neil Brown, Johannes Thumshirn
In-Reply-To: <583E7392.9020308@suse.com>

On 2016/11/30 下午2:37, Guoqing Jiang wrote:
> 
> 
> On 11/28/2016 03:33 PM, Coly Li wrote:
>> In raid1_sync_request(), I see,
>> conf->cluster_sync_low = mddev->curr_resync_completed;
>> conf->cluster_sync_high = conf->cluster_sync_low +
>> CLUSTER_RESYNC_WINDOW_SECTORS;
>>
>> Is it possible that LBA range [conf->cluster_sync_low,
>> conf->cluster_sync_high] goes across the border of a barrier unit size ?
> 
> Not pretty sure about it, but since cluster's resync window is 32M which is
> less than BARRIER_UNIT_SECTOR_SIZE, I guess it would not cause trouble.

Thanks for the hint. So BARRIER_UNIT_SECTOR_SIZE is 32MB aligned,
cluster's resync window won't go across the boundary.


Coly

^ permalink raw reply

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Guoqing Jiang @ 2016-11-30  6:37 UTC (permalink / raw)
  To: Coly Li, linux-raid; +Cc: Shaohua Li, Neil Brown, Johannes Thumshirn
In-Reply-To: <0bfdd1c7-5953-a282-dd37-3a8595c90844@suse.de>



On 11/28/2016 03:33 PM, Coly Li wrote:
> In raid1_sync_request(), I see,
> conf->cluster_sync_low = mddev->curr_resync_completed;
> conf->cluster_sync_high = conf->cluster_sync_low +
> CLUSTER_RESYNC_WINDOW_SECTORS;
>
> Is it possible that LBA range [conf->cluster_sync_low,
> conf->cluster_sync_high] goes across the border of a barrier unit size ?

Not pretty sure about it, but since cluster's resync window is 32M which is
less than BARRIER_UNIT_SECTOR_SIZE, I guess it would not cause trouble.

Thanks,
Guoqing

^ permalink raw reply

* Re: [PATCH 4/4] md/raid5-cache: adjust the write position of the empty block and mark it as a checkpoint
From: JackieLiu @ 2016-11-30  4:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: songliubraving, 刘正元, linux-raid
In-Reply-To: <20161129223150.tseez7b4y6lb72fs@kernel.org>


> 在 2016年11月30日，06:31，Shaohua Li <shli@kernel.org> 写道：
> 
> On Mon, Nov 28, 2016 at 04:19:21PM +0800, JackieLiu wrote:
>> When recovery is complete, we write an empty block and record his
>> position first, then make the data-only stripes rewritten done,
>> the location of the empty block as the last checkpoint position
>> to write into the super block. And we should update last_checkpoint
>> to this empty block position.
>> ...
>> +	pos = ctx.pos;
>> +	r5l_log_write_empty_meta_block(log, ctx.pos, (ctx.seq += 10));
> 
> hmm, move the ctx.seq += 10 out please

yep, if this patch is OK,I will resend an appropriate patch.

>> +	ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
>> +
>> 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
>> 		pr_debug("md/raid:%s: starting from clean shutdown\n",
>> 			 mdname(mddev));
>> @@ -2167,9 +2171,9 @@ static int r5l_recovery_log(struct r5l_log *log)
>> 
>> 	log->log_start = ctx.pos;
>> 	log->next_checkpoint = ctx.pos;
>> +	log->last_checkpoint = pos;
>> 	log->seq = ctx.seq;
>> -	r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq);
>> -	r5l_write_super(log, ctx.pos);
>> +	r5l_write_super(log, pos);
>> 	return 0;
>> }
> 
> Applied the first 3 patches in the series. This one looks good too, but why we
> always create the empty meta block? It's only required when we don't rewrite
> the data. Eg, the data_only_stripes == 0.
> 
> Thanks,
> Shaohua

As you said, when data_only_stripes != 0, does not need to write an empty 
meta block, but we need to calculate the position of the first list member and
save it.  at the same time, when data_only_stripes == 0, then you need to write
an empty block, and let the super block pointing to him; In any case, Since 
there is a possibility that invalid blocks are connected to valid blocks, we still 
need to add 10 to them.

In my option, if this empty block has been added, we just let the super block 
pointing to him, everything is OK now, the code is more clean, and the logic 
is clear.

Thanks
Jackie





^ permalink raw reply

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2016-11-30  2:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <20161129192952.woyx55w33ytry73a@kernel.org>

On 2016/11/30 上午3:29, Shaohua Li wrote:
> On Mon, Nov 28, 2016 at 02:42:22PM +0800, Coly Li wrote:
>> On 2016/11/23 上午5:35, Shaohua Li wrote:
>>> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>>>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>>>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>>>> I/O barriers to happen only inside a slidingresync window, for regular
>>>> I/Os out of this resync window they don't need to wait for barrier any
>>>> more. On large raid1 device, it helps a lot to improve parallel writing
>>>> I/O throughput when there are background resync I/Os performing at
>>>> same time. 
>>>>
>>>> The idea of sliding resync widow is awesome, but there are several
>>>> challenges are very difficult to solve,
>>>>  - code complexity
>>>>    Sliding resync window requires several veriables to work collectively,
>>>>    this is complexed and very hard to make it work correctly. Just grep
>>>>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>>>>    the original resync window patch. This is not the end, any further
>>>>    related modification may easily introduce more regreassion.
>>>>  - multiple sliding resync windows
>>>>    Currently raid1 code only has a single sliding resync window, we cannot
>>>>    do parallel resync with current I/O barrier implementation.
>>>>    Implementing multiple resync windows are much more complexed, and very
>>>>    hard to make it correctly.
>>>>
>>>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>>>> removing resync window code, I believe life will be much easier.
>>>>
>>>> The brief idea of the simpler barrier is,
>>>>  - Do not maintain a logbal unique resync window
>>>>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>>>    I/O only has to wait for a resync I/O when both them have same barrier
>>>>    bucket index, vice versa.
>>>>  - I/O barrier can be recuded to an acceptable number if there are enought
>>>>    barrier buckets
>>>>
>>>> Here I explain how the barrier buckets are designed,
>>>>  - BARRIER_UNIT_SECTOR_SIZE
>>>>    The whole LBA address space of a raid1 device is divided into multiple
>>>>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>>>>    Bio request won't go across border of barrier unit size, that means
>>>>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>>>>  - BARRIER_BUCKETS_NR
>>>>    There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
>>>>    hit different barrier units, they only need to compete I/O barrier with
>>>>    other I/Os which hit the same barrier bucket index with each other. The
>>>>    index of a barrier bucket which a bio should look for is calculated by
>>>>    get_barrier_bucket_idx(),
>>>> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>>>>    sector is the start sector number of a bio. align_to_barrier_unit_end()
>>>>    will make sure the finall bio sent into generic_make_request() won't
>>>>    exceed border of the barrier unit size.
>>>>  - RRIER_BUCKETS_NR
>>>>    Number of barrier buckets is defined by,
>>>> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
>>>>    For 4KB page size, there are 512 buckets for each raid1 device. That
>>>>    means the propobility of full random I/O barrier confliction may be
>>>>    reduced down to 1/512.
>>>
>>> Thanks! The idea is awesome and does makes the code easier to understand.
>>>
>>> For hash conflict, I'm worrying about one case. resync does sequential access.
>>> Say we have a sequential normal IO. If the first normal IO and resync IO have
>>> conflict, it's possible they will have conflict subsequently, since both are
>>> doing sequential access. We can change the hash to something like this:
>>>
>>> for the first 64M * 512, the hash is 0, 1, ... 511
>>> For the second 64M * 512, the hash is 1, 3, 5 ...
>>> The third 64M * 512, the hash is 2, 5, 8 ...
>>>
>>> It should be very easy to implement something like this, and this should reduce
>>> the conflict of sequential access.
>>>  
>>
>> Hi Shaohua,
>>
>> What I concerned was memory footprint. For very fast sequential I/O,
>> lineal mapping buckets means each (64 bytes) cache line contains 8 long
>> integer, it is very friendly for CPU caching,
>>  - sequential writing 8 barrier units only requires 1 memory fetching
>> for each barrier variables (barrier[], nr_pending[], nr_waiting[],
>> nr_queued[]).
>>  - memory prefetch may have positive effect in sequential I/O.
>>
>> It will be very rare that resync I/O and regular I/O are always
>> conflicted in same barrier bucket: resync I/O throughput usually is
>> slower than regular I/O, it is very hard to keep them always conflicting
>> in same bucket. If this condition really happens, current sliding resync
>> window implementation may also have a similar conflict (regular I/O
>> always hits resync window). So the barrier buckets don't make things worse.
>>
>> If we use a non-continuous mapping, memory fingerprint of the buckets
>> will be quite distributed, and occupies more cache lines for a
>> sequential I/O, which will increase the probability of more cache bounce
>> if there are multiple sequential I/O (on different offset) happen on
>> different CPU cores.
>>
>> This is why I use a very simple linear buckets hashing.
> 
> Don't think memory footprint matters much. And wasting a little memory (eg,
> make the barrier and stuffes cacheline aligned) doesn't matter here too if
> necessary. If we could reduce the hash confilict in an easy way, why not? I
> think Neil's suggestion (using hash_long) makes sense.
>  

Okey, I will do it in next version.


>>>> Comparing to single sliding resync window,
>>>>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>>>>    will have confliction within a single barrier units. So it is similar to
>>>>    single sliding resync window.
>>>>  - But a barrier unit bucket is shared by all barrier units with identical
>>>>    barrier uinit index, the probability of confliction might be higher
>>>>    than single sliding resync window, in condition that writing I/Os
>>>>    always hit barrier units which have identical barrier bucket index with
>>>>    the resync I/Os. This is a very rare condition in real I/O work loads,
>>>>    I cannot imagine how it could happen in practice.
>>>>  - Therefore we can achieve a good enough low confliction rate with much
>>>>    simpler barrier algorithm and implementation.
>>>>
>>>> If user has a (realy) large raid1 device, for example 10PB size, we may
>>>> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
>>>> it is possible to be a raid1-created-time-defined variable in future.
>>>>
>>>> There are two changes should be noticed,
>>>>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>>>>    single loop, it looks like this,
>>>> 	spin_lock_irqsave(&conf->device_lock, flags);
>>>> 	conf->nr_queued[idx]--;
>>>> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>    This change generates more spin lock operations, but in next patch of
>>>>    this patch set, it will be replaced by a single line code,
>>>> 	atomic_dec(conf->nr_queueud[idx]);
>>>>    So we don't need to worry about spin lock cost here.
>>>>  - In raid1_make_request(), wait_barrier() is replaced by,
>>>>    a) wait_read_barrier(): wait barrier in regular read I/O code path
>>>>    b) wait_barrier(): wait barrier in regular write I/O code path
>>>>    The differnece is wait_read_barrier() only waits if array is frozen, I
>>>>    am not able to combile them into one function, because they must receive
>>>>    differnet data types in their arguments list.
>>>>  - align_to_barrier_unit_end() is called to make sure both regular and
>>>>    resync I/O won't go across the border of a barrier unit size.
>>>>  
>>>> Open question:
>>>>  - Need review from md clustring developer, I don't touch related code now.
>>>
>>> Don't think it matters, but please open eyes, Guoqing!
>>>  
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> Cc: Shaohua Li <shli@fb.com>
>>>> Cc: Neil Brown <neilb@suse.de>
>>>> Cc: Johannes Thumshirn <jthumshirn@suse.de>
>>>> Cc: Guoqing Jiang <gqjiang@suse.com>
>>>> ---
>>>>  drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
>>>>  drivers/md/raid1.h |  42 +++++-------
>>>>  2 files changed, 242 insertions(+), 189 deletions(-)
>>>>
>>>> Index: linux-raid1/drivers/md/raid1.c
>>>> ===================================================================
>>>> --- linux-raid1.orig/drivers/md/raid1.c
>>>> +++ linux-raid1/drivers/md/raid1.c
>>>> @@ -66,9 +66,8 @@
>>>>   */
>>>>  static int max_queued_requests = 1024;
>>>>  
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector);
>>>> -static void lower_barrier(struct r1conf *conf);
>>>> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>>>  
>>>>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
>>>>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>>>>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>>>>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
>>>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>>>>  
>>>>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>  {
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>>  	int i;
>>>> +	sector_t sector_nr = r1_bio->sector;
>>>>  
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct bio *bio = r1_bio->bios[i];
>>>> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>  
>>>>  	mempool_free(r1_bio, conf->r1buf_pool);
>>>>  
>>>> -	lower_barrier(conf);
>>>> +	lower_barrier(conf, sector_nr);
>>>>  }
>>>>  
>>>>  static void reschedule_retry(struct r1bio *r1_bio)
>>>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>>>  	unsigned long flags;
>>>>  	struct mddev *mddev = r1_bio->mddev;
>>>>  	struct r1conf *conf = mddev->private;
>>>> +	sector_t sector_nr;
>>>> +	long idx;
>>>> +
>>>> +	sector_nr = r1_bio->sector;
>>>> +	idx = get_barrier_bucket_idx(sector_nr);
>>>>  
>>>>  	spin_lock_irqsave(&conf->device_lock, flags);
>>>>  	list_add(&r1_bio->retry_list, &conf->retry_list);
>>>> -	conf->nr_queued ++;
>>>> +	conf->nr_queued[idx]++;
>>>>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>  
>>>>  	wake_up(&conf->wait_barrier);
>>>> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>>  	int done;
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>> -	sector_t start_next_window = r1_bio->start_next_window;
>>>> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>>>>  
>>>>  	if (bio->bi_phys_segments) {
>>>>  		unsigned long flags;
>>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>>>  		bio->bi_error = -EIO;
>>>>  
>>>> -	if (done) {
>>>> +	if (done)
>>>>  		bio_endio(bio);
>>>> -		/*
>>>> -		 * Wake up any possible resync thread that waits for the device
>>>> -		 * to go idle.
>>>> -		 */
>>>> -		allow_barrier(conf, start_next_window, bi_sector);
>>>> -	}
>>>>  }
>>>>  
>>>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>>>  {
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>> +	struct r1conf *conf = r1_bio->mddev->private;
>>>>  
>>>>  	/* if nobody has done the final endio yet, do it now */
>>>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>>  
>>>>  		call_bio_endio(r1_bio);
>>>>  	}
>>>> +
>>>> +	/*
>>>> +	 * Wake up any possible resync thread that waits for the device
>>>> +	 * to go idle.
>>>> +	 */
>>>> +	allow_barrier(conf, r1_bio->sector);
>>> Why this change?
>>>
>>
>> I move allow_barrier() here on purpose. Because current raid1 code
>> raises nr_pending for each master bio, the barrier buckets patch raises
>> nr_pending[idx] for each r1_bio.
>>
>> Current code increases nr_pending for each master bio (the bio structure
>> received by raid1_make_request()). A master bio may be split by multiple
>> r1_bio structures, when all r1_bio completes, nr_pending is decreased
>> for the original master bio.
>>
>> Since the master bio may go across multiple barrier units, in this patch
>> master bio will be split into multiple r1_bio structures for each
>> barrier units. In this case, we need to call wait_barrer() for each
>> r1_bio, and call allow_barrier() on each r1_bio completion. This is why
>> allow_barrier() is moved form master bio completion time to r1_bio
>> completion time.
>>
>> A master bio may also be split into multiple r1_bio if bad blocks
>> encountered, and these r1_bio may stay in same barrier unit. In this
>> case the different r1_bio does increase nr_pending[] for same bucket
>> index, we should also call wait_barrier() for each r1_bio and call
>> allow_barrier() at its completion time.
> 
> I think if you do bio_split before raid1_make_request, these issues go away.
> 

Let me try it :-)

>>>>  	free_r1bio(r1_bio);
>>>>  }
>>>>  
>>>> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>>>>  	return mirror;
>>>>  }
>>>>  
>>>> +/* bi_end_io callback for regular READ bio */
>>>>  static void raid1_end_read_request(struct bio *bio)
>>>>  {
>>>>  	int uptodate = !bio->bi_error;
>>>> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
>>>>  		bio_put(to_put);
>>>>  }
>>>>  
>>>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>>>> +					  sector_t sectors)
>>>> +{
>>>> +	sector_t len;
>>>> +
>>>> +	WARN_ON(sectors == 0);
>>>> +	/* len is the number of sectors from start_sector to end of the
>>>> +	 * barrier unit which start_sector belongs to.
>>>> +	 */
>>>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>>>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>>>> +	      start_sector;
>>>> +
>>>> +	if (len > sectors)
>>>> +		len = sectors;
>>>> +
>>>> +	return len;
>>>> +}
>>>
>>> I'd prefer calling bio_split at the early of raid1_make_request and split the
>>> bio if it crosses the bucket boundary. please see raid10_make_request for
>>> example. resync will not cross the boundary. So the code will not need to worry
>>> about the boundary. I believe this will make the code simpiler (all the
>>> align_to_barrier_unit_end calling can removed) and easy to understand.
>>>
>>
>> Indeed, my first implementation uses bio_split(). The reasons why I
>> don't use it later are,
>> - indeed I need to write more code.
>> - I can't simply use existing r1_bio completion code to call
>> bio_end_io() to the master bio when bio->bi_phys_segments is zero (see
>> call_bio_endio()). Because r1_bio->master_bio is the split bio, not the
>> original master bio received by raid1_make_request()
>>
>> Therefore finally I decide to use align_to_barrier_unit_end() to
>> generate more r1_bio structures if the original master bio goes across
>> barrier unit size, it makes less modification in this patch.
> 
> That bi_phys_segments issue doesn't make sense to me. Please see how raid10 use
> bio_split.
> - rename raid1_make_request to __raid1_make_request
> - add a raid1_make_request, split bio there, and then call __raid1_make_request
> 
> the __raid1_make_request does everthing it currently does, the bi_phys_segments
> should still work. This might add more code. But now we don't need to worry
> about the boundary problem everywhere else except the new raid1_make_request.
> For example, the allow_barrier issue goes away with this approach. This will
> make the code more understandable and less error prone.
>  

Thanks for the hint. I will use this method in next version.


>>>> +
>>>>  /*
>>>>   * This routine returns the disk from which the requested read should
>>>>   * be done. There is a per-array 'next expected sequential IO' sector
>>>> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
>>>>  		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>>>>  	}
>>>>  	rcu_read_unlock();
>>>> +	sectors = align_to_barrier_unit_end(this_sector, sectors);
>>>>  	*max_sectors = sectors;
>>>>  
>>>>  	return best_disk;
>>>> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
>>>>   *    there is no normal IO happeing.  It must arrange to call
>>>>   *    lower_barrier when the particular background IO completes.
>>>>   */
>>>> +
>>>>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>>  
>>>>  	/* Wait until no block IO is waiting */
>>>> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
>>>> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>>>>  			    conf->resync_lock);
>>>>  
>>>>  	/* block any new IO from starting */
>>>> -	conf->barrier++;
>>>> -	conf->next_resync = sector_nr;
>>>> +	conf->barrier[idx]++;
>>>>  
>>>>  	/* For these conditions we must wait:
>>>>  	 * A: while the array is in frozen state
>>>> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
>>>> -	 *    the max count which allowed.
>>>> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
>>>> -	 *    next resync will reach to the window which normal bios are
>>>> -	 *    handling.
>>>> -	 * D: while there are any active requests in the current window.
>>>> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
>>>> +	 *    existing in sector number ranges corresponding to idx.
>>>> +	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
>>>> +	 *    the max count which allowed in sector number ranges
>>>> +	 *    conrresponding to idx.
>>>>  	 */
>>>>  	wait_event_lock_irq(conf->wait_barrier,
>>>> -			    !conf->array_frozen &&
>>>> -			    conf->barrier < RESYNC_DEPTH &&
>>>> -			    conf->current_window_requests == 0 &&
>>>> -			    (conf->start_next_window >=
>>>> -			     conf->next_resync + RESYNC_SECTORS),
>>>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>>>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>>>  			    conf->resync_lock);
>>>
>>> So there is a slight behavior change. Originally we guarautee no more than
>>> RESYNC_DEPTH sync. Now this only checks 'conf->barrier[idx] < RESYNC_DEPTH'.
>>> How about barrier[idx-1], barrier[idx-2]...? It's possible conf->barrier[idx] <
>>> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how
>>> important this is, but at least we can check barrier[idx] + barrier[idx-1].
>>
>> The original code wants to rise more multiple barriers, but less then
>> RESYNC_DEPTH. It helps resync I/O throughput with less resync I/O and
>> regular I/O conflict. Considering conf->barrier is a global barrier,
>> large RESYNC_DEPTH will starve regular I/O, it is only set to 32 for
>> whole raid1 device.
>>
>> In the barrier buckets patch, conf->barrier[idx] only takes effect in a
>> single bucket. More barriers raised in this barrier buckets won't
>> interfere regular I/O in other barrier buckets, therefore we can have
>> much more resync I/O barriers to raise, which is good for resync I/O
>> throughput without starve more regular I/Os.
>>
>> If we have 512 barrier buckets, that means on the whole raid1 device
>> there are 512*RESYNC_DEPTH barriers can be raised, and allows more
>> parallel resync I/O.
>>
>> Before implementing parallel resync, I keep RESYNC_DEPTH as 32 in this
>> patch, because we only have a resync I/O hits one barrier buckets, same
>> RESYNC_DEPTH won't change barrier behavior now. Latter when we have
>> parallel resync I/O, let's see whether we need to modify this value,
>> also still keep it as 32.
> 
> The problem is resync write could harm latency of regular IO. Even we have
> resync limitation mechanism, sending a lot of resync write out could harm
> regular IO latency for worst case. If the application has P99 metrics, more
> resync IO definitively will harm it. This will not be easy to observe though.
> Could we have a global pending counter to record barrier IO and using it in the
> same way as the old code? That said I'm not sure how important this is, but
> this could make things worse, so I'd like to avoid the regression.
>  

Okey, I will maintain a global barrier depth counter, and keep global
barrier depth to 32 as current upstream code behavior does.

>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>  
>>>> -static void lower_barrier(struct r1conf *conf)
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>>  	unsigned long flags;
>>>> -	BUG_ON(conf->barrier <= 0);
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	BUG_ON(conf->barrier[idx] <= 0);
>>>>  	spin_lock_irqsave(&conf->resync_lock, flags);
>>>> -	conf->barrier--;
>>>> -	conf->nr_pending--;
>>>> +	conf->barrier[idx]--;
>>>> +	conf->nr_pending[idx]--;
>>>>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>>>>  	wake_up(&conf->wait_barrier);
>>>>  }
>>>>  
>>>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>>>> +/* A regular I/O should wait when,
>>>> + * - The whole array is frozen (both READ and WRITE)
>>>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>>>> + */
>>>> +static void _wait_barrier(struct r1conf *conf, long idx)
>>>>  {
>>>> -	bool wait = false;
>>>> -
>>>> -	if (conf->array_frozen || !bio)
>>>> -		wait = true;
>>>> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>>>> -		if ((conf->mddev->curr_resync_completed
>>>> -		     >= bio_end_sector(bio)) ||
>>>> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
>>>> -		     <= bio->bi_iter.bi_sector))
>>>> -			wait = false;
>>>> -		else
>>>> -			wait = true;
>>>> +	spin_lock_irq(&conf->resync_lock);
>>>> +	if (conf->array_frozen || conf->barrier[idx]) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for the barrier to drop. */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen && !conf->barrier[idx],
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>>  
>>>> -	return wait;
>>>> +	conf->nr_pending[idx]++;
>>>> +	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>  
>>>> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>>>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> -	sector_t sector = 0;
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>  
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>> -	if (need_to_wait_for_sync(conf, bio)) {
>>>> -		conf->nr_waiting++;
>>>> -		/* Wait for the barrier to drop.
>>>> -		 * However if there are already pending
>>>> -		 * requests (preventing the barrier from
>>>> -		 * rising completely), and the
>>>> -		 * per-process bio queue isn't empty,
>>>> -		 * then don't wait, as we need to empty
>>>> -		 * that queue to allow conf->start_next_window
>>>> -		 * to increase.
>>>> -		 */
>>>> -		wait_event_lock_irq(conf->wait_barrier,
>>>> -				    !conf->array_frozen &&
>>>> -				    (!conf->barrier ||
>>>> -				     ((conf->start_next_window <
>>>> -				       conf->next_resync + RESYNC_SECTORS) &&
>>>> -				      current->bio_list &&
>>>> -				      !bio_list_empty(current->bio_list))),
>>>> -				    conf->resync_lock);
>>>> -		conf->nr_waiting--;
>>>> -	}
>>>> -
>>>> -	if (bio && bio_data_dir(bio) == WRITE) {
>>>> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
>>>> -			if (conf->start_next_window == MaxSector)
>>>> -				conf->start_next_window =
>>>> -					conf->next_resync +
>>>> -					NEXT_NORMALIO_DISTANCE;
>>>> -
>>>> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>>>> -			    <= bio->bi_iter.bi_sector)
>>>> -				conf->next_window_requests++;
>>>> -			else
>>>> -				conf->current_window_requests++;
>>>> -			sector = conf->start_next_window;
>>>> -		}
>>>> +	if (conf->array_frozen) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for array to unfreeze */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen,
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>> -	return sector;
>>>>  }
>>>>  
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector)
>>>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>>>> +{
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	_wait_barrier(conf, idx);
>>>> +}
>>>> +
>>>> +static void wait_all_barriers(struct r1conf *conf)
>>>> +{
>>>> +	long idx;
>>>> +
>>>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>>>> +		_wait_barrier(conf, idx);
>>>
>>> This is racy. Since resync is sequential, idealy we only need to check the
>>> bucket sequentially. The compiler could do  _wait_barrier(conf, 1) and then
>>> _wait_barrier(conf, 0). It's possible the bucket 1 has barrier right after the
>>> check of bucket 0. Even the compiler doesn't reorder, there is case the bucket
>>> 511 should be check before bucket 0 if the resync is just crossing 512 buckets.
>>> It would be better we check the resync position and adjust the bucket wait
>>> order according to the position.
>>>
>>
>> wait_all_barriers() and allow_all_barriers() are used in close_sync() only,
>>  static void close_sync(struct r1conf *conf)
>>  {
>> -	wait_barrier(conf, NULL);
>> -	allow_barrier(conf, 0, 0);
>> +	wait_all_barriers(conf);
>> +	allow_all_barriers(conf);
>> [snip]
>>  }
>>
>> wait_barrier() and allow_barrier() called here is for a synchronization
>> purpose, to make sure before close_sync() returns there is no resync I/O
>> existing.
>>
>> close_sync() is called in raid1_sync_request() when resync I/O exceeds
>> the size of raid1 device, and resync should be closed. In this
>> condition, there won't be any new resync I/O generated. If a
>> conf->barrier[idx] reaches 0, it won't increase before a new resync
>> restarts. Therefore it is safe to call _wait_barrier() one by one, until
>> every conf->barrier[idx] reaches to 0.
>>
>> This is a usage of wait/allow_barrier() which is not mentioned in the
>> code. They are not for regular I/O waiting for resync I/O, they are used
>> as a synchronization to make sure all on-flying resync I/O to complete.
> 
> thanks, this makes sense.
> 
>>>>  	int good_sectors = RESYNC_SECTORS;
>>>>  	int min_bad = 0; /* number of sectors that are bad in all devices */
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>  
>>>>  	if (!conf->r1buf_pool)
>>>>  		if (init_resync(conf))
>>>> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
>>>>  	 * If there is non-resync activity waiting for a turn, then let it
>>>>  	 * though before starting on this new sync request.
>>>>  	 */
>>>> -	if (conf->nr_waiting)
>>>> +	if (conf->nr_waiting[idx])
>>>>  		schedule_timeout_uninterruptible(1);
>>>>  
>>>>  	/* we are incrementing sector_nr below. To be safe, we check against
>>>> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
>>>>  	r1_bio->sector = sector_nr;
>>>>  	r1_bio->state = 0;
>>>>  	set_bit(R1BIO_IsSync, &r1_bio->state);
>>>> +	/* make sure good_sectors won't go across barrier unit border */
>>>> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>>>>  
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct md_rdev *rdev;
>>>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>>>  	if (!conf)
>>>>  		goto abort;
>>>>  
>>>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>>>> +	if (!conf->nr_pending)
>>>> +		goto abort;
>>>
>>> This allocation is a little wierd. I'd define a count and uses
>>> sizeof(nr_pending) * count to do the allocation. There is nothing related to
>>> PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
>>
>> This is just for better memory usage, r1conf is allocated by slab
>> allocator, large not aligned size may cause internal memory
>> fragmentation inside slab's pages. call kzalloc() with PAGE_SIZE will
>> avoid such issue.
> 
> Not sure if slab has closest slab for this, but that's not important. I think
> sizeof(nr_pending) * count makes more sense here, even sizeof(nr_pending)
> * count == PAGE_SIZE.
> 

Okey, I will move these counters into r1conf, in next version.

Thanks for your review and suggestion !

Coly


^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on bitmap_startwrite+0x122
From: Shaohua Li @ 2016-11-30  0:08 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, NeilBrown
In-Reply-To: <CAMGffEmvr2rJrAmb+qpq-FdFVMxisH14f_VwWMTTyAXpc_saMQ@mail.gmail.com>

On Mon, Nov 28, 2016 at 09:45:07AM +0100, Jinpu Wang wrote:
> Hi folks,
> 
> We hit another hung task on bitmap_startwrite with our test machines, this time
> it's hung in bitmap_startwrite.
> 
> We build MD/RAID1 over 2 block devices exported via IB,
> bitmap=internal. KVM build on top of
> RAID1 on compute node, disks are on remote storage node, one storage
> node crash/reboot, multiple raid1 on multiple compute node KVM run
> into hung task:
> 
> [106204.343870] INFO: task kvm:37669 blocked for more than 180 seconds.
> 
> [106204.344138]       Tainted: G          IO    4.4.28-1-pserver #1
> 
> [106204.344385] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> 
> [106204.344798] kvm             D ffff882037723710     0 37669      1 0x00000000
> 
> [106204.344805]  ffff882037723710 ffff882038f08d00 ffff882029770d00
> ffff8820377236d8
> 
> [106204.344809]  ffff8820377236d8 ffff882037724000 0000000000308648
> 0000000000000008
> 
> [106204.344813]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728
> ffffffff81811c60
> 
> [106204.344818] Call Trace:
> 
> [106204.344831]  [<ffffffff81811c60>] schedule+0x30/0x80
> 
> [106204.344841]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
> 
> [106204.344848]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
> 
> [106204.344853]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
> 
> [106204.344859]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
> 
> [106204.344865]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
> 
> [106204.344871]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
> 
> [106204.344875]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
> 
> [106204.344879]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
> 
> [106204.344883]  [<ffffffff813fd692>] submit_bio+0x62/0x150
> 
> [106204.344892]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
> 
> [106204.344897]  [<ffffffff810b9999>] ? __remove_hrtimer+0x89/0xa0
> 
> [106204.344903]  [<ffffffff8173c08f>] ? udp_poll+0x1f/0xb0
> 
> [106204.344908]  [<ffffffff816b71c7>] ? sock_poll+0x57/0x120
> 
> [106204.344913]  [<ffffffff811cdbf0>] ? I_BDEV+0x10/0x10
> 
> [106204.344918]  [<ffffffff811d3b1e>] __blockdev_direct_IO+0x3e/0x40
> 
> [106204.344922]  [<ffffffff811ce287>] blkdev_direct_IO+0x47/0x50
> 
> [106204.344930]  [<ffffffff81132c60>] generic_file_direct_write+0xb0/0x170
> 
> [106204.344934]  [<ffffffff81132ded>] __generic_file_write_iter+0xcd/0x1f0
> 
> [106204.344943]  [<ffffffff81184ff8>] ? kmem_cache_free+0x78/0x190
> 
> [106204.344948]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
> 
> [106204.344952]  [<ffffffff811ce547>] blkdev_write_iter+0x87/0x110
> 
> [106204.344956]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
> 
> [106204.344962]  [<ffffffff811dec56>] aio_run_iocb+0x236/0x2a0
> 
> [106204.344966]  [<ffffffff811dd183>] ? eventfd_ctx_read+0x53/0x200
> 
> [106204.344973]  [<ffffffff811b3bbf>] ? __fget_light+0x1f/0x60
> 
> [106204.344976]  [<ffffffff811b3c0e>] ? __fdget+0xe/0x10
> 
> [106204.344980]  [<ffffffff811dfb5a>] do_io_submit+0x23a/0x4d0
> 
> [106204.344985]  [<ffffffff811dfdfb>] SyS_io_submit+0xb/0x10
> 
> [106204.344989]  [<ffffffff818154d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
> 
> [106384.345330] INFO: task kvm:37669 blocked for more than 180 seconds.
> 
> [106384.345621]       Tainted: G          IO    4.4.28-1-pserver #1
> 
> [106384.345866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> 
> [106384.346275] kvm             D ffff882037723710     0 37669      1 0x00000000
> 
> [106384.346282]  ffff882037723710 ffff882038f08d00 ffff882029770d00
> ffff8820377236d8
> 
> [106384.346286]  ffff8820377236d8 ffff882037724000 0000000000308648
> 0000000000000008
> 
> [106384.346290]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728
> ffffffff81811c60
> 
> [106384.346294] Call Trace:
> 
> [106384.346308]  [<ffffffff81811c60>] schedule+0x30/0x80
> 
> [106384.346317]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
> 
> [106384.346325]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
> 
> [106384.346330]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
> 
> [106384.346336]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
> 
> [106384.346341]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
> 
> [106384.346347]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
> 
> [106384.346352]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
> 
> [106384.346356]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
> 
> [106384.346360]  [<ffffffff813fd692>] submit_bio+0x62/0x150
> 
> [106384.346369]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
> 
> 
> (gdb) l *bitmap_startwrite+0x122
> 
> 0x121d2 is in bitmap_startwrite (drivers/md/bitmap.c:1396).
> 
> 
> 
> 1394                 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
> 
> 1395                         DEFINE_WAIT(__wait);
> 
> 1396                         /* note that it is safe to do the prepare_to_wait
> 
> 1397                          * after the test as long as we do it
> before dropping
> 
> 1398                          * the spinlock.
> 
> 1399                          */
> 
> 1400                         prepare_to_wait(&bitmap->overflow_wait, &__wait,
> 
> 1401                                         TASK_UNINTERRUPTIBLE);
> 
> 1402                         spin_unlock_irq(&bitmap->counts.lock);
> 
> 1403                         schedule();
> 
> 1404                         finish_wait(&bitmap->overflow_wait, &__wait);
> 
> 1405                         continue;
> 
> 1406                 }
> 
> so seem KVM is waiting on overflow_wait queue, but somehow no one wake
> him up. During reboot one storage, raid1 has a lot IO errors in that
> time, I guess some error handle part is broken.
> 
> I haven't have a reproducer, just want to report to community, in case
> this is known bug, or anyone has patch already :)

Does the kernel report the raid disk is faulty and gets removed? Is this a
real hang, eg maybe we are waitting for IO error reported.

^ permalink raw reply

* Re: [PATCH] md/raid5: limit request size according to implementation limits
From: Shaohua Li @ 2016-11-29 23:52 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Neil Brown, linux-raid, linux-kernel, stable
In-Reply-To: <148026435214.19980.7956943898609877817.stgit@buzz>

On Sun, Nov 27, 2016 at 07:32:32PM +0300, Konstantin Khlebnikov wrote:
> Current implementation employ 16bit counter of active stripes in lower
> bits of bio->bi_phys_segments. If request is big enough to overflow
> this counter bio will be completed and freed too early.
> 
> Fortunately this not happens in default configuration because several
> other limits prevent that: stripe_cache_size * nr_disks effectively
> limits count of active stripes. And small max_sectors_kb at lower
> disks prevent that during normal read/write operations.
> 
> Overflow easily happens in discard if it's enabled by module parameter
> "devices_handle_discard_safely" and stripe_cache_size is set big enough.
> 
> This patch limits requests size with 256Mb - 8Kb to prevent overflows.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Neil Brown <neilb@suse.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/md/raid5.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 92ac251e91e6..cce6057b9aca 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6984,6 +6984,15 @@ static int raid5_run(struct mddev *mddev)
>  			stripe = (stripe | (stripe-1)) + 1;
>  		mddev->queue->limits.discard_alignment = stripe;
>  		mddev->queue->limits.discard_granularity = stripe;
> +
> +		/*
> +		 * We use 16-bit counter of active stripes in bi_phys_segments
> +		 * (minus one for over-loaded initialization)
> +		 */
> +		blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
> +		blk_queue_max_discard_sectors(mddev->queue,
> +					      0xfffe * STRIPE_SECTORS);
> +
>  		/*
>  		 * unaligned part of discard request will be ignored, so can't
>  		 * guarantee discard_zeroes_data

Thanks! I applied this one, which is easy for stable too. After Neil's patches
to remove the limitation, we can remove this one. 

Thanks,
Shaohua

^ permalink raw reply

* Re: Is trim/discard supported on LVM RAID logical volume?
From: Andreas Klauer @ 2016-11-29 23:32 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Patrick Dung, Linux-RAID
In-Reply-To: <CAJCQCtQk01rtPa1=_XnxS3vfhP_gB88fXFayx4T_-OkbUUBGoQ@mail.gmail.com>

On Tue, Nov 29, 2016 at 03:21:36PM -0700, Chris Murphy wrote:
> Reading the documentation for both, they're functionally the same
> thing in that with these options the particular layer will pass on a
> trim to the next layer underneath it.

discard/fstrim/blkdiscard works fine with issue_discards=0.

If it doesn't work, setting issue_discards=1 won't do a thing for you. 
That flag only affects the lvm tool itself, discard freed LVM extents 
after lvreduce or pvmove. Can't undo botched resizes anymore, too bad.

Most storage layers simply don't have any setting for this at all.
It just works. Or it does not (in many cases, old/stable kernel/distro).
cryptsetup's allow-discards is the exception, the one layer where you 
have to manually enable something to make it work.

Regards
Andreas Klauer

^ permalink raw reply

* [PATCH] Reorganize make_request to clean up code.
From: Robert LeBlanc @ 2016-11-29 23:07 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert LeBlanc

This code only runs during a write, so move it to the write section to
clean up the code.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
 drivers/md/raid1.c  | 68 ++++++++++++++++++++++++++---------------------------
 drivers/md/raid10.c |  4 ++--
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 21dc00e..f2db4bf 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1054,40 +1054,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	int max_sectors;
 	sector_t start_next_window;
 
-	/*
-	 * Register the new request and wait if the reconstruction
-	 * thread has put up a bar for new requests.
-	 * Continue immediately if no resync is active currently.
-	 */
-
-	md_write_start(mddev, bio); /* wait on superblock update early */
-
-	if (bio_data_dir(bio) == WRITE &&
-	    ((bio_end_sector(bio) > mddev->suspend_lo &&
-	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
-	    (mddev_is_clustered(mddev) &&
-	     md_cluster_ops->area_resyncing(mddev, WRITE,
-		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
-		/* As the suspend_* range is controlled by
-		 * userspace, we want an interruptible
-		 * wait.
-		 */
-		DEFINE_WAIT(w);
-		for (;;) {
-			flush_signals(current);
-			prepare_to_wait(&conf->wait_barrier,
-					&w, TASK_INTERRUPTIBLE);
-			if (bio_end_sector(bio) <= mddev->suspend_lo ||
-			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
-			    (mddev_is_clustered(mddev) &&
-			     !md_cluster_ops->area_resyncing(mddev, WRITE,
-				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
-				break;
-			schedule();
-		}
-		finish_wait(&conf->wait_barrier, &w);
-	}
-
 	start_next_window = wait_barrier(conf, bio);
 
 	bitmap = mddev->bitmap;
@@ -1194,6 +1160,40 @@ read_again:
 	/*
 	 * WRITE:
 	 */
+
+	/*
+	 * Register the new request and wait if the reconstruction
+	 * thread has put up a bar for new requests.
+	 * Continue immediately if no resync is active currently.
+	 */
+
+	md_write_start(mddev, bio); /* wait on superblock update early */
+
+	if ((bio_end_sector(bio) > mddev->suspend_lo &&
+	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
+	    (mddev_is_clustered(mddev) &&
+	     md_cluster_ops->area_resyncing(mddev, WRITE,
+		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
+		/* As the suspend_* range is controlled by
+		 * userspace, we want an interruptible
+		 * wait.
+		 */
+		DEFINE_WAIT(w);
+		for (;;) {
+			flush_signals(current);
+			prepare_to_wait(&conf->wait_barrier,
+					&w, TASK_INTERRUPTIBLE);
+			if (bio_end_sector(bio) <= mddev->suspend_lo ||
+			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
+			    (mddev_is_clustered(mddev) &&
+			     !md_cluster_ops->area_resyncing(mddev, WRITE,
+				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
+				break;
+			schedule();
+		}
+		finish_wait(&conf->wait_barrier, &w);
+	}
+
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
 		wait_event(conf->wait_barrier,
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index be1a9fc..536b61d 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1064,8 +1064,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 	int max_sectors;
 	int sectors;
 
-	md_write_start(mddev, bio);
-
 	/*
 	 * Register the new request and wait if the reconstruction
 	 * thread has put up a bar for new requests.
@@ -1190,6 +1188,8 @@ read_again:
 	/*
 	 * WRITE:
 	 */
+	md_write_start(mddev, bio);
+
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
 		wait_event(conf->wait_barrier,
-- 
2.10.2


^ permalink raw reply related

* Re: raid0 vs. mkfs
From: Avi Kivity @ 2016-11-29 22:45 UTC (permalink / raw)
  To: NeilBrown, linux-raid
In-Reply-To: <87d1he7zv9.fsf@notabene.neil.brown.name>

On 11/29/2016 11:14 PM, NeilBrown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>>> If it is easy for the upper layer to break a very large request into a
>>> few very large requests, then I wouldn't necessarily object.
>> I can't see why it would be hard.  It's simple arithmetic.
> That is easy to say before writing the code :-)
> It probably is easy for RAID0.  Less so for RAID10.  Even less for
> RAID6.
>

pick the largest subrange wihin the inpu range whose bounds are 0 (mod 
stripe-size); TRIM it (for all members); apply the regular algorithm to 
the head and tail subranges.  Works for all RAID types.  If the RAID is 
undergoing reshaping, exclude the range undergoing reshaping, and treat 
the two halves separately.

>>> But unless it is very hard for the lower layer to merge requests, it
>>> should be doing that too.
>> Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make
>> the latency request R1 sum of the latencies of R1..Rn.  You may gain
>> some efficiency in the process, but that's not going to make up for a
>> factor of n.  The queuing layer has no way to tell whether the caller is
>> interested in the latency of individual requests.  By sending large
>> requests, the caller indicates it's not interested in the latency of
>> individual subranges.  The queuing layer is still free to internally
>> split the request to smaller ranges, to satisfy hardware constraints, or
>> to reduce worst-case latencies for competing request streams.
> I would have thought that using plug/unplug to group requests is a
> fairly strong statement that they can be handled as a unit if that is
> convenient.

It is not.  As an example, consider a read and a few associated 
read-ahead requests submitted in a batch.  The last thing you want is 
for them to be treated as a unit.

Plug/unplug means: I have a bunch of requests here.  Whether they should 
be merged or reordered is orthogonal to whether they are members of a 
batch or not.

>
>
>> So I disagree that all the work should be pushed to the merging layer.
>> It has less information to work with, so the fewer decisions it has to
>> make, the better.
> I think that the merging layer should be as efficient as it reasonably
> can be, and particularly should take into account plugging.  This
> benefits all callers.

Yes, but plugging does not mean "please merge anything you can until the 
unplug".

> If it can be demonstrated that changes to some of the upper layers bring
> further improvements with acceptable costs, then certainly it is good to
> have those too.

Generating millions of requests only to merge them again is 
inefficient.  It happens in an edge case (TRIM of the entirety of a very 
large RAID), but it already caused on user to believe the system 
failed.  I think the system should be more robust than that.

> NeilBrown

^ permalink raw reply

* Re: [PATCH] md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly
From: Shaohua Li @ 2016-11-29 22:40 UTC (permalink / raw)
  To: JackieLiu; +Cc: songliubraving, liuzhengyuan, linux-raid
In-Reply-To: <20161129035730.12822-1-liuyun01@kylinos.cn>

On Tue, Nov 29, 2016 at 11:57:30AM +0800, JackieLiu wrote:
> R5c_make_stripe_write_out has set this flag, do not need to set again.
> 
> Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
> ---
>  drivers/md/raid5-cache.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 95bd51e..aa89de0 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1247,8 +1247,6 @@ static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
>  	atomic_inc(&conf->active_stripes);
>  	r5c_make_stripe_write_out(sh);
>  
> -	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> -		atomic_inc(&conf->preread_active_stripes);
>  	raid5_release_stripe(sh);
>  }

applied, thanks!

^ permalink raw reply

* Re: [PATCH] md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log
From: Shaohua Li @ 2016-11-29 22:37 UTC (permalink / raw)
  To: JackieLiu; +Cc: shli, songliubraving, liuzhengyuan, linux-raid
In-Reply-To: <20161129031320.9025-1-liuyun01@kylinos.cn>

On Tue, Nov 29, 2016 at 11:13:20AM +0800, JackieLiu wrote:
> The next_cp_seq field is useless, remove it.

applied, thanks!
 
> Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
> ---
>  drivers/md/raid5-cache.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 5f817bd..95bd51e 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -113,7 +113,6 @@ struct r5l_log {
>  	u64 seq;			/* log head sequence */
>  
>  	sector_t next_checkpoint;
> -	u64 next_cp_seq;
>  
>  	struct mutex io_mutex;
>  	struct r5l_io_unit *current_io;	/* current io_unit accepting new data */
> @@ -1075,7 +1074,6 @@ static bool r5l_complete_finished_ios(struct r5l_log *log)
>  			break;
>  
>  		log->next_checkpoint = io->log_start;
> -		log->next_cp_seq = io->seq;
>  
>  		list_del(&io->log_sibling);
>  		mempool_free(io, log->io_pool);
> -- 
> 2.10.2
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 4/4] md/raid5-cache: adjust the write position of the empty block and mark it as a checkpoint
From: Shaohua Li @ 2016-11-29 22:31 UTC (permalink / raw)
  To: JackieLiu; +Cc: songliubraving, liuzhengyuan, linux-raid
In-Reply-To: <20161128081921.5641-4-liuyun01@kylinos.cn>

On Mon, Nov 28, 2016 at 04:19:21PM +0800, JackieLiu wrote:
> When recovery is complete, we write an empty block and record his
> position first, then make the data-only stripes rewritten done,
> the location of the empty block as the last checkpoint position
> to write into the super block. And we should update last_checkpoint
> to this empty block position.
> 
> ------------------------------------------------------------------
> |  old log       | empty block | data only stripes | invalid log |
> ------------------------------------------------------------------
> ^                ^                                 ^
> |                |- log->last_checkpoint           |- log->log_start
> |                |- log->last_cp_seq               |- log->next_checkpoint
> |- log->seq=n    |- log->seq=10+n
> 
> At the same time, if there is no data-only stripes, this scene may appear,
> | meta1 | meta2 | meta3 |
> meta 1 is valid, meta 2 is invalid. meta 3 could be valid. so we should 
> The solution is we create a new meta in meta2 with its seq == meta1's 
> seq + 10 and let superblock points to meta2. 
> 
> Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
> Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
> ---
>  drivers/md/raid5-cache.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 9e72180..20594f7 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2072,7 +2072,6 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
>  		return -ENOMEM;
>  	}
>  
> -	ctx->seq += 10;
>  	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
>  		struct r5l_meta_block *mb;
>  		int i;
> @@ -2132,6 +2131,7 @@ static int r5l_recovery_log(struct r5l_log *log)
>  	struct mddev *mddev = log->rdev->mddev;
>  	struct r5l_recovery_ctx ctx;
>  	int ret;
> +	sector_t pos;
>  
>  	ctx.pos = log->last_checkpoint;
>  	ctx.seq = log->last_cp_seq;
> @@ -2149,6 +2149,10 @@ static int r5l_recovery_log(struct r5l_log *log)
>  	if (ret)
>  		return ret;
>  
> +	pos = ctx.pos;
> +	r5l_log_write_empty_meta_block(log, ctx.pos, (ctx.seq += 10));

hmm, move the ctx.seq += 10 out please
> +	ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
> +
>  	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
>  		pr_debug("md/raid:%s: starting from clean shutdown\n",
>  			 mdname(mddev));
> @@ -2167,9 +2171,9 @@ static int r5l_recovery_log(struct r5l_log *log)
>  
>  	log->log_start = ctx.pos;
>  	log->next_checkpoint = ctx.pos;
> +	log->last_checkpoint = pos;
>  	log->seq = ctx.seq;
> -	r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq);
> -	r5l_write_super(log, ctx.pos);
> +	r5l_write_super(log, pos);
>  	return 0;
>  }

Applied the first 3 patches in the series. This one looks good too, but why we
always create the empty meta block? It's only required when we don't rewrite
the data. Eg, the data_only_stripes == 0.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 3/9] imsm: give md list of known bad blocks on startup
From: Jes Sorensen @ 2016-11-29 22:27 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1480424555-31509-4-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> On create set bad block support flag for each drive. On assmble also
> provide a list of known bad blocks. Bad blocks are stored in metadata
> per disk so they have to be checked against volume boundaries
> beforehand.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  super-intel.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 59 insertions(+)

Tomasz,

Thanks for posting 1/9 v3 - I was in the process of applying this set,
but something is causing a conflict.

> diff --git a/super-intel.c b/super-intel.c
> index 421bfbc..b2afdff 100644
> --- a/super-intel.c
> +++ b/super-intel.c
[snip]
> @@ -7160,6 +7212,12 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
>  			info_d->events = __le32_to_cpu(mpb->generation_num);
>  			info_d->data_offset = pba_of_lba0(map);
>  			info_d->component_size = blocks_per_member(map);
> +
> +			info_d->bb.supported = 0;
> +			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
> +					     info_d->data_offset,
> +					     info_d->component_size,
> +					     &info_d->bb);
>  		}
>  		/* now that the disk list is up-to-date fixup recovery_start */
>  		update_recovery_start(super, dev, this);

This hunk is failing as my tree doesn't have the line above:
			info_d->component_size = blocks_per_member(map);

I can merge this manually, but I prefer to be sure we are in sync just
in case. Where did that line come from - did I miss an earlier patch?

Cheers,
Jes

^ permalink raw reply

* Help: RAID5 - Disk failure during upgrade
From: Thomas Büschgens @ 2016-11-29 22:22 UTC (permalink / raw)
  To: linux-raid

Hi there,


kind of "cry for help" mail to the list.


I am running a Thecus N7510 NAS with 7 * 4TB diskd (Western Digital)
in a RAID5 setup. This config was running "smoothly" for about 3 years
now. Couple of days ago I decided to upgrade to 8TB disks instead.


Following the recommended Thecus procedure I did the following:


1. Check SMART on all disks. Fine
2. Pull Disk No. 1
3. Re-assemble HD-Case with new 8TB disk
4. Put new Disk into slot 1


So far, so good. The array immediatly started the rebuild... and a
couple of minutes later disk No. 5 failed.


Hiere the excerpt from the Thecus log:

2016-11-28 23:13:09 [N7510] : User admin logged in from 192.168.7.29
2016-11-28 22:30:36 [N7510] : The RAID [RAID] on system [N7510] change
to degrade mode.
2016-11-28 22:29:57 [N7510] : Disk 5 on [N7510] has failed.
2016-11-28 22:29:56 [N7510] : Disk 5 on [N7510] has failed.
2016-11-28 22:29:56 [N7510] : Disk 5 on [N7510] has failed.
2016-11-28 22:23:52 [N7510] : The RAID [RAID] on system [N7510] is
recovering the RAID and rebuilding is in progress.
2016-11-28 22:23:43 [N7510] : Disk 1 on [N7510] has been added.
2016-11-28 22:17:06 [N7510] : The RAID [RAID] on system [N7510] change
to degrade mode.
2016-11-28 22:17:05 [N7510] : Disk 1 on [N7510] has been removed.


Disk No. 5 is now marked as a potential spare. The output from "mdadm
--examine" is attached to the email.


My basic question is the following: How to proceed.


Currently I am considering the following options:


1. Change back to Disk No. 1 (4TB) - the original one. The disk was
running smoothly when I changed it
2. Option No. 1 - but shutting the system down while doing this
3. Pull/Plug Disk No. 5 and see what happens
4. Reboot?


I don't think this is a Thecus specific question - rather a
mdraid-related issue - in terms of finding the correct procedure.


Any advice / guidance will be appreciated. In case someone needs more
detailed information I am happy to provide this.


Thx,


Tom



/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 59723730:2db46102:6225ddd5:5f0e4781
Name : N7510:10 (local to host N7510)
Creation Time : Fri Jan 3 09:55:12 2014
Raid Level : raid1
Raid Devices : 7


Avail Dev Size : 4192256 (2047.34 MiB 2146.44 MB)
Array Size : 4192232 (2047.33 MiB 2146.42 MB)
Used Dev Size : 4192232 (2047.33 MiB 2146.42 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 9f7bf852:6cc2b78d:29b68d62:a3e0aea4


Update Time : Tue Nov 29 18:37:38 2016
Checksum : 27108c9a - correct
Events : 27744




Device Role : Active device 0
Array State : AAAA.AA ('A' == active, '.' == missing)
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 59723730:2db46102:6225ddd5:5f0e4781
Name : N7510:10 (local to host N7510)
Creation Time : Fri Jan 3 09:55:12 2014
Raid Level : raid1
Raid Devices : 7


Avail Dev Size : 4192256 (2047.34 MiB 2146.44 MB)
Array Size : 4192232 (2047.33 MiB 2146.42 MB)
Used Dev Size : 4192232 (2047.33 MiB 2146.42 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 6463836f:3365a29f:2eff4a43:34ca4251


Update Time : Tue Nov 29 18:37:38 2016
Checksum : e2d749b1 - correct
Events : 27744




Device Role : Active device 1
Array State : AAAA.AA ('A' == active, '.' == missing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 59723730:2db46102:6225ddd5:5f0e4781
Name : N7510:10 (local to host N7510)
Creation Time : Fri Jan 3 09:55:12 2014
Raid Level : raid1
Raid Devices : 7


Avail Dev Size : 4192256 (2047.34 MiB 2146.44 MB)
Array Size : 4192232 (2047.33 MiB 2146.42 MB)
Used Dev Size : 4192232 (2047.33 MiB 2146.42 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 980bdf39:0105d063:5e846eed:53e5cfb1


Update Time : Tue Nov 29 18:37:38 2016
Checksum : 7c113204 - correct
Events : 27744




Device Role : Active device 2
Array State : AAAA.AA ('A' == active, '.' == missing)
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 59723730:2db46102:6225ddd5:5f0e4781
Name : N7510:10 (local to host N7510)
Creation Time : Fri Jan 3 09:55:12 2014
Raid Level : raid1
Raid Devices : 7


Avail Dev Size : 4192256 (2047.34 MiB 2146.44 MB)
Array Size : 4192232 (2047.33 MiB 2146.42 MB)
Used Dev Size : 4192232 (2047.33 MiB 2146.42 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : b176c33f:09dd2c01:9b65da40:27615b03


Update Time : Tue Nov 29 18:37:38 2016
Checksum : c449d239 - correct
Events : 27744




Device Role : Active device 3
Array State : AAAA.AA ('A' == active, '.' == missing)
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 59723730:2db46102:6225ddd5:5f0e4781
Name : N7510:10 (local to host N7510)
Creation Time : Fri Jan 3 09:55:12 2014
Raid Level : raid1
Raid Devices : 7


Avail Dev Size : 4192256 (2047.34 MiB 2146.44 MB)
Array Size : 4192232 (2047.33 MiB 2146.42 MB)
Used Dev Size : 4192232 (2047.33 MiB 2146.42 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : e1963582:a6a35363:88ab4994:5881ec66


Update Time : Mon Nov 28 22:23:50 2016
Checksum : 1fe1025c - correct
Events : 27659




Device Role : Active device 4
Array State : AAAAAAA ('A' == active, '.' == missing)
/dev/sdf1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 59723730:2db46102:6225ddd5:5f0e4781
Name : N7510:10 (local to host N7510)
Creation Time : Fri Jan 3 09:55:12 2014
Raid Level : raid1
Raid Devices : 7


Avail Dev Size : 4192256 (2047.34 MiB 2146.44 MB)
Array Size : 4192232 (2047.33 MiB 2146.42 MB)
Used Dev Size : 4192232 (2047.33 MiB 2146.42 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 9d8b239f:4f85ab54:8acfbe55:aab1eb51


Update Time : Tue Nov 29 18:37:38 2016
Checksum : da9d49e0 - correct
Events : 27744




Device Role : Active device 6
Array State : AAAA.AA ('A' == active, '.' == missing)
/dev/sdg1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 59723730:2db46102:6225ddd5:5f0e4781
Name : N7510:10 (local to host N7510)
Creation Time : Fri Jan 3 09:55:12 2014
Raid Level : raid1
Raid Devices : 7


Avail Dev Size : 4192256 (2047.34 MiB 2146.44 MB)
Array Size : 4192232 (2047.33 MiB 2146.42 MB)
Used Dev Size : 4192232 (2047.33 MiB 2146.42 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 9bfe9659:f919f877:a7c93e6d:c748290a


Update Time : Tue Nov 29 18:37:38 2016
Checksum : 881ae2c3 - correct
Events : 27744




Device Role : Active device 5
Array State : AAAA.AA ('A' == active, '.' == missing)

^ permalink raw reply

* Re: Is trim/discard supported on LVM RAID logical volume?
From: Chris Murphy @ 2016-11-29 22:21 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Chris Murphy, Patrick Dung, Linux-RAID
In-Reply-To: <20161129220949.GA4035@metamorpher.de>

On Tue, Nov 29, 2016 at 3:09 PM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:
> On Tue, Nov 29, 2016 at 01:20:16PM -0700, Chris Murphy wrote:
>> Seen /etc/lvm/lvm.conf issue_discards is probably set to 0.
>
> It should be 0 unless you understand what it's doing and actually want it...
>
> Don't confuse LVM's issue_discards with cryptsetup's allow_discards.

Reading the documentation for both, they're functionally the same
thing in that with these options the particular layer will pass on a
trim to the next layer underneath it. So if the idea is to get trim to
to the SSD as the original poster suggests they want, then they'd have
to use the option at every logical device. Of course there are good
reasons to support trim affecting one or more layers, but not pass it
onto the physical block device.

-- 
Chris Murphy

^ permalink raw reply

* Re: [mdadm PATCH] Introduce enum flag_mode for setting and clearing flags.
From: Jes Sorensen @ 2016-11-29 22:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <877f7m7xng.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> We currently use '1' to indicate that a flag (writemostly or failfast)
> needs to be set, and '2' to indicate that it needs to be cleared.
>
> Using magic number like this is not a best-practice.
>
> So replaced them with values from a enum.
>
> No functional change.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>
> Ask, and you shall receive :-)

Darn, are you trying ot put me to work?

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: Is trim/discard supported on LVM RAID logical volume?
From: Andreas Klauer @ 2016-11-29 22:09 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Patrick Dung, Linux-RAID
In-Reply-To: <CAJCQCtSpOfMvgfxOFvQtM=zcOnyoqvu3GregmW9NS5j9auRuyw@mail.gmail.com>

On Tue, Nov 29, 2016 at 01:20:16PM -0700, Chris Murphy wrote:
> Seen /etc/lvm/lvm.conf issue_discards is probably set to 0.

It should be 0 unless you understand what it's doing and actually want it...

Don't confuse LVM's issue_discards with cryptsetup's allow_discards.

Regards
Andreas Klauer

^ permalink raw reply

* [mdadm PATCH] Introduce enum flag_mode for setting and clearing flags.
From: NeilBrown @ 2016-11-29 22:02 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid
In-Reply-To: <wrfjfumbpv7f.fsf@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 6988 bytes --]


We currently use '1' to indicate that a flag (writemostly or failfast)
needs to be set, and '2' to indicate that it needs to be cleared.

Using magic number like this is not a best-practice.

So replaced them with values from a enum.

No functional change.

Signed-off-by: NeilBrown <neilb@suse.com>
---

Ask, and you shall receive :-)

NeilBrown

 Build.c       |  2 +-
 Create.c      |  4 ++--
 Incremental.c |  4 ++--
 Manage.c      | 32 ++++++++++++++++----------------
 mdadm.c       | 12 ++++++------
 mdadm.h       |  8 ++++++--
 6 files changed, 33 insertions(+), 29 deletions(-)

diff --git a/Build.c b/Build.c
index 8603c710aea2..74a440e78965 100644
--- a/Build.c
+++ b/Build.c
@@ -192,7 +192,7 @@ int Build(char *mddev, struct mddev_dev *devlist,
 			disk.number = i;
 			disk.raid_disk = i;
 			disk.state = (1<<MD_DISK_SYNC) | (1<<MD_DISK_ACTIVE);
-			if (dv->writemostly == 1)
+			if (dv->writemostly == FlagSet)
 				disk.state |= 1<<MD_DISK_WRITEMOSTLY;
 			disk.major = major(stb.st_rdev);
 			disk.minor = minor(stb.st_rdev);
diff --git a/Create.c b/Create.c
index bd114eabafc1..2721884efabc 100644
--- a/Create.c
+++ b/Create.c
@@ -888,9 +888,9 @@ int Create(struct supertype *st, char *mddev,
 				else
 					inf->disk.state = 0;
 
-				if (dv->writemostly == 1)
+				if (dv->writemostly == FlagSet)
 					inf->disk.state |= (1<<MD_DISK_WRITEMOSTLY);
-				if (dv->failfast == 1)
+				if (dv->failfast == FlagSet)
 					inf->disk.state |= (1<<MD_DISK_FAILFAST);
 
 				if (have_container)
diff --git a/Incremental.c b/Incremental.c
index 75d95ccc497a..0f507bb32c9e 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -1034,8 +1034,8 @@ static int array_try_spare(char *devname, int *dfdp, struct dev_policy *pol,
 			char chosen_devname[24]; // 2*11 for int (including signs) + colon + null
 			devlist.next = NULL;
 			devlist.used = 0;
-			devlist.writemostly = 0;
-			devlist.failfast = 0;
+			devlist.writemostly = FlagDefault;
+			devlist.failfast = FlagDefault;
 			devlist.devname = chosen_devname;
 			sprintf(chosen_devname, "%d:%d", major(stb.st_rdev),
 				minor(stb.st_rdev));
diff --git a/Manage.c b/Manage.c
index 429d8631cd23..5c3d2b9b1a9f 100644
--- a/Manage.c
+++ b/Manage.c
@@ -679,17 +679,17 @@ int attempt_re_add(int fd, int tfd, struct mddev_dev *dv,
 			else
 				disc.state |= (1 << MD_DISK_CLUSTER_ADD);
 		}
-		if (dv->writemostly == 1)
+		if (dv->writemostly == FlagSet)
 			disc.state |= 1 << MD_DISK_WRITEMOSTLY;
-		if (dv->writemostly == 2)
+		if (dv->writemostly == FlagClear)
 			disc.state &= ~(1 << MD_DISK_WRITEMOSTLY);
-		if (dv->failfast == 1)
+		if (dv->failfast == FlagSet)
 			disc.state |= 1 << MD_DISK_FAILFAST;
-		if (dv->failfast == 2)
+		if (dv->failfast == FlagClear)
 			disc.state &= ~(1 << MD_DISK_FAILFAST);
 		remove_partitions(tfd);
-		if (update || dv->writemostly > 0
-			|| dv->failfast > 0) {
+		if (update || dv->writemostly != FlagDefault
+			|| dv->failfast != FlagDefault) {
 			int rv = -1;
 			tfd = dev_open(dv->devname, O_RDWR);
 			if (tfd < 0) {
@@ -697,19 +697,19 @@ int attempt_re_add(int fd, int tfd, struct mddev_dev *dv,
 				return -1;
 			}
 
-			if (dv->writemostly == 1)
+			if (dv->writemostly == FlagSet)
 				rv = dev_st->ss->update_super(
 					dev_st, NULL, "writemostly",
 					devname, verbose, 0, NULL);
-			if (dv->writemostly == 2)
+			if (dv->writemostly == FlagClear)
 				rv = dev_st->ss->update_super(
 					dev_st, NULL, "readwrite",
 					devname, verbose, 0, NULL);
-			if (dv->failfast == 1)
+			if (dv->failfast == FlagSet)
 				rv = dev_st->ss->update_super(
 					dev_st, NULL, "failfast",
 					devname, verbose, 0, NULL);
-			if (dv->failfast == 2)
+			if (dv->failfast == FlagClear)
 				rv = dev_st->ss->update_super(
 					dev_st, NULL, "nofailfast",
 					devname, verbose, 0, NULL);
@@ -975,9 +975,9 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 		int dfd;
 		if (dv->disposition == 'j')
 			disc.state |= (1 << MD_DISK_JOURNAL) | (1 << MD_DISK_SYNC);
-		if (dv->writemostly == 1)
+		if (dv->writemostly == FlagSet)
 			disc.state |= 1 << MD_DISK_WRITEMOSTLY;
-		if (dv->failfast == 1)
+		if (dv->failfast == FlagSet)
 			disc.state |= 1 << MD_DISK_FAILFAST;
 		dfd = dev_open(dv->devname, O_RDWR | O_EXCL|O_DIRECT);
 		if (tst->ss->add_to_super(tst, &disc, dfd,
@@ -1022,9 +1022,9 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 			disc.state |= (1 << MD_DISK_CLUSTER_ADD);
 	}
 
-	if (dv->writemostly == 1)
+	if (dv->writemostly == FlagSet)
 		disc.state |= (1 << MD_DISK_WRITEMOSTLY);
-	if (dv->failfast == 1)
+	if (dv->failfast == FlagSet)
 		disc.state |= (1 << MD_DISK_FAILFAST);
 	if (tst->ss->external) {
 		/* add a disk
@@ -1801,8 +1801,8 @@ int move_spare(char *from_devname, char *to_devname, dev_t devid)
 
 	devlist.next = NULL;
 	devlist.used = 0;
-	devlist.writemostly = 0;
-	devlist.failfast = 0;
+	devlist.writemostly = FlagDefault;
+	devlist.failfast = FlagDefault;
 	devlist.devname = devname;
 	sprintf(devname, "%d:%d", major(devid), minor(devid));
 
diff --git a/mdadm.c b/mdadm.c
index 3c8f273c8254..c3a265b80947 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -89,8 +89,8 @@ int main(int argc, char *argv[])
 	int oneshot = 0;
 	int spare_sharing = 1;
 	struct supertype *ss = NULL;
-	int writemostly = 0;
-	int failfast = 0;
+	enum flag_mode writemostly = FlagDefault;
+	enum flag_mode failfast = FlagDefault;
 	char *shortopt = short_options;
 	int dosyslog = 0;
 	int rebuild_map = 0;
@@ -412,20 +412,20 @@ int main(int argc, char *argv[])
 		case O(CREATE,'W'):
 		case O(CREATE,WriteMostly):
 			/* set write-mostly for following devices */
-			writemostly = 1;
+			writemostly = FlagSet;
 			continue;
 
 		case O(MANAGE,'w'):
 			/* clear write-mostly for following devices */
-			writemostly = 2;
+			writemostly = FlagClear;
 			continue;
 
 		case O(MANAGE,FailFast):
 		case O(CREATE,FailFast):
-			failfast = 1;
+			failfast = FlagSet;
 			continue;
 		case O(MANAGE,NoFailFast):
-			failfast = 2;
+			failfast = FlagClear;
 			continue;
 
 		case O(GROW,'z'):
diff --git a/mdadm.h b/mdadm.h
index d47de01f725b..a85b5ad8149d 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -424,6 +424,10 @@ enum bitmap_update {
     NodeNumUpdate,
 };
 
+enum flag_mode {
+	FlagDefault, FlagSet, FlagClear,
+};
+
 /* structures read from config file */
 /* List of mddevice names and identifiers
  * Identifiers can be:
@@ -517,8 +521,8 @@ struct mddev_dev {
 				 * 'A' for re_add.
 				 * Not set for names read from .config
 				 */
-	char writemostly;	/* 1 for 'set writemostly', 2 for 'clear writemostly' */
-	char failfast;		/* Ditto but for 'failfast' flag */
+	enum flag_mode writemostly;
+	enum flag_mode failfast;
 	int used;		/* set when used */
 	long long data_offset;
 	struct mddev_dev *next;
-- 
2.10.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH] md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly
From: Song Liu @ 2016-11-29 21:59 UTC (permalink / raw)
  To: JackieLiu; +Cc: liuzhengyuan, linux-raid, shli
In-Reply-To: <20161129035730.12822-1-liuyun01@kylinos.cn>

On Tue, Nov 29, 2016 at 11:57:30AM +0800, JackieLiu wrote:
> R5c_make_stripe_write_out has set this flag, do not need to set again.
> 
> Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
> ---
>  drivers/md/raid5-cache.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 95bd51e..aa89de0 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1247,8 +1247,6 @@ static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
>  	atomic_inc(&conf->active_stripes);
>  	r5c_make_stripe_write_out(sh);
>  
> -	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> -		atomic_inc(&conf->preread_active_stripes);
>  	raid5_release_stripe(sh);
>  }
>  
> -- 
> 2.10.2
> 
> 
> 
Reviewed-by: Song Liu <songliubraving@fb.com>

Thanks for the fix!
Song

^ permalink raw reply

* Re: raid0 vs. mkfs
From: NeilBrown @ 2016-11-29 21:14 UTC (permalink / raw)
  To: Avi Kivity, linux-raid
In-Reply-To: <df73ebc4-9b78-09b5-022b-089c30dea17c@scylladb.com>

[-- Attachment #1: Type: text/plain, Size: 1822 bytes --]

On Mon, Nov 28 2016, Avi Kivity wrote:
>> If it is easy for the upper layer to break a very large request into a
>> few very large requests, then I wouldn't necessarily object.
>
> I can't see why it would be hard.  It's simple arithmetic.

That is easy to say before writing the code :-)
It probably is easy for RAID0.  Less so for RAID10.  Even less for
RAID6.

>
>> But unless it is very hard for the lower layer to merge requests, it
>> should be doing that too.
>
> Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make 
> the latency request R1 sum of the latencies of R1..Rn.  You may gain 
> some efficiency in the process, but that's not going to make up for a 
> factor of n.  The queuing layer has no way to tell whether the caller is 
> interested in the latency of individual requests.  By sending large 
> requests, the caller indicates it's not interested in the latency of 
> individual subranges.  The queuing layer is still free to internally 
> split the request to smaller ranges, to satisfy hardware constraints, or 
> to reduce worst-case latencies for competing request streams.

I would have thought that using plug/unplug to group requests is a
fairly strong statement that they can be handled as a unit if that is
convenient.

>
> So I disagree that all the work should be pushed to the merging layer.  
> It has less information to work with, so the fewer decisions it has to 
> make, the better.

I think that the merging layer should be as efficient as it reasonably
can be, and particularly should take into account plugging.  This
benefits all callers.

If it can be demonstrated that changes to some of the upper layers bring
further improvements with acceptable costs, then certainly it is good to
have those too.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 4/4] md/raid5-cache: adjust the write position of the empty block and mark it as a checkpoint
From: Song Liu @ 2016-11-29 20:33 UTC (permalink / raw)
  To: linux-raid; +Cc: Song Liu
In-Reply-To: <20161128081921.5641-4-liuyun01@kylinos.cn>

Reviewed-by: Song Liu <songliubraving@fb.com>



^ permalink raw reply

* Re: [PATCH 3/4] md/raid5-cache: release the stripe_head at the appropriate location
From: Song Liu @ 2016-11-29 20:33 UTC (permalink / raw)
  To: linux-raid; +Cc: Song Liu
In-Reply-To: <20161128081921.5641-3-liuyun01@kylinos.cn>

Reviewed-by: Song Liu <songliubraving@fb.com>



^ permalink raw reply

* Re: [PATCH 2/4] md/raid5-cache: use ring add to prevent overflow
From: Song Liu @ 2016-11-29 20:31 UTC (permalink / raw)
  To: linux-raid; +Cc: Song Liu
In-Reply-To: <20161128081921.5641-2-liuyun01@kylinos.cn>

Reviewed-by: Song Liu <songliubraving@fb.com>



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox