Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: proactive disk replacement
From: NeilBrown @ 2017-03-22  4:16 UTC (permalink / raw)
  To: Andreas Klauer, Reindl Harald; +Cc: Adam Goryachev, Jeff Allison, linux-raid
In-Reply-To: <20170321124129.GA18865@metamorpher.de>

[-- Attachment #1: Type: text/plain, Size: 2267 bytes --]

On Tue, Mar 21 2017, Andreas Klauer wrote:

> On Tue, Mar 21, 2017 at 01:03:22PM +0100, Reindl Harald wrote:
>> the IO of a RAID5/6 rebuild is hardly linear beause the informations 
>> (data + parity) are spread all over the disks
>
> It's not "randomly" spread all over. The blocks are always where they belong.
>
> https://en.wikipedia.org/wiki/Standard_RAID_levels#/media/File:RAID_6.svg
>
> It's AAAA, BBBB, CCCC, DDDD. Not DBCA, BADC, ADBC, ...
>
> There is no random I/O involved here, at worst it will decide to not read 
> a parity block because it's not needed but that does not cause huge/random
> jumps for the HDD read heads.

RAID5 resync (after an unclean shutdown) does read the parity.
It reads all devices in parallel and checks parity.  Normally all the
parity is correct so it doesn't write at all.
Occasionally there might be incorrect parity, in which case the head
will seek back and write the correct parity.

RAID5 recovery (when a device was removed and a new device is added)
reads all the *other* devices in parallel, calculates the missing block
(parity or data) and writes out to the replaced devices.  All reads and
writes are sequential.

NeilBrown


>
>> while in case of RAID1/10 it is really linear
>
> Actually RAID 10 has the most interesting layout choices... 
> to this day mdadm is unable to grow/convert some of these.
>
> In a RAID 10 rebuild the HDD might have to jump from end to start.
>
> Of course if you consider metadata updates (progress has to be 
> recorded somewhere?) then ALL rebuilds regardless of RAID level 
> are random I/O in a way.
>
> But such is the fate of a HDD, it's their bread and butter. 
> Any server that does anything other than "idle" does random I/O 24/7.
>
> If there was no other I/O (because the RAID is live during rebuild) 
> and no metadata updates (or external metadata) you could totally do 
> RAID0/1/5/6 rebuilds with tape drives. That's how random it is.
> RAID10 might need a rewind in between.
>
> Regards
> Andreas Klauer
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCHv2 2/2] super1: check and output faulty dev role
From: Jinpu Wang @ 2017-03-22 10:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Gioh Kim, jes.sorensen, linux-raid, linux-kernel@vger.kernel.org
In-Reply-To: <877f3icslm.fsf@notabene.neil.brown.name>

On Tue, Mar 21, 2017 at 8:55 PM, NeilBrown <neilb@suse.com> wrote:
> On Mon, Mar 20 2017, Gioh Kim wrote:
>
>> From: Jack Wang <jinpu.wang@profitbricks.com>
>>
>> Output the real dev role in examine_super1, it will help to
>> find problem.
>>
>> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
>> Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
>> ---
>>  super1.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/super1.c b/super1.c
>> index f3520ac..c903371 100644
>> --- a/super1.c
>> +++ b/super1.c
>> @@ -501,8 +501,10 @@ static void examine_super1(struct supertype *st, char *homehost)
>>  #endif
>>       printf("   Device Role : ");
>>       role = role_from_sb(sb);
>> -     if (role >= MD_DISK_ROLE_FAULTY)
>> -             printf("spare\n");
>> +     if (role == MD_DISK_ROLE_SPARE)
>> +             printf("Spare\n");
>> +     else if (role == MD_DISK_ROLE_FAULTY)
>> +             printf("Faulty\n");
>>       else if (role == MD_DISK_ROLE_JOURNAL)
>>               printf("Journal\n");
>>       else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_REPLACEMENT))
>> --
>> 2.5.0
>
> I don't think the distinction between "faulty" and "spare" is really
> useful here.  I used to report the difference and it turned out to be
> confusing, so we stopped.
>
> This is information stored on some other disk, not the one that is
> spare-or-faulty.  All it needs to know if what other devices are
> working.  It doesn't need to know about which devices aren't working and
> why.
> The distinction between 'faulty' and 'spare' is only relevant to the
> device itself, and to the array as a whole.
>
> We should probably get rid of the distinction between
> MD_DISK_ROLE_FAULTY and MD_DISK_ROLE_SPARE.
> Most places that test for it just test >= MD_DISK_ROLE_FAULTY.
>
> NeilBrown

The reason why I did this change, was during debugging the problem, we
 notice the dev_role was wrong, but
when I print in mdadm it said 'spare', which lead me to check other
kernel code path, so spent more time until, I found
examine_super1, treat >=MD_DISK_ROLE_FAULTY as 'spare'.

I thought if the output was right, it could have saved me or maybe
also other developer some time.

But if this cause confusing in the past, we can drop it, the first is
the real bugfix.

Thanks!
-- 
Jack Wang
Linux Kernel Developer

^ permalink raw reply

* Hang on md RAID5 -> RAID6 reshape
From: Tim Small @ 2017-03-22 10:41 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Hello,

I seem to have hit a bug on reshape (3 disk RAID5 to 4 disk RAID6).

[1050078.168330] ata7.00: error: { UNC }
[1050078.177689] ata7.00: configured for UDMA/133
[1050078.177704] sd 6:0:0:0: [sdd] tag#1 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[1050078.177707] sd 6:0:0:0: [sdd] tag#1 Sense Key : Medium Error [current]
[1050078.177709] sd 6:0:0:0: [sdd] tag#1 Add. Sense: Unrecovered read
error - auto reallocate failed
[1050078.177712] sd 6:0:0:0: [sdd] tag#1 CDB: Read(10) 28 00 02 f4 00 00
00 04 00 00
[1050078.177713] blk_update_request: I/O error, dev sdd, sector 49545912
[1050078.179067] ata7: EH complete
[1050080.945668] raid5_end_read_request: 216 callbacks suppressed
[1050080.945672] md/raid:md2: read error corrected (8 sectors at
28572344 on sde2)
[1050080.945674] md/raid:md2: read error corrected (8 sectors at
28572352 on sde2)
[1050080.945675] md/raid:md2: read error corrected (8 sectors at
28572360 on sde2)
[1050080.945677] md/raid:md2: read error corrected (8 sectors at
28572368 on sde2)
[1050080.945678] md/raid:md2: read error corrected (8 sectors at
28572376 on sde2)
[1050080.945680] md/raid:md2: read error corrected (8 sectors at
28572384 on sde2)
[1050080.945681] md/raid:md2: read error corrected (8 sectors at
28572392 on sde2)
[1050080.945683] md/raid:md2: read error corrected (8 sectors at
28572400 on sde2)
[1050080.945685] md/raid:md2: read error corrected (8 sectors at
28572408 on sde2)
[1050080.945686] md/raid:md2: read error corrected (8 sectors at
28572416 on sde2)
[1050635.232146] INFO: task md2_reshape:23542 blocked for more than 120
seconds.
[1050635.233729]       Not tainted 4.8.0-34-generic #36~16.04.1-Ubuntu
[1050635.235284] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1050635.236858] md2_reshape     D ffff887187b3fb58     0 23542      2
0x00000000
[1050635.236863]  ffff887187b3fb58 ffff8871f7c50001 ffff8871cdad6c00
ffff886fd5615100
[1050635.236866]  0000000000000253 ffff887187b40000 0000000001eff000
ffff887187b3fbd8
[1050635.236868]  ffff8870233c8688 ffff8870233c8400 ffff887187b3fb70
ffffffff9fc91d45
[1050635.236870] Call Trace:
[1050635.236878]  [<ffffffff9fc91d45>] schedule+0x35/0x80
[1050635.236886]  [<ffffffffc0382ea2>] reshape_request+0x682/0x970 [raid456]
[1050635.236889]  [<ffffffff9f4c74d0>] ? wake_atomic_t_function+0x60/0x60
[1050635.236892]  [<ffffffffc03834bb>] raid5_sync_request+0x32b/0x3b0
[raid456]
[1050635.236896]  [<ffffffff9faebb75>] md_do_sync+0x955/0xf00
[1050635.236898]  [<ffffffff9f4c74d0>] ? wake_atomic_t_function+0x60/0x60
[1050635.236902]  [<ffffffff9f490c83>] ? kernel_sigaction+0x43/0xe0
[1050635.236904]  [<ffffffff9fae83e9>] md_thread+0x139/0x150
[1050635.236905]  [<ffffffff9fae82b0>] ? find_pers+0x70/0x70
[1050635.236908]  [<ffffffff9f4a4008>] kthread+0xd8/0xf0
[1050635.236910]  [<ffffffff9fc9679f>] ret_from_fork+0x1f/0x40
[1050635.236912]  [<ffffffff9f4a3f30>] ? kthread_create_on_node+0x1e0/0x1e0


I'm assuming the reallocation is unrelated, since it happened 8 minutes
before the hang.  I didn't explicitly check the underlying device
health, but since there were other md devices using the same drive set I
assume they were OK (no other kernel messages relating to I/O failures
or problems with the other md devices when I rebooted about 9 hours
later).

After reboot, assembly with backup-file (and echo 1 >
/sys/module/md_mod/parameters/start_dirty_degraded) it is continuing the
reshape.

Tim.

^ permalink raw reply

* mdadm: one question about the readonly and readwrite feature
From: Zhilong Liu @ 2017-03-22 12:00 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen; +Cc: linux-raid@vger.kernel.org

Hi, Neil;

   Excuse me, according to read 'mdadm/tests/ToTest', I'm a little 
confused about "readonly"
and "readwrite" feature, and I've no idea how to fix it. Thus I report 
this question and I'm sorry
for this long description email.

relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491

   My question: If the array has been set the MD_CLOSING flag, although 
hasn't removed the sysfs
folder because sysfs_remove_group() wasn't invoked, and now, how should 
mdadm continue to
control this 'readonly' array?
   Of course, once we cannot operate the array, the 'readwrite' feature 
would be never worked.

Test step:
# ./mdadm -CR /dev/md0 -b internal -l1 -n2 /dev/loop[0-1] --assume-clean
# ./mdadm -o /dev/mdX

# in md.h
enum mddev_flags {
         MD_ARRAY_FIRST_USE,     /* First use of array, needs 
initialization */
         MD_CLOSING,             /* If set, we are closing the array, do 
not open it then */

1. In mdadm tool:
   the func: Manage_ro(dv->devname, mdfd, -1) would be never invoked 
once the array has been
set 'readonly' before. the open_mddev() cannot get a valid file 
descriptor any more. Most of mdadm
commands would be failure, I have to execute the "echo clear > 
/sys/block/mdX/md/array_state".

# refer to mdadm.c
... ...
static int misc_list(struct mddev_dev *devlist,
                      struct mddev_ident *ident,
                      char *dump_directory,
                      struct supertype *ss, struct context *c)
{
... ...
                 switch(dv->devname[0] == '/') {
                         case 0:
                                 mdfd = open_dev(dv->devname);
                                 if (mdfd >= 0) break;
                         case 1:
                                 mdfd = open_mddev(dv->devname, 1);
                 }
                 if (mdfd>=0) {
                         switch(dv->disposition) {
                         case 'R':
                                 c->runstop = 1;
                                 rv |= Manage_run(dv->devname, mdfd, c); 
break;
                         case 'S':
                                 rv |= Manage_stop(dv->devname, mdfd, 
c->verbose, 0); break;
                         case 'o':
                                 rv |= Manage_ro(dv->devname, mdfd, 1); 
break;
                         case 'w':
                                 rv |= Manage_ro(dv->devname, mdfd, -1); 
break;
                         }
2. in md driver:
For readonly, the code path is:
ioctl(fd, STOP_ARRAY_RO, NULL)  - - > set_bit(MD_CLOSING, &mddev->flags) 
- - > md_set_readonly()

cut a piece of code: - -> md_set_readonly() of md.c:
... ...
         if (mddev->pers) {
                 __md_stop_writes(mddev);

                 err  = -ENXIO;
                 if (mddev->ro==1)
                         goto out;
                 mddev->ro = 1;
                 set_disk_ro(mddev->gendisk, 1);
                 clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);   - - > 
I think it did nothing once readonly has been set.
                 md_wakeup_thread(mddev->thread);
                 sysfs_notify_dirent_safe(mddev->sysfs_state);
                 err = 0;
... ...

Thanks for your patience,
-Zhilong

^ permalink raw reply

* Re: [PATCH] block: trace completion of all bios.
From: Christoph Hellwig @ 2017-03-22 12:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, linux-block, linux-raid, dm-devel, Alasdair Kergon,
	Mike Snitzer, Shaohua Li, linux-kernel
In-Reply-To: <877f3iave6.fsf@notabene.neil.brown.name>

On Wed, Mar 22, 2017 at 01:38:09PM +1100, NeilBrown wrote:
> 
> Currently only dm and md/raid5 bios trigger trace_block_bio_complete().
> Now that we have bio_chain(), it is not possible, in general, for a
> driver to know when the bio is really complete.  Only bio_endio()
> knows that.
> 
> So move the trace_block_bio_complete() call to bio_endio().

This will cause duplicate events for request based drivers.  You'll
need to have a bio_endio_notrace or similar without that the request
completion path can call.

^ permalink raw reply

* Re: proactive disk replacement
From: Gandalf Corvotempesta @ 2017-03-22 13:53 UTC (permalink / raw)
  To: Phil Turmel
  Cc: David Brown, Wols Lists, Reindl Harald, Adam Goryachev,
	Jeff Allison, linux-raid
In-Reply-To: <f90b9218-85ef-6af1-fa78-e3641a307566@turmel.org>

2017-03-21 17:49 GMT+01:00 Phil Turmel <philip@turmel.org>:
> The correlation is effectively immaterial in a non-degraded raid5 and
> singly-degraded raid6 because recovery will succeed as long as any two
> errors are in different 4k block/sector locations.  And for non-degraded
> raid6, all three UREs must occur in the same block/sector to lose
> data. Some participants in this discussion need to read the statistical
> description of this stuff here:
>
> http://marc.info/?l=linux-raid&m=139050322510249&w=2
>
> As long as you are 'check' scrubbing every so often (I scrub weekly),
> the odds of catastrophe on raid6 are the odds of something *else* taking
> out the machine or controller, not the odds of simultaneous drive
> failures.

This is true but disk failures happens much more than multiple UREs on
the same stripe.
I think that in a RAID6 is much easier to loose data due to multiple
disk failures.

Last years i've lose a server due to 4 (of 6) disks failures in less
than an hours during a rebuild.

The first failure was detected in the middle of the night. It was a
disconnection/reconnaction of a single disks.
The riconnection triggered a resync. During the resync another disk
failed. RAID6 recovered even from this double failure
but at about 60% of rebuild, the third disk failed bringing the whole raid down.

I was waked up by our monitoring system and looking at the server,
there was also a fourth disk down :)

4 disks down in less than a hour. All disk was enterprise: SAS 15K,
not desktop drives.

^ permalink raw reply

* Re: proactive disk replacement
From: David Brown @ 2017-03-22 14:12 UTC (permalink / raw)
  To: Gandalf Corvotempesta, Phil Turmel
  Cc: Wols Lists, Reindl Harald, Adam Goryachev, Jeff Allison,
	linux-raid
In-Reply-To: <CAJH6TXjXR1BM6UojbbgTNpCdyyMhfO4VOG0dxYUAV59PEY+O2g@mail.gmail.com>

On 22/03/17 14:53, Gandalf Corvotempesta wrote:
> 2017-03-21 17:49 GMT+01:00 Phil Turmel <philip@turmel.org>:
>> The correlation is effectively immaterial in a non-degraded raid5 and
>> singly-degraded raid6 because recovery will succeed as long as any two
>> errors are in different 4k block/sector locations.  And for non-degraded
>> raid6, all three UREs must occur in the same block/sector to lose
>> data. Some participants in this discussion need to read the statistical
>> description of this stuff here:
>>
>> http://marc.info/?l=linux-raid&m=139050322510249&w=2
>>
>> As long as you are 'check' scrubbing every so often (I scrub weekly),
>> the odds of catastrophe on raid6 are the odds of something *else* taking
>> out the machine or controller, not the odds of simultaneous drive
>> failures.
> 
> This is true but disk failures happens much more than multiple UREs on
> the same stripe.
> I think that in a RAID6 is much easier to loose data due to multiple
> disk failures.

Certainly multiple disk failures is an easy way to loose data in /any/
storage system (or at least, loose data since the last backup).

The issue here is whether it is more or less likely to be a problem in
RAID6 than other raid arrangements.  And the answer is that complete
disk failures are not more likely during a RAID6 rebuild than during
other raid rebuilds, and a RAID6 will tolerate more failures than RAID1
or RAID5.

Of course, multiple disk failures /do/ occur.  There can be a common
cause of failure.  I have had a few raid systems die completely over the
years.  The causes I can remember include:

1. The SAS controller card died - and I didn't have a replacement.  The
data on the disks is probably still fine.

2. The whole computer died in some unknown way.  The data on the disks
was fine - I put them in another cabinet and re-assembled the md array.

3. A hardware raid card died.  The data may have been on the disks, but
the hardware raid was in a proprietary format.

4. I knocked a disk cabinet off its shelf.  This let to multiple
simultaneous drive failures.

Based on these, my policy is:

1. Stick to SATA drives that are easily available, easily replaced, and
easily read from any system.

2. Avoid hardware raid - use md raid and/or btrfs raid.

3. Do a lot of backups - on independent systems, and with off-site
copies.  Raid does not prevent loss from fire or theft, or a UPS going
bananas, or a user deleting the wrong file.

4. Mount your equipment securely, and turn round slowly!

> 
> Last years i've lose a server due to 4 (of 6) disks failures in less
> than an hours during a rebuild.
> 
> The first failure was detected in the middle of the night. It was a
> disconnection/reconnaction of a single disks.
> The riconnection triggered a resync. During the resync another disk
> failed. RAID6 recovered even from this double failure
> but at about 60% of rebuild, the third disk failed bringing the whole raid down.
> 
> I was waked up by our monitoring system and looking at the server,
> there was also a fourth disk down :)
> 
> 4 disks down in less than a hour. All disk was enterprise: SAS 15K,
> not desktop drives.
> 


^ permalink raw reply

* Re: proactive disk replacement
From: Phil Turmel @ 2017-03-22 14:32 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: David Brown, Wols Lists, Reindl Harald, Adam Goryachev,
	Jeff Allison, linux-raid
In-Reply-To: <CAJH6TXjXR1BM6UojbbgTNpCdyyMhfO4VOG0dxYUAV59PEY+O2g@mail.gmail.com>

On 03/22/2017 09:53 AM, Gandalf Corvotempesta wrote:

> Last years i've lose a server due to 4 (of 6) disks failures in less
> than an hours during a rebuild.
> 
> The first failure was detected in the middle of the night. It was a
> disconnection/reconnaction of a single disks.
> The riconnection triggered a resync. During the resync another disk
> failed. RAID6 recovered even from this double failure
> but at about 60% of rebuild, the third disk failed bringing the whole raid down.
> 
> I was waked up by our monitoring system and looking at the server,
> there was also a fourth disk down :)
> 
> 4 disks down in less than a hour. All disk was enterprise: SAS 15K,
> not desktop drives.

You should win a prize, Gandalf.  In the several years I've participated
on this mailing list, you are the first to describe such a catastrophe
where the drives really were at fault, instead of timeout mismatch,
power supplies, cables, or controllers.

All four disks had permanent "FAILED" smartctl status after this, yes?

Phil

^ permalink raw reply

* Re: proactive disk replacement
From: John Stoffel @ 2017-03-22 14:51 UTC (permalink / raw)
  To: Jeff Allison; +Cc: linux-raid
In-Reply-To: <3FA2E00F-B107-4F3C-A9D3-A10CA5F81EC0@allygray.2y.net>


Jeff> Hi all I’ve had a poke around but am yet to find something
Jeff> definitive.  I have a raid 5 array of 4 disks amounting to
Jeff> approx 5.5tb. Now this disks are getting a bit long in the tooth
Jeff> so before I get into problems I’ve bought 4 new disks to replace
Jeff> them.

Can I suggest that you buy another disk and convert into a RAID6 setup
for even more resiliency?  Esp with that much data (great that you
have backups!) the piece of mind of an extra disk is well worth the
cost in my mind.

Personally, I just go with RAID1 mirrors on large disks like this for
my home system.  I don't have *that* much stuff... though my disks too
are getting long in tooth.  

Jeff> I have a backup so if it all goes west I’m covered. So I’m
Jeff> looking for suggestions.

Jeff> My current plan is just to replace the 2tb drives with the new
Jeff> 3tb drives and move on, I’d like to do it on line with out
Jeff> having to trash the array and start again, so does anyone have a
Jeff> game plan for doing that.

You don't say how your system is setup, whether or not you have LVM on
top of the MD RAID5 array or not.  If you, you could simply do:

1. Build a new RAID6 array with five disks (buying another one like I
   suggest above).

2. Add this into your VG with the 4x2tb disks.

3. pvmove all your data onto the new PVs:

   pvmove -b <VG> <old-raid5-PV>

And once it's done, you can them remove that PV from the VG and pull
them from the system.  Or turn them into a scratch space until they
die...

Jeff> Or is a 9tb raid 5 array the wrong thing to be doing and should
Jeff> I be doing something else 6tb raid 10 or something I’m open to
Jeff> suggestions.

Depends on how good your backups are and how critical it is that this
data stay online.

John

^ permalink raw reply

* Re: [PATCH] percpu-refcount: support synchronous switch to atomic mode.
From: Tejun Heo @ 2017-03-22 15:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Lameter, linux-kernel, Shaohua Li, Linux-RAID
In-Reply-To: <87o9wuaxm3.fsf@notabene.neil.brown.name>

On Wed, Mar 22, 2017 at 12:50:12PM +1100, NeilBrown wrote:
> 
> percpu_ref_switch_to_atomic_sync() schedules the switch
> to atomic mode, then waits for it to complete.
> 
> Also export percpu_ref_switch_to_* so they can be used from modules.
> 
> This will be used in md/raid to count the number of pending write
> requests to an array.
> We occasionally need to check if the count is zero, but most often
> we don't care.
> We always want updates to the counter to be fast, as in some cases
> we count every 4K page.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

Acked-by: Tejun Heo <tj@kernel.org>

Please feel free to route through other patches through md tree.  If
preferable, I can route it through percpu/for-4.12.

Thanks.

-- 
tejun

^ permalink raw reply

* [PATCH 0/2] md/dm-raid: add/use API to switch raid4/5/6 journal cache modes
From: heinzm @ 2017-03-22 16:44 UTC (permalink / raw)
  To: linux-raid; +Cc: heinzm

From: Heinz Mauelshagen <heinzm@redhat.com>

Upstream commit 7a771ceac771d009f7203c40b256b0608d7ea2f8 added
journal support to dm-raid to close the raid4/5/6 "write hole"
but doesn't provide an API to switch modes from device-mapper.

Add missing API to MD and use it from device-mapper dm-raid
thus supporting 'write-back' journal caching in the raid target.


Heinz Mauelshagen (2):
  md: add raid4/5/6 journal mode API (for dm-raid use)
  dm raid: add raid4/5/6 journal mode support

 Documentation/device-mapper/dm-raid.txt | 11 +++-
 drivers/md/dm-raid.c                    | 91 ++++++++++++++++++++++++++++-----
 drivers/md/raid5-cache.c                | 62 ++++++++++++----------
 drivers/md/raid5.h                      | 11 ++++
 4 files changed, 133 insertions(+), 42 deletions(-)

-- 
2.9.3


^ permalink raw reply

* [PATCH 1/2] md: add raid4/5/6 journal mode API (for dm-raid use)
From: heinzm @ 2017-03-22 16:44 UTC (permalink / raw)
  To: linux-raid; +Cc: heinzm

From: Heinz Mauelshagen <heinzm@redhat.com>

Upstream commit 2ded370373a400c20cf0c6e941e724e61582a867
started the addition of "write-back" mode to MD (raid5-cache),
i.e. support write-back caching on the raid journal device.

In order to allow the dm-raid target to switch between
the available "write-through" and "write-back" modes,
provide new r5c_journal_mode_set() API.

Use the new API in existing r5c_journal_mode_store()

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
---
 drivers/md/raid5-cache.c | 62 ++++++++++++++++++++++++++----------------------
 drivers/md/raid5.h       | 11 +++++++++
 2 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be..218b6f3 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -53,16 +53,6 @@
  */
 #define R5L_POOL_SIZE	4
 
-/*
- * r5c journal modes of the array: write-back or write-through.
- * write-through mode has identical behavior as existing log only
- * implementation.
- */
-enum r5c_journal_mode {
-	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
-	R5C_JOURNAL_MODE_WRITE_BACK = 1,
-};
-
 static char *r5c_journal_mode_str[] = {"write-through",
 				       "write-back"};
 /*
@@ -2327,40 +2317,56 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
 	return ret;
 }
 
-static ssize_t r5c_journal_mode_store(struct mddev *mddev,
-				      const char *page, size_t length)
+/*
+ * Set journal cache mode on @mddev (external API initially needed by dm-raid).
+ *
+ * @mode as defined in 'enum r5c_journal_mode'.
+ *
+ */
+int r5c_journal_mode_set(struct mddev *mddev, int mode)
 {
 	struct r5conf *conf = mddev->private;
 	struct r5l_log *log = conf->log;
-	int val = -1, i;
-	int len = length;
 
 	if (!log)
 		return -ENODEV;
 
-	if (len && page[len - 1] == '\n')
-		len -= 1;
-	for (i = 0; i < ARRAY_SIZE(r5c_journal_mode_str); i++)
-		if (strlen(r5c_journal_mode_str[i]) == len &&
-		    strncmp(page, r5c_journal_mode_str[i], len) == 0) {
-			val = i;
-			break;
-		}
-	if (val < R5C_JOURNAL_MODE_WRITE_THROUGH ||
-	    val > R5C_JOURNAL_MODE_WRITE_BACK)
+	if (mode < R5C_JOURNAL_MODE_WRITE_THROUGH ||
+	    mode > R5C_JOURNAL_MODE_WRITE_BACK)
 		return -EINVAL;
 
 	if (raid5_calc_degraded(conf) > 0 &&
-	    val == R5C_JOURNAL_MODE_WRITE_BACK)
+	    mode == R5C_JOURNAL_MODE_WRITE_BACK)
 		return -EINVAL;
 
 	mddev_suspend(mddev);
-	conf->log->r5c_journal_mode = val;
+	conf->log->r5c_journal_mode = mode;
 	mddev_resume(mddev);
 
 	pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
-		 mdname(mddev), val, r5c_journal_mode_str[val]);
-	return length;
+		 mdname(mddev), mode, r5c_journal_mode_str[mode]);
+	return 0;
+}
+EXPORT_SYMBOL(r5c_journal_mode_set);
+
+static ssize_t r5c_journal_mode_store(struct mddev *mddev,
+				      const char *page, size_t length)
+{
+	int mode = ARRAY_SIZE(r5c_journal_mode_str);
+	size_t len = length;
+
+	if (len < 2)
+		return -EINVAL;
+
+	if (page[len - 1] == '\n')
+		len--;
+
+	while (mode--)
+		if (strlen(r5c_journal_mode_str[mode]) == len &&
+		    !strncmp(page, r5c_journal_mode_str[mode], len))
+			break;
+
+	return r5c_journal_mode_set(mddev, mode) ?: length;
 }
 
 struct md_sysfs_entry
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4bb27b9..ec8ca15 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -547,6 +547,16 @@ struct r5worker_group {
 	int stripes_cnt;
 };
 
+/*
+ * r5c journal modes of the array: write-back or write-through.
+ * write-through mode has identical behavior as existing log only
+ * implementation.
+ */
+enum r5c_journal_mode {
+	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
+	R5C_JOURNAL_MODE_WRITE_BACK = 1,
+};
+
 enum r5_cache_state {
 	R5_INACTIVE_BLOCKED,	/* release of inactive stripes blocked,
 				 * waiting for 25% to be free
@@ -795,4 +805,5 @@ extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
 extern void r5c_update_on_rdev_error(struct mddev *mddev);
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);
 #endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/2] dm raid: add raid4/5/6 journal mode support
From: heinzm @ 2017-03-22 16:44 UTC (permalink / raw)
  To: linux-raid; +Cc: heinzm

From: Heinz Mauelshagen <heinzm@redhat.com>

Upstream commit 7a771ceac771d009f7203c40b256b0608d7ea2f8
added journal support to close the raid4/5/6 "write hole".

Uitilizing prerequisite
"md: add raid4/5/6 journal mode API (for dm-raid use)",
add support to switch the journal device cache mode between
write-through (the current default) and write-back using
the new r5c_journal_mode_set() API.

Mind in case of journal device failures the former 'only' causes
the "write hole" to reoccur but the latter (i.e. "write-back")
will cause data loss with dirty cache entries thus requiring
resilient storage.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
---
 Documentation/device-mapper/dm-raid.txt |  12 +++-
 drivers/md/dm-raid.c                    | 104 ++++++++++++++++++++++++++++----
 2 files changed, 102 insertions(+), 14 deletions(-)

diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt
index 95c4c8d..baaf73c 100644
--- a/Documentation/device-mapper/dm-raid.txt
+++ b/Documentation/device-mapper/dm-raid.txt
@@ -170,6 +170,14 @@ The target is named "raid" and it accepts the following parameters:
 		Takeover/reshape is not possible with a raid4/5/6 journal device;
 		it has to be deconfigured before requesting these.
 
+	[journal_mode <mode>]
+		This option sets the caching mode an journaled raid4/5/6 raid sets
+		(see 'journal_dev <dev>' above) to 'writethrough' or 'writeback'.
+		In case 'writeback' is selected the journal device has to be
+		resilient and obviously may not suffer from the 'write hole'
+		problem itself (e.g. use raid1/raid10) to avoid a single
+		point of failure.
+
 <#raid_devs>: The number of devices composing the array.
 	Each device consists of two entries.  The first is the device
 	containing the metadata (if any); the second is the one containing the
@@ -254,7 +262,8 @@ recovery.  Here is a fuller description of the individual fields:
 	<data_offset>   The current data offset to the start of the user data on
 			each component device of a raid set (see the respective
 			raid parameter to support out-of-place reshaping).
-	<journal_char>	'A' - active raid4/5/6 journal device.
+	<journal_char>	'A' - active write-through journal device.
+			'a' - active write-back journal device.
 			'D' - dead journal device.
 			'-' - no journal device.
 
@@ -334,3 +343,4 @@ Version History
 1.10.1  Fix data corruption on reshape request
 1.11.0  Fix table line argument order
 	(wrong raid10_copies/raid10_format sequence)
+1.11.1  Add support for raid4/5/6 journal mode
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index e07185f..0f61bb6 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (C) 2010-2011 Neil Brown
- * Copyright (C) 2010-2016 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2010-2017 Red Hat, Inc. All rights reserved.
  *
  * This file is released under the GPL.
  */
@@ -79,7 +79,10 @@ struct raid_dev {
 #define __CTR_FLAG_RAID10_USE_NEAR_SETS 14 /* 2 */ /* Only with raid10! */
 
 /* New for v1.10.0 */
-#define __CTR_FLAG_JOURNAL_DEV		15 /* 2 */ /* Only with raid4/5/6! */
+#define __CTR_FLAG_JOURNAL_DEV		15 /* 2 */ /* Only with raid4/5/6 (journal device)! */
+
+/* New for v1.11.1 */
+#define __CTR_FLAG_JOURNAL_MODE		16 /* 2 */ /* Only with raid4/5/6 (journal mode)! */
 
 /*
  * Flags for rs->ctr_flags field.
@@ -100,6 +103,7 @@ struct raid_dev {
 #define CTR_FLAG_DATA_OFFSET		(1 << __CTR_FLAG_DATA_OFFSET)
 #define CTR_FLAG_RAID10_USE_NEAR_SETS	(1 << __CTR_FLAG_RAID10_USE_NEAR_SETS)
 #define CTR_FLAG_JOURNAL_DEV		(1 << __CTR_FLAG_JOURNAL_DEV)
+#define CTR_FLAG_JOURNAL_MODE		(1 << __CTR_FLAG_JOURNAL_MODE)
 
 #define RESUME_STAY_FROZEN_FLAGS (CTR_FLAG_DELTA_DISKS | CTR_FLAG_DATA_OFFSET)
 
@@ -175,7 +179,8 @@ struct raid_dev {
 				 CTR_FLAG_REGION_SIZE | \
 				 CTR_FLAG_DELTA_DISKS | \
 				 CTR_FLAG_DATA_OFFSET | \
-				 CTR_FLAG_JOURNAL_DEV)
+				 CTR_FLAG_JOURNAL_DEV | \
+				 CTR_FLAG_JOURNAL_MODE)
 
 #define RAID6_VALID_FLAGS	(CTR_FLAG_SYNC | \
 				 CTR_FLAG_REBUILD | \
@@ -186,7 +191,8 @@ struct raid_dev {
 				 CTR_FLAG_REGION_SIZE | \
 				 CTR_FLAG_DELTA_DISKS | \
 				 CTR_FLAG_DATA_OFFSET | \
-				 CTR_FLAG_JOURNAL_DEV)
+				 CTR_FLAG_JOURNAL_DEV | \
+				 CTR_FLAG_JOURNAL_MODE)
 /* ...valid options definitions per raid level */
 
 /*
@@ -239,6 +245,7 @@ struct raid_set {
 	struct journal_dev {
 		struct dm_dev *dev;
 		struct md_rdev rdev;
+		int mode;
 	} journal_dev;
 
 	struct raid_dev dev[0];
@@ -326,6 +333,7 @@ static struct arg_name_flag {
 	{ CTR_FLAG_DELTA_DISKS, "delta_disks"},
 	{ CTR_FLAG_RAID10_USE_NEAR_SETS, "raid10_use_near_sets"},
 	{ CTR_FLAG_JOURNAL_DEV, "journal_dev" },
+	{ CTR_FLAG_JOURNAL_MODE, "journal_mode" },
 };
 
 /* Return argument name string for given @flag */
@@ -344,6 +352,39 @@ static const char *dm_raid_arg_name_by_flag(const uint32_t flag)
 	return NULL;
 }
 
+/* Define correlation of raid456 journal cache modes and dm-raid target line parameters */
+static struct {
+	const int mode;
+	const char *param;
+} _raid456_journal_mode[] = {
+	{ R5C_JOURNAL_MODE_WRITE_THROUGH , "writethrough" },
+	{ R5C_JOURNAL_MODE_WRITE_BACK    , "writeback" }
+};
+
+/* Return MD raid4/5/6 journal mode for dm @journal_mode one */
+static int dm_raid_journal_mode_to_md(const char *mode)
+{
+	int m = ARRAY_SIZE(_raid456_journal_mode);
+
+	while (m--)
+		if (!strcasecmp(mode, _raid456_journal_mode[m].param))
+			return _raid456_journal_mode[m].mode;
+
+	return -EINVAL;
+}
+
+/* Return dm-raid raid4/5/6 journal mode string for @mode */
+static const char *md_journal_mode_to_dm_raid(const int mode)
+{
+	int m = ARRAY_SIZE(_raid456_journal_mode);
+
+	while (m--)
+		if (mode == _raid456_journal_mode[m].mode)
+			return _raid456_journal_mode[m].param;
+
+	return "unknown";
+}
+
 /*
  * Bool helpers to test for various raid levels of a raid set.
  * It's level as reported by the superblock rather than
@@ -1183,7 +1224,7 @@ static int parse_raid_params(struct raid_set *rs, struct dm_arg_set *as,
 			continue;
 		}
 
-		/* "journal_dev dev" */
+		/* "journal_dev <dev>" */
 		if (!strcasecmp(key, dm_raid_arg_name_by_flag(CTR_FLAG_JOURNAL_DEV))) {
 			int r;
 			struct md_rdev *jdev;
@@ -1211,10 +1252,32 @@ static int parse_raid_params(struct raid_set *rs, struct dm_arg_set *as,
 				rs->ti->error = "No space for raid4/5/6 journal";
 				return -ENOSPC;
 			}
+			rs->journal_dev.mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
 			set_bit(Journal, &jdev->flags);
 			continue;
 		}
 
+		/* "journal_mode <mode>" ("journal_dev" mandatory!) */
+		if (!strcasecmp(key, dm_raid_arg_name_by_flag(CTR_FLAG_JOURNAL_MODE))) {
+			int r;
+
+			if (!test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags)) {
+				rs->ti->error = "raid4/5/6 'journal_mode' is invalid without 'journal_dev'";
+				return -EINVAL;
+			}
+			if (test_and_set_bit(__CTR_FLAG_JOURNAL_MODE, &rs->ctr_flags)) {
+				rs->ti->error = "Only one raid4/5/6 'journal_mode' argument allowed";
+				return -EINVAL;
+			}
+			r = dm_raid_journal_mode_to_md(arg);
+			if (r < 0) {
+				rs->ti->error = "Invalid 'journal_mode' argument";
+				return r;
+			}
+			rs->journal_dev.mode = r;
+			continue;
+		}
+
 		/*
 		 * Parameters with number values from here on.
 		 */
@@ -3076,6 +3139,16 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	rs->callbacks.congested_fn = raid_is_congested;
 	dm_table_add_target_callbacks(ti->table, &rs->callbacks);
 
+	/* If raid4/5/6 journal mode explictely requested (only possible with journal dev) -> set it */
+	if (test_bit(__CTR_FLAG_JOURNAL_MODE, &rs->ctr_flags)) {
+		r = r5c_journal_mode_set(&rs->md, rs->journal_dev.mode);
+		if (r) {
+			ti->error = "Failed to set raid4/5/6 journal mode";
+			mddev_unlock(&rs->md);
+			goto bad_journal_mode_set;
+		}
+	}
+
 	mddev_suspend(&rs->md);
 
 	/* Try to adjust the raid4/5/6 stripe cache size to the stripe size */
@@ -3109,6 +3182,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	mddev_unlock(&rs->md);
 	return 0;
 
+bad_journal_mode_set:
 bad_stripe_cache:
 bad_check_reshape:
 	md_stop(&rs->md);
@@ -3180,18 +3254,18 @@ static const char *decipher_sync_action(struct mddev *mddev)
  * Status characters:
  *
  *  'D' = Dead/Failed raid set component or raid4/5/6 journal device
- *  'a' = Alive but not in-sync
- *  'A' = Alive and in-sync raid set component or alive raid4/5/6 journal device
+ *  'a' = Alive but not in-sync raid set component _or_ alive raid4/5/6 'write_back' journal device
+ *  'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device
  *  '-' = Non-existing device (i.e. uspace passed '- -' into the ctr)
  */
-static const char *__raid_dev_status(struct md_rdev *rdev, bool array_in_sync)
+static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, bool array_in_sync)
 {
 	if (!rdev->bdev)
 		return "-";
 	else if (test_bit(Faulty, &rdev->flags))
 		return "D";
 	else if (test_bit(Journal, &rdev->flags))
-		return "A";
+		return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a";
 	else if (!array_in_sync || !test_bit(In_sync, &rdev->flags))
 		return "a";
 	else
@@ -3315,7 +3389,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
 
 		/* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */
 		for (i = 0; i < rs->raid_disks; i++)
-			DMEMIT(__raid_dev_status(&rs->dev[i].rdev, array_in_sync));
+			DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev, array_in_sync));
 
 		/*
 		 * In-sync/Reshape ratio:
@@ -3366,7 +3440,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
 		 * v1.10.0+:
 		 */
 		DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ?
-			      __raid_dev_status(&rs->journal_dev.rdev, 0) : "-");
+			      __raid_dev_status(rs, &rs->journal_dev.rdev, 0) : "-");
 		break;
 
 	case STATUSTYPE_TABLE:
@@ -3381,7 +3455,8 @@ static void raid_status(struct dm_target *ti, status_type_t type,
 				  write_mostly_params +
 				  hweight32(rs->ctr_flags & CTR_FLAG_OPTIONS_NO_ARGS) +
 				  hweight32(rs->ctr_flags & CTR_FLAG_OPTIONS_ONE_ARG) * 2 +
-				  (test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ? 2 : 0);
+				  (test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ? 2 : 0) +
+				  (test_bit(__CTR_FLAG_JOURNAL_MODE, &rs->ctr_flags) ? 2 : 0);
 
 		/* Emit table line */
 		/* This has to be in the documented order for userspace! */
@@ -3433,6 +3508,9 @@ static void raid_status(struct dm_target *ti, status_type_t type,
 		if (test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags))
 			DMEMIT(" %s %s", dm_raid_arg_name_by_flag(CTR_FLAG_JOURNAL_DEV),
 					__get_dev_name(rs->journal_dev.dev));
+		if (test_bit(__CTR_FLAG_JOURNAL_MODE, &rs->ctr_flags))
+			DMEMIT(" %s %s", dm_raid_arg_name_by_flag(CTR_FLAG_JOURNAL_MODE),
+					 md_journal_mode_to_dm_raid(rs->journal_dev.mode));
 		DMEMIT(" %d", rs->raid_disks);
 		for (i = 0; i < rs->raid_disks; i++)
 			DMEMIT(" %s %s", __get_dev_name(rs->dev[i].meta_dev),
@@ -3793,7 +3871,7 @@ static void raid_resume(struct dm_target *ti)
 
 static struct target_type raid_target = {
 	.name = "raid",
-	.version = {1, 11, 0},
+	.version = {1, 11, 1},
 	.module = THIS_MODULE,
 	.ctr = raid_ctr,
 	.dtr = raid_dtr,
-- 
2.9.3


^ permalink raw reply related

* Re: mdadm: one question about the readonly and readwrite feature
From: NeilBrown @ 2017-03-22 21:55 UTC (permalink / raw)
  To: Zhilong Liu, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <774691e9-6b1e-5141-bc37-6e768c1c8fdc@suse.com>

[-- Attachment #1: Type: text/plain, Size: 4103 bytes --]

On Wed, Mar 22 2017, Zhilong Liu wrote:

> Hi, Neil;
>
>    Excuse me, according to read 'mdadm/tests/ToTest', I'm a little 
> confused about "readonly"
> and "readwrite" feature, and I've no idea how to fix it. Thus I report 
> this question and I'm sorry
> for this long description email.
>
> relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491
>
>    My question: If the array has been set the MD_CLOSING flag, although 
> hasn't removed the sysfs
> folder because sysfs_remove_group() wasn't invoked, and now, how should 
> mdadm continue to
> control this 'readonly' array?

MD_CLOSING should only be set for a short period or time to avoid
certain races.  After the operation that set it completes, it should be
cleared.
It looks like this is a bug that was introduced in
Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
when MD_STILL_CLOSED was renamed to MD_CLOSING.

If we already had the tests for readonly/readwrite that you are working
on, we might have caught the bug earlier - so I'm glad you are working
on these tests..

Thanks,
NeilBrown


>    Of course, once we cannot operate the array, the 'readwrite' feature 
> would be never worked.
>
> Test step:
> # ./mdadm -CR /dev/md0 -b internal -l1 -n2 /dev/loop[0-1] --assume-clean
> # ./mdadm -o /dev/mdX
>
> # in md.h
> enum mddev_flags {
>          MD_ARRAY_FIRST_USE,     /* First use of array, needs 
> initialization */
>          MD_CLOSING,             /* If set, we are closing the array, do 
> not open it then */
>
> 1. In mdadm tool:
>    the func: Manage_ro(dv->devname, mdfd, -1) would be never invoked 
> once the array has been
> set 'readonly' before. the open_mddev() cannot get a valid file 
> descriptor any more. Most of mdadm
> commands would be failure, I have to execute the "echo clear > 
> /sys/block/mdX/md/array_state".
>
> # refer to mdadm.c
> ... ...
> static int misc_list(struct mddev_dev *devlist,
>                       struct mddev_ident *ident,
>                       char *dump_directory,
>                       struct supertype *ss, struct context *c)
> {
> ... ...
>                  switch(dv->devname[0] == '/') {
>                          case 0:
>                                  mdfd = open_dev(dv->devname);
>                                  if (mdfd >= 0) break;
>                          case 1:
>                                  mdfd = open_mddev(dv->devname, 1);
>                  }
>                  if (mdfd>=0) {
>                          switch(dv->disposition) {
>                          case 'R':
>                                  c->runstop = 1;
>                                  rv |= Manage_run(dv->devname, mdfd, c); 
> break;
>                          case 'S':
>                                  rv |= Manage_stop(dv->devname, mdfd, 
> c->verbose, 0); break;
>                          case 'o':
>                                  rv |= Manage_ro(dv->devname, mdfd, 1); 
> break;
>                          case 'w':
>                                  rv |= Manage_ro(dv->devname, mdfd, -1); 
> break;
>                          }
> 2. in md driver:
> For readonly, the code path is:
> ioctl(fd, STOP_ARRAY_RO, NULL)  - - > set_bit(MD_CLOSING, &mddev->flags) 
> - - > md_set_readonly()
>
> cut a piece of code: - -> md_set_readonly() of md.c:
> ... ...
>          if (mddev->pers) {
>                  __md_stop_writes(mddev);
>
>                  err  = -ENXIO;
>                  if (mddev->ro==1)
>                          goto out;
>                  mddev->ro = 1;
>                  set_disk_ro(mddev->gendisk, 1);
>                  clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>                  set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);   - - > 
> I think it did nothing once readonly has been set.
>                  md_wakeup_thread(mddev->thread);
>                  sysfs_notify_dirent_safe(mddev->sysfs_state);
>                  err = 0;
> ... ...
>
> Thanks for your patience,
> -Zhilong

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: mdadm: one question about the readonly and readwrite feature
From: Guoqing Jiang @ 2017-03-23  1:54 UTC (permalink / raw)
  To: NeilBrown, Zhilong Liu, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <87zigdt1r1.fsf@notabene.neil.brown.name>



On 03/23/2017 05:55 AM, NeilBrown wrote:
> On Wed, Mar 22 2017, Zhilong Liu wrote:
>
>> Hi, Neil;
>>
>>     Excuse me, according to read 'mdadm/tests/ToTest', I'm a little
>> confused about "readonly"
>> and "readwrite" feature, and I've no idea how to fix it. Thus I report
>> this question and I'm sorry
>> for this long description email.
>>
>> relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491
>>
>>     My question: If the array has been set the MD_CLOSING flag, although
>> hasn't removed the sysfs
>> folder because sysfs_remove_group() wasn't invoked, and now, how should
>> mdadm continue to
>> control this 'readonly' array?
> MD_CLOSING should only be set for a short period or time to avoid
> certain races.  After the operation that set it completes, it should be
> cleared.
> It looks like this is a bug that was introduced in
> Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
> when MD_STILL_CLOSED was renamed to MD_CLOSING.

I guess it is because we set MD_CLOSING for STOP_ARRAY_RO cmd, then commit
af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") did below changes:

@@ -7075,9 +7073,13 @@ static int md_open(struct block_device *bdev, 
fmode_t mode)
         if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
                 goto out;

+       if (test_bit(MD_CLOSING, &mddev->flags)) {
+               mutex_unlock(&mddev->open_mutex);
+               return -ENODEV;
+       }

Maybe we need to differentiate "STOP_ARRAY" and "STOP_ARRAY_RO", or 
revert above
changes.

Thanks,
Guoqing

>
> If we already had the tests for readonly/readwrite that you are working
> on, we might have caught the bug earlier - so I'm glad you are working
> on these tests..
>
> Thanks,
> NeilBrown
>
>
>>     Of course, once we cannot operate the array, the 'readwrite' feature
>> would be never worked.
>>
>> Test step:
>> # ./mdadm -CR /dev/md0 -b internal -l1 -n2 /dev/loop[0-1] --assume-clean
>> # ./mdadm -o /dev/mdX
>>
>> # in md.h
>> enum mddev_flags {
>>           MD_ARRAY_FIRST_USE,     /* First use of array, needs
>> initialization */
>>           MD_CLOSING,             /* If set, we are closing the array, do
>> not open it then */
>>
>> 1. In mdadm tool:
>>     the func: Manage_ro(dv->devname, mdfd, -1) would be never invoked
>> once the array has been
>> set 'readonly' before. the open_mddev() cannot get a valid file
>> descriptor any more. Most of mdadm
>> commands would be failure, I have to execute the "echo clear >
>> /sys/block/mdX/md/array_state".
>>
>> # refer to mdadm.c
>> ... ...
>> static int misc_list(struct mddev_dev *devlist,
>>                        struct mddev_ident *ident,
>>                        char *dump_directory,
>>                        struct supertype *ss, struct context *c)
>> {
>> ... ...
>>                   switch(dv->devname[0] == '/') {
>>                           case 0:
>>                                   mdfd = open_dev(dv->devname);
>>                                   if (mdfd >= 0) break;
>>                           case 1:
>>                                   mdfd = open_mddev(dv->devname, 1);
>>                   }
>>                   if (mdfd>=0) {
>>                           switch(dv->disposition) {
>>                           case 'R':
>>                                   c->runstop = 1;
>>                                   rv |= Manage_run(dv->devname, mdfd, c);
>> break;
>>                           case 'S':
>>                                   rv |= Manage_stop(dv->devname, mdfd,
>> c->verbose, 0); break;
>>                           case 'o':
>>                                   rv |= Manage_ro(dv->devname, mdfd, 1);
>> break;
>>                           case 'w':
>>                                   rv |= Manage_ro(dv->devname, mdfd, -1);
>> break;
>>                           }
>> 2. in md driver:
>> For readonly, the code path is:
>> ioctl(fd, STOP_ARRAY_RO, NULL)  - - > set_bit(MD_CLOSING, &mddev->flags)
>> - - > md_set_readonly()
>>
>> cut a piece of code: - -> md_set_readonly() of md.c:
>> ... ...
>>           if (mddev->pers) {
>>                   __md_stop_writes(mddev);
>>
>>                   err  = -ENXIO;
>>                   if (mddev->ro==1)
>>                           goto out;
>>                   mddev->ro = 1;
>>                   set_disk_ro(mddev->gendisk, 1);
>>                   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>                   set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);   - - >
>> I think it did nothing once readonly has been set.
>>                   md_wakeup_thread(mddev->thread);
>>                   sysfs_notify_dirent_safe(mddev->sysfs_state);
>>                   err = 0;
>> ... ...
>>
>> Thanks for your patience,
>> -Zhilong


^ permalink raw reply

* Re: [PATCH] percpu-refcount: support synchronous switch to atomic mode.
From: Shaohua Li @ 2017-03-23  2:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: NeilBrown, Christoph Lameter, linux-kernel, Linux-RAID
In-Reply-To: <20170322150029.GA7532@htj.duckdns.org>

On Wed, Mar 22, 2017 at 11:00:29AM -0400, Tejun Heo wrote:
> On Wed, Mar 22, 2017 at 12:50:12PM +1100, NeilBrown wrote:
> > 
> > percpu_ref_switch_to_atomic_sync() schedules the switch
> > to atomic mode, then waits for it to complete.
> > 
> > Also export percpu_ref_switch_to_* so they can be used from modules.
> > 
> > This will be used in md/raid to count the number of pending write
> > requests to an array.
> > We occasionally need to check if the count is zero, but most often
> > we don't care.
> > We always want updates to the counter to be fast, as in some cases
> > we count every 4K page.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.com>
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> 
> Please feel free to route through other patches through md tree.  If
> preferable, I can route it through percpu/for-4.12.

Thanks, Tejun. I applied it to md tree, that will avoid a lot of troubles.

Thanks,
Shaohua

^ permalink raw reply

* Re: REALLY Fix bug in [md PATCH 02/15] md/raid5: simplfy delaying of writes while metadata is updated.
From: Shaohua Li @ 2017-03-23  2:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <87a88eavi3.fsf@notabene.neil.brown.name>

On Wed, Mar 22, 2017 at 01:35:48PM +1100, Neil Brown wrote:
> On Wed, Mar 22 2017, NeilBrown wrote:
> 
> > Using r1_bio->sector in call_bio_endio is more
> > obviously-correct than bio->bi_iter.bi_sector,
> > though both should have the same value.
> >
> > The inc_pending() call in handle_read_error() was missing.
> > One should always accompany a bio_inc_remaining.
> >
> > Signed-off-by: NeilBrown <neilb@suse.com>
> > ---
> >
> > Sorry, I left an unused variable...
> > NeilBrown
> 
> Sorry again - I replied to the wrong email (not a good day...)
> and gave the wrong subject.
> This should be
>   REALLY Fix bugs in [md PATCH 10/15] md/raid1: stop using bi_phys_segment
> 
> :-(

Thanks, applied all the patches except the 'trace completion' patch.

Thanks,
Shaohua

^ permalink raw reply

* Re: mdadm: one question about the readonly and readwrite feature
From: Guoqing Jiang @ 2017-03-23  2:26 UTC (permalink / raw)
  To: NeilBrown, Zhilong Liu, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <58D32AE2.1030303@suse.com>



On 03/23/2017 09:54 AM, Guoqing Jiang wrote:
>
>
> On 03/23/2017 05:55 AM, NeilBrown wrote:
>> On Wed, Mar 22 2017, Zhilong Liu wrote:
>>
>>> Hi, Neil;
>>>
>>>     Excuse me, according to read 'mdadm/tests/ToTest', I'm a little
>>> confused about "readonly"
>>> and "readwrite" feature, and I've no idea how to fix it. Thus I report
>>> this question and I'm sorry
>>> for this long description email.
>>>
>>> relevant linux/driver/md commit: 
>>> 260fa034ef7a4ff8b73068b48ac497edd5217491
>>>
>>>     My question: If the array has been set the MD_CLOSING flag, 
>>> although
>>> hasn't removed the sysfs
>>> folder because sysfs_remove_group() wasn't invoked, and now, how should
>>> mdadm continue to
>>> control this 'readonly' array?
>> MD_CLOSING should only be set for a short period or time to avoid
>> certain races.  After the operation that set it completes, it should be
>> cleared.
>> It looks like this is a bug that was introduced in
>> Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
>> when MD_STILL_CLOSED was renamed to MD_CLOSING.
>
> I guess it is because we set MD_CLOSING for STOP_ARRAY_RO cmd, then 
> commit
> af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") did below changes:
>
> @@ -7075,9 +7073,13 @@ static int md_open(struct block_device *bdev, 
> fmode_t mode)
>         if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
>                 goto out;
>
> +       if (test_bit(MD_CLOSING, &mddev->flags)) {
> +               mutex_unlock(&mddev->open_mutex);
> +               return -ENODEV;
> +       }
>
> Maybe we need to differentiate "STOP_ARRAY" and "STOP_ARRAY_RO", or 
> revert above
> changes.

Or maybe something like below (only compile test).

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 42e68b2e0b41..c40e863fe191 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -65,6 +65,7 @@
  #include <linux/raid/md_p.h>
  #include <linux/raid/md_u.h>
  #include <linux/slab.h>
+#include <linux/genhd.h>
  #include <trace/events/block.h>
  #include "md.h"
  #include "bitmap.h"
@@ -7237,7 +7238,7 @@ static int md_open(struct block_device *bdev, 
fmode_t mode)
         if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
                 goto out;

-       if (test_bit(MD_CLOSING, &mddev->flags)) {
+       if (!get_disk_ro(mddev->gendisk) && test_bit(MD_CLOSING, 
&mddev->flags)) {
                 mutex_unlock(&mddev->open_mutex);
                 err = -ENODEV;
                 goto out;

Thanks,
Guoqing

^ permalink raw reply related

* Re: mdadm: one question about the readonly and readwrite feature
From: NeilBrown @ 2017-03-23  3:42 UTC (permalink / raw)
  To: Guoqing Jiang, Zhilong Liu, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <58D32AE2.1030303@suse.com>

[-- Attachment #1: Type: text/plain, Size: 2965 bytes --]

On Thu, Mar 23 2017, Guoqing Jiang wrote:

> On 03/23/2017 05:55 AM, NeilBrown wrote:
>> On Wed, Mar 22 2017, Zhilong Liu wrote:
>>
>>> Hi, Neil;
>>>
>>>     Excuse me, according to read 'mdadm/tests/ToTest', I'm a little
>>> confused about "readonly"
>>> and "readwrite" feature, and I've no idea how to fix it. Thus I report
>>> this question and I'm sorry
>>> for this long description email.
>>>
>>> relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491
>>>
>>>     My question: If the array has been set the MD_CLOSING flag, although
>>> hasn't removed the sysfs
>>> folder because sysfs_remove_group() wasn't invoked, and now, how should
>>> mdadm continue to
>>> control this 'readonly' array?
>> MD_CLOSING should only be set for a short period or time to avoid
>> certain races.  After the operation that set it completes, it should be
>> cleared.
>> It looks like this is a bug that was introduced in
>> Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
>> when MD_STILL_CLOSED was renamed to MD_CLOSING.
>
> I guess it is because we set MD_CLOSING for STOP_ARRAY_RO cmd, then commit
> af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") did below changes:
>
> @@ -7075,9 +7073,13 @@ static int md_open(struct block_device *bdev, 
> fmode_t mode)
>          if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
>                  goto out;
>
> +       if (test_bit(MD_CLOSING, &mddev->flags)) {
> +               mutex_unlock(&mddev->open_mutex);
> +               return -ENODEV;
> +       }
>
> Maybe we need to differentiate "STOP_ARRAY" and "STOP_ARRAY_RO", or 
> revert above
> changes.
>

No, your missing the point.
What you describe above would change the effect of what MD_CLOSING does.
We don't want to change it.  It is good.
The problem is that when it has finished doing what it needs to do,
we aren't clearing it. That is simply a bug.

So something like this is needed.

NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7e168ff1ae90..567aba246497 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6934,6 +6934,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 	void __user *argp = (void __user *)arg;
 	struct mddev *mddev = NULL;
 	int ro;
+	bool did_set_md_closing = false;
 
 	if (!md_ioctl_valid(cmd))
 		return -ENOTTY;
@@ -7023,7 +7024,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 			err = -EBUSY;
 			goto out;
 		}
+		WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
 		set_bit(MD_CLOSING, &mddev->flags);
+		did_set_md_closing = true;
 		mutex_unlock(&mddev->open_mutex);
 		sync_blockdev(bdev);
 	}
@@ -7216,6 +7219,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 		mddev->hold_active = 0;
 	mddev_unlock(mddev);
 out:
+	if (did_set_md_closing)
+		clear_bit(MD_CLOSING, &mddev->flags);
 	return err;
 }
 #ifdef CONFIG_COMPAT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: mdadm: one question about the readonly and readwrite feature
From: Zhilong Liu @ 2017-03-23  3:54 UTC (permalink / raw)
  To: NeilBrown, Guoqing Jiang, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <871stobqvl.fsf@notabene.neil.brown.name>


On 03/23/2017 11:42 AM, NeilBrown wrote:
> On Thu, Mar 23 2017, Guoqing Jiang wrote:
>
>> On 03/23/2017 05:55 AM, NeilBrown wrote:
>>> On Wed, Mar 22 2017, Zhilong Liu wrote:
>>>
>>>> Hi, Neil;
>>>>
>>>>      Excuse me, according to read 'mdadm/tests/ToTest', I'm a little
>>>> confused about "readonly"
>>>> and "readwrite" feature, and I've no idea how to fix it. Thus I report
>>>> this question and I'm sorry
>>>> for this long description email.
>>>>
>>>> relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491
>>>>
>>>>      My question: If the array has been set the MD_CLOSING flag, although
>>>> hasn't removed the sysfs
>>>> folder because sysfs_remove_group() wasn't invoked, and now, how should
>>>> mdadm continue to
>>>> control this 'readonly' array?
>>> MD_CLOSING should only be set for a short period or time to avoid
>>> certain races.  After the operation that set it completes, it should be
>>> cleared.
>>> It looks like this is a bug that was introduced in
>>> Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
>>> when MD_STILL_CLOSED was renamed to MD_CLOSING.
>> I guess it is because we set MD_CLOSING for STOP_ARRAY_RO cmd, then commit
>> af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") did below changes:
>>
>> @@ -7075,9 +7073,13 @@ static int md_open(struct block_device *bdev,
>> fmode_t mode)
>>           if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
>>                   goto out;
>>
>> +       if (test_bit(MD_CLOSING, &mddev->flags)) {
>> +               mutex_unlock(&mddev->open_mutex);
>> +               return -ENODEV;
>> +       }
>>
>> Maybe we need to differentiate "STOP_ARRAY" and "STOP_ARRAY_RO", or
>> revert above
>> changes.
>>
> No, your missing the point.
> What you describe above would change the effect of what MD_CLOSING does.
> We don't want to change it.  It is good.
> The problem is that when it has finished doing what it needs to do,
> we aren't clearing it. That is simply a bug.
>
> So something like this is needed.

I can understand this method, :-). So I would do further learning on 
this code path.
I would response on it later. Thanks very much.

Thanks,
-Zhilong
> NeilBrown
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 7e168ff1ae90..567aba246497 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6934,6 +6934,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>   	void __user *argp = (void __user *)arg;
>   	struct mddev *mddev = NULL;
>   	int ro;
> +	bool did_set_md_closing = false;
>   
>   	if (!md_ioctl_valid(cmd))
>   		return -ENOTTY;
> @@ -7023,7 +7024,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>   			err = -EBUSY;
>   			goto out;
>   		}
> +		WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
>   		set_bit(MD_CLOSING, &mddev->flags);
> +		did_set_md_closing = true;
>   		mutex_unlock(&mddev->open_mutex);
>   		sync_blockdev(bdev);
>   	}
> @@ -7216,6 +7219,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>   		mddev->hold_active = 0;
>   	mddev_unlock(mddev);
>   out:
> +	if (did_set_md_closing)
> +		clear_bit(MD_CLOSING, &mddev->flags);
>   	return err;
>   }
>   #ifdef CONFIG_COMPAT


^ permalink raw reply

* Re: [PATCH] block: trace completion of all bios.
From: NeilBrown @ 2017-03-23  6:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-raid, Mike Snitzer, linux-kernel, linux-block,
	dm-devel, Shaohua Li, Alasdair Kergon
In-Reply-To: <20170322125149.GA29606@infradead.org>


[-- Attachment #1.1: Type: text/plain, Size: 901 bytes --]

On Wed, Mar 22 2017, Christoph Hellwig wrote:

> On Wed, Mar 22, 2017 at 01:38:09PM +1100, NeilBrown wrote:
>> 
>> Currently only dm and md/raid5 bios trigger trace_block_bio_complete().
>> Now that we have bio_chain(), it is not possible, in general, for a
>> driver to know when the bio is really complete.  Only bio_endio()
>> knows that.
>> 
>> So move the trace_block_bio_complete() call to bio_endio().
>
> This will cause duplicate events for request based drivers.  You'll
> need to have a bio_endio_notrace or similar without that the request
> completion path can call.

Ah... I hadn't noticed that the request completion was the same event
type as the bio completion...  Thanks.  Also after being processed by the
request handler, bi_sector and bi_size have changed so the trace
messsage would be wrong.

I've sorted that out and will repost.

Thanks,
NeilBrown



[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply

* [PATCH v2] block: trace completion of all bios.
From: NeilBrown @ 2017-03-23  6:29 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: linux-raid, Martin K . Petersen, Mike Snitzer, linux-kernel,
	linux-block, dm-devel, Shaohua Li, Alasdair Kergon
In-Reply-To: <20170322125149.GA29606@infradead.org>


[-- Attachment #1.1: Type: text/plain, Size: 7116 bytes --]


Currently only dm and md/raid5 bios trigger trace_block_bio_complete().
Now that we have bio_chain(), it is not possible, in general, for a
driver to know when the bio is really complete.  Only bio_endio()
knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() and will need to use a different interface to
   avoid tracing

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restores it and calls bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we provide bio_endio_notrace().  This patch only adds
uses of this in core code.  Separate patches will be needed to update
the filesystems to avoid tracing.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio-integrity.c |  4 ++--
 block/bio.c           | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-core.c      |  2 +-
 drivers/md/dm.c       |  1 -
 drivers/md/raid5.c    |  8 --------
 include/linux/bio.h   |  1 +
 6 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713d48bc..28581e2f68fb 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -370,7 +370,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 
 	/* Restore original bio completion handler */
 	bio->bi_end_io = bip->bip_end_io;
-	bio_endio(bio);
+	bio_endio_notrace(bio);
 }
 
 /**
@@ -397,7 +397,7 @@ void bio_integrity_endio(struct bio *bio)
 	 */
 	if (bio->bi_error) {
 		bio->bi_end_io = bip->bip_end_io;
-		bio_endio(bio);
+		bio_endio_notrace(bio);
 
 		return;
 	}
diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..c8e5d24abd52 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1811,6 +1811,45 @@ static inline bool bio_remaining_done(struct bio *bio)
 }
 
 /**
+ * bio_endio_notrace - end I/O on a bio without tracing
+ * @bio:	bio
+ *
+ * Description:
+ *   bio_endio_notrace() will end I/O on the whole bio.
+ *   bio_endio_notrace() should only be call if a completion trace
+ *   event is not needed.  This can be the case if a request-level
+ *   completion event has already been generated, if the bio is
+ *   being completed early, before it was even queued.
+ *
+ **/
+void bio_endio_notrace(struct bio *bio)
+{
+again:
+	if (!bio_remaining_done(bio))
+		return;
+
+	/*
+	 * Need to have a real endio function for chained bios, otherwise
+	 * various corner cases will break (like stacking block devices that
+	 * save/restore bi_end_io) - however, we want to avoid unbounded
+	 * recursion and blowing the stack. Tail call optimization would
+	 * handle this, but compiling with frame pointers also disables
+	 * gcc's sibling call optimization.
+	 */
+	if (bio->bi_end_io == bio_chain_endio) {
+		bio = __bio_chain_endio(bio);
+		goto again;
+	}
+
+	if (bio->bi_bdev)
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
+	if (bio->bi_end_io)
+		bio->bi_end_io(bio);
+}
+EXPORT_SYMBOL(bio_endio_notrace);
+
+/**
  * bio_endio - end I/O on a bio
  * @bio:	bio
  *
@@ -1818,6 +1857,10 @@ static inline bool bio_remaining_done(struct bio *bio)
  *   bio_endio() will end I/O on the whole bio. bio_endio() is the preferred
  *   way to end I/O on a bio. No one should call bi_end_io() directly on a
  *   bio unless they own it and thus know that it has an end_io function.
+ *
+ *   bio_endio() can be called several times on a bio that has been chained
+ *   using bio_chain().  The ->bi_end_io() function will only be call the
+ *   time.  At this point the BLK_TA_COMPLETE tracing event will be generated.
  **/
 void bio_endio(struct bio *bio)
 {
@@ -1838,6 +1881,9 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	if (bio->bi_bdev)
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..b6c76580a796 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -142,7 +142,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
-		bio_endio(bio);
+		bio_endio_notrace(bio);
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..f5f09ace690a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -810,7 +810,6 @@ static void dec_pending(struct dm_io *io, int error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			trace_block_bio_complete(md->queue, bio, io_error);
 			bio->bi_error = io_error;
 			bio_endio(bio);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9a3b7da34137..f684cb566721 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5141,8 +5141,6 @@ static void raid5_align_endio(struct bio *bi)
 	rdev_dec_pending(rdev, conf->mddev);
 
 	if (!error) {
-		trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-					 raid_bi, 0);
 		bio_endio(raid_bi);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
@@ -5727,10 +5725,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-
-
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
 		bio_endio(bi);
 	}
 }
@@ -6138,8 +6132,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	}
 	remaining = raid5_dec_bi_active_stripes(raid_bio);
 	if (remaining == 0) {
-		trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-					 raid_bio, 0);
 		bio_endio(raid_bio);
 	}
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..e0552bee227b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -418,6 +418,7 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
 extern blk_qc_t submit_bio(struct bio *);
 
 extern void bio_endio(struct bio *);
+extern void bio_endio_notrace(struct bio *);
 
 static inline void bio_io_error(struct bio *bio)
 {
-- 
2.12.0


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related

* Re: mdadm: one question about the readonly and readwrite feature
From: Zhilong Liu @ 2017-03-23  6:50 UTC (permalink / raw)
  To: NeilBrown, Guoqing Jiang, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <871stobqvl.fsf@notabene.neil.brown.name>


On 03/23/2017 11:42 AM, NeilBrown wrote:
> On Thu, Mar 23 2017, Guoqing Jiang wrote:
>
>> On 03/23/2017 05:55 AM, NeilBrown wrote:
>>> On Wed, Mar 22 2017, Zhilong Liu wrote:
>>>
>>>> Hi, Neil;
>>>>
>>>>      Excuse me, according to read 'mdadm/tests/ToTest', I'm a little
>>>> confused about "readonly"
>>>> and "readwrite" feature, and I've no idea how to fix it. Thus I report
>>>> this question and I'm sorry
>>>> for this long description email.
>>>>
>>>> relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491
>>>>
>>>>      My question: If the array has been set the MD_CLOSING flag, although
>>>> hasn't removed the sysfs
>>>> folder because sysfs_remove_group() wasn't invoked, and now, how should
>>>> mdadm continue to
>>>> control this 'readonly' array?
>>> MD_CLOSING should only be set for a short period or time to avoid
>>> certain races.  After the operation that set it completes, it should be
>>> cleared.
>>> It looks like this is a bug that was introduced in
>>> Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
>>> when MD_STILL_CLOSED was renamed to MD_CLOSING.
>> I guess it is because we set MD_CLOSING for STOP_ARRAY_RO cmd, then commit
>> af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") did below changes:
>>
>> @@ -7075,9 +7073,13 @@ static int md_open(struct block_device *bdev,
>> fmode_t mode)
>>           if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
>>                   goto out;
>>
>> +       if (test_bit(MD_CLOSING, &mddev->flags)) {
>> +               mutex_unlock(&mddev->open_mutex);
>> +               return -ENODEV;
>> +       }
>>
>> Maybe we need to differentiate "STOP_ARRAY" and "STOP_ARRAY_RO", or
>> revert above
>> changes.
>>
> No, your missing the point.
> What you describe above would change the effect of what MD_CLOSING does.
> We don't want to change it.  It is good.
> The problem is that when it has finished doing what it needs to do,
> we aren't clearing it. That is simply a bug.
>
> So something like this is needed.

Sorry again, I have another question in md_set_readonly() of md.c
Although it has stopped all md writes operations, and I still prefer
to do set_disk_ro after md_wakeup_thread. refer to following code.

cut the piece of code:
... ...
static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
{
... ...
         if (mddev->pers) {
                 __md_stop_writes(mddev);

                 err  = -ENXIO;
                 if (mddev->ro==1)
                         goto out;
                 mddev->ro = 1;
                 set_disk_ro(mddev->gendisk, 1);         ## --> I prefer 
to do it after md_wakeup_thread.
                 clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                 md_wakeup_thread(mddev->thread);     ## --> do 
set_disk_ro after this step.
                 sysfs_notify_dirent_safe(mddev->sysfs_state);
                 err = 0;
         }
out:
         mutex_unlock(&mddev->open_mutex);
         return err;
}

Thanks,
-Zhilong
> NeilBrown
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 7e168ff1ae90..567aba246497 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6934,6 +6934,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>   	void __user *argp = (void __user *)arg;
>   	struct mddev *mddev = NULL;
>   	int ro;
> +	bool did_set_md_closing = false;
>   
>   	if (!md_ioctl_valid(cmd))
>   		return -ENOTTY;
> @@ -7023,7 +7024,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>   			err = -EBUSY;
>   			goto out;
>   		}
> +		WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
>   		set_bit(MD_CLOSING, &mddev->flags);
> +		did_set_md_closing = true;
>   		mutex_unlock(&mddev->open_mutex);
>   		sync_blockdev(bdev);
>   	}
> @@ -7216,6 +7219,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>   		mddev->hold_active = 0;
>   	mddev_unlock(mddev);
>   out:
> +	if (did_set_md_closing)
> +		clear_bit(MD_CLOSING, &mddev->flags);
>   	return err;
>   }
>   #ifdef CONFIG_COMPAT


^ permalink raw reply

* Re: mdadm: one question about the readonly and readwrite feature
From: NeilBrown @ 2017-03-23  7:06 UTC (permalink / raw)
  To: Zhilong Liu, Guoqing Jiang, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <600b22b5-762e-da9c-0d20-9023de4361b3@suse.com>

[-- Attachment #1: Type: text/plain, Size: 4657 bytes --]

On Thu, Mar 23 2017, Zhilong Liu wrote:

> On 03/23/2017 11:42 AM, NeilBrown wrote:
>> On Thu, Mar 23 2017, Guoqing Jiang wrote:
>>
>>> On 03/23/2017 05:55 AM, NeilBrown wrote:
>>>> On Wed, Mar 22 2017, Zhilong Liu wrote:
>>>>
>>>>> Hi, Neil;
>>>>>
>>>>>      Excuse me, according to read 'mdadm/tests/ToTest', I'm a little
>>>>> confused about "readonly"
>>>>> and "readwrite" feature, and I've no idea how to fix it. Thus I report
>>>>> this question and I'm sorry
>>>>> for this long description email.
>>>>>
>>>>> relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491
>>>>>
>>>>>      My question: If the array has been set the MD_CLOSING flag, although
>>>>> hasn't removed the sysfs
>>>>> folder because sysfs_remove_group() wasn't invoked, and now, how should
>>>>> mdadm continue to
>>>>> control this 'readonly' array?
>>>> MD_CLOSING should only be set for a short period or time to avoid
>>>> certain races.  After the operation that set it completes, it should be
>>>> cleared.
>>>> It looks like this is a bug that was introduced in
>>>> Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
>>>> when MD_STILL_CLOSED was renamed to MD_CLOSING.
>>> I guess it is because we set MD_CLOSING for STOP_ARRAY_RO cmd, then commit
>>> af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") did below changes:
>>>
>>> @@ -7075,9 +7073,13 @@ static int md_open(struct block_device *bdev,
>>> fmode_t mode)
>>>           if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
>>>                   goto out;
>>>
>>> +       if (test_bit(MD_CLOSING, &mddev->flags)) {
>>> +               mutex_unlock(&mddev->open_mutex);
>>> +               return -ENODEV;
>>> +       }
>>>
>>> Maybe we need to differentiate "STOP_ARRAY" and "STOP_ARRAY_RO", or
>>> revert above
>>> changes.
>>>
>> No, your missing the point.
>> What you describe above would change the effect of what MD_CLOSING does.
>> We don't want to change it.  It is good.
>> The problem is that when it has finished doing what it needs to do,
>> we aren't clearing it. That is simply a bug.
>>
>> So something like this is needed.
>
> Sorry again, I have another question in md_set_readonly() of md.c
> Although it has stopped all md writes operations, and I still prefer
> to do set_disk_ro after md_wakeup_thread. refer to following code.

Why?
How do the actions performed by set_disk_ro() interact with the actions
performed by md_wakeup_thread()?

I'm not saying the code shouldn't be changed, just that there needs to
be a clear explanation of the benefit.

NeilBrown


>
> cut the piece of code:
> ... ...
> static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> {
> ... ...
>          if (mddev->pers) {
>                  __md_stop_writes(mddev);
>
>                  err  = -ENXIO;
>                  if (mddev->ro==1)
>                          goto out;
>                  mddev->ro = 1;
>                  set_disk_ro(mddev->gendisk, 1);         ## --> I prefer 
> to do it after md_wakeup_thread.
>                  clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>                  set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>                  md_wakeup_thread(mddev->thread);     ## --> do 
> set_disk_ro after this step.
>                  sysfs_notify_dirent_safe(mddev->sysfs_state);
>                  err = 0;
>          }
> out:
>          mutex_unlock(&mddev->open_mutex);
>          return err;
> }
>
> Thanks,
> -Zhilong
>> NeilBrown
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 7e168ff1ae90..567aba246497 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -6934,6 +6934,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>>   	void __user *argp = (void __user *)arg;
>>   	struct mddev *mddev = NULL;
>>   	int ro;
>> +	bool did_set_md_closing = false;
>>   
>>   	if (!md_ioctl_valid(cmd))
>>   		return -ENOTTY;
>> @@ -7023,7 +7024,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>>   			err = -EBUSY;
>>   			goto out;
>>   		}
>> +		WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
>>   		set_bit(MD_CLOSING, &mddev->flags);
>> +		did_set_md_closing = true;
>>   		mutex_unlock(&mddev->open_mutex);
>>   		sync_blockdev(bdev);
>>   	}
>> @@ -7216,6 +7219,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>>   		mddev->hold_active = 0;
>>   	mddev_unlock(mddev);
>>   out:
>> +	if (did_set_md_closing)
>> +		clear_bit(MD_CLOSING, &mddev->flags);
>>   	return err;
>>   }
>>   #ifdef CONFIG_COMPAT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: mdadm: one question about the readonly and readwrite feature
From: Zhilong Liu @ 2017-03-23  8:14 UTC (permalink / raw)
  To: NeilBrown, Guoqing Jiang, Jes Sorensen; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <87poh8a2vz.fsf@notabene.neil.brown.name>


On 03/23/2017 03:06 PM, NeilBrown wrote:
> On Thu, Mar 23 2017, Zhilong Liu wrote:
>
>> On 03/23/2017 11:42 AM, NeilBrown wrote:
>>> On Thu, Mar 23 2017, Guoqing Jiang wrote:
>>>
>>>> On 03/23/2017 05:55 AM, NeilBrown wrote:
>>>>> On Wed, Mar 22 2017, Zhilong Liu wrote:
>>>>>
>>>>>> Hi, Neil;
>>>>>>
>>>>>>       Excuse me, according to read 'mdadm/tests/ToTest', I'm a little
>>>>>> confused about "readonly"
>>>>>> and "readwrite" feature, and I've no idea how to fix it. Thus I report
>>>>>> this question and I'm sorry
>>>>>> for this long description email.
>>>>>>
>>>>>> relevant linux/driver/md commit: 260fa034ef7a4ff8b73068b48ac497edd5217491
>>>>>>
>>>>>>       My question: If the array has been set the MD_CLOSING flag, although
>>>>>> hasn't removed the sysfs
>>>>>> folder because sysfs_remove_group() wasn't invoked, and now, how should
>>>>>> mdadm continue to
>>>>>> control this 'readonly' array?
>>>>> MD_CLOSING should only be set for a short period or time to avoid
>>>>> certain races.  After the operation that set it completes, it should be
>>>>> cleared.
>>>>> It looks like this is a bug that was introduced in
>>>>> Commit: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
>>>>> when MD_STILL_CLOSED was renamed to MD_CLOSING.
>>>> I guess it is because we set MD_CLOSING for STOP_ARRAY_RO cmd, then commit
>>>> af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") did below changes:
>>>>
>>>> @@ -7075,9 +7073,13 @@ static int md_open(struct block_device *bdev,
>>>> fmode_t mode)
>>>>            if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
>>>>                    goto out;
>>>>
>>>> +       if (test_bit(MD_CLOSING, &mddev->flags)) {
>>>> +               mutex_unlock(&mddev->open_mutex);
>>>> +               return -ENODEV;
>>>> +       }
>>>>
>>>> Maybe we need to differentiate "STOP_ARRAY" and "STOP_ARRAY_RO", or
>>>> revert above
>>>> changes.
>>>>
>>> No, your missing the point.
>>> What you describe above would change the effect of what MD_CLOSING does.
>>> We don't want to change it.  It is good.
>>> The problem is that when it has finished doing what it needs to do,
>>> we aren't clearing it. That is simply a bug.
>>>
>>> So something like this is needed.
>> Sorry again, I have another question in md_set_readonly() of md.c
>> Although it has stopped all md writes operations, and I still prefer
>> to do set_disk_ro after md_wakeup_thread. refer to following code.
> Why?
> How do the actions performed by set_disk_ro() interact with the actions
> performed by md_wakeup_thread()?
>
> I'm not saying the code shouldn't be changed, just that there needs to
> be a clear explanation of the benefit.

here just according to my understanding for readonly code path, welcome
the correction in my comments if I misunderstand this feature, :-).
... ...
static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
{
         int err = 0;
         int did_freeze = 0;

         if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
                 did_freeze = 1;
                 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);  --> 
here has set bit as FROZEN
                 md_wakeup_thread(mddev->thread);
         }
... ...

         if (mddev->pers) {
                 __md_stop_writes(mddev);
/*
  *  set FROZEN again, maybe can use test_and_set_bit(FROZEN) better, it 
doesn't matter.
  *  it flushed and synced all data, bitmap and superblock to array.
  */
                 err  = -ENXIO;
                 if (mddev->ro==1)
                         goto out;
                 mddev->ro = 1;
/*
  *  Here, I means that set_disk_ro until the daemon thread has 
completed all operations
  *  include of sync and recovery progress. set_disk_ro when the array 
has cleaned totally,
  *  then it would be safer.
  *  I'm not sure MD_RECOVERY_NEEDED would be running once the array has 
set_disk_ro,
  *  actually I don't know how to test this scenario, thus asked this 
question.
  */
                 set_disk_ro(mddev->gendisk, 1);
                 clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                 md_wakeup_thread(mddev->thread);
                 sysfs_notify_dirent_safe(mddev->sysfs_state);

Thanks very much,
-Zhilong
> NeilBrown
>
>
>> cut the piece of code:
>> ... ...
>> static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
>> {
>> ... ...
>>           if (mddev->pers) {
>>                   __md_stop_writes(mddev);
>>
>>                   err  = -ENXIO;
>>                   if (mddev->ro==1)
>>                           goto out;
>>                   mddev->ro = 1;
>>                   set_disk_ro(mddev->gendisk, 1);         ## --> I prefer
>> to do it after md_wakeup_thread.
>>                   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>                   set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                   md_wakeup_thread(mddev->thread);     ## --> do
>> set_disk_ro after this step.
>>                   sysfs_notify_dirent_safe(mddev->sysfs_state);
>>                   err = 0;
>>           }
>> out:
>>           mutex_unlock(&mddev->open_mutex);
>>           return err;
>> }
>>
>> Thanks,
>> -Zhilong
>>> NeilBrown
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 7e168ff1ae90..567aba246497 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -6934,6 +6934,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>>>    	void __user *argp = (void __user *)arg;
>>>    	struct mddev *mddev = NULL;
>>>    	int ro;
>>> +	bool did_set_md_closing = false;
>>>    
>>>    	if (!md_ioctl_valid(cmd))
>>>    		return -ENOTTY;
>>> @@ -7023,7 +7024,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>>>    			err = -EBUSY;
>>>    			goto out;
>>>    		}
>>> +		WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
>>>    		set_bit(MD_CLOSING, &mddev->flags);
>>> +		did_set_md_closing = true;
>>>    		mutex_unlock(&mddev->open_mutex);
>>>    		sync_blockdev(bdev);
>>>    	}
>>> @@ -7216,6 +7219,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>>>    		mddev->hold_active = 0;
>>>    	mddev_unlock(mddev);
>>>    out:
>>> +	if (did_set_md_closing)
>>> +		clear_bit(MD_CLOSING, &mddev->flags);
>>>    	return err;
>>>    }
>>>    #ifdef CONFIG_COMPAT


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox