* raid1 bitmap and multiple removed disks
From: Diego Guella @ 2016-11-23 8:50 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
(2nd attempt: the previous one didn't make it)
Hi,
I am using linux raid1 for a double-purpose: redundancy and backup.
I have a raid1 array of 5 disks, 3 of which are kept for backup purposes.
Let's call disks A, B, C, D, E.
Disks A and B are _always_ connected to the system.
Disks C, D, E are backup disks.
Here follows a description of how I use the backup disks.
This morning I connect disk C, and let it resync.
Tomorrow morning, I shut down the system, remove disk C and keep it away
as a daily backup.
I connect the next disk (D), then start up the system.
Linux raid1 recognizes the "old" disk and does not allow it to enter the
array (this is evidenced by system logs).
I then add disk D to the array, and let it resync.
The next day, I connect the next disk (E), and so on, rotating them.
The "connect and disconnect" is always performed when the system is
powered off, although sometimes I hot-connect the disk with the system
already powered up.
The purpose of this is to have an emergency backup: I can disconnect ALL
disks from the system and connect only one of the daily backups, going
"back to the past"(TM).
This array has a write-intent bitmap, in order to speed up the resync
(it is a 4TB array, and sometimes it needs nearly 20 hours to resync
without bitmaps due to system load).
This worked flawlessly (for some years) until some days ago, when the
array suffered a strange inconsistency, and the filesystem nearly gone
nuts in about 20 minutes of uptime. I will elaborate more on this later.
Since that problem happened, some questions come to my mind:
What raid1 bitmaps allow me to do?
Can they record _correctly_ the state of multiple removed disks, in
order to overwrite only out-of-sync chunks of multiple removed disks?
In other words, am I allowed to do what I described above?
If not, can I change something in my actions in order to have a daily
backup using raid1?
System details:
# cat /etc/debian_version
6.0.10
# mdadm --version
mdadm - v3.1.4 - 31st August 2010
^ permalink raw reply
* Bug#784070: Newly-created arrays don't auto-assemble - related to hostname change?
From: Michael Tokarev @ 2016-11-23 9:03 UTC (permalink / raw)
To: NeilBrown, 784070, Andy Smith; +Cc: linux-raid
In-Reply-To: <87wpfuc4ab.fsf@notabene.neil.brown.name>
23.11.2016 05:34, NeilBrown wrote:
> On Tue, Nov 22 2016, Andy Smith wrote:
>
>> Hi Neil,
>>
>> On Tue, Nov 22, 2016 at 09:56:28AM +1100, NeilBrown wrote:
>>> Thanks. Sorry this is taking a lot of back-and-forth...
>>
>> No worries. This is very interesting to me and I'd also like to know
>> what is going wrong even if I have a work-around.
>
> Thanks.
> I tried this on a scratch Debian VM I had lying around, and found I
> could exactly reproduce your symptoms.
> I found that, unlike on the first Debian system I looked at, there is a
> line in /lib/udev/rules.d/64-md-raid-assembly.rules
>
> # Disable incremental assembly to fix Debian bug #784070
> GOTO="md_inc_end"
>
> Remove that and the problem goes away. Arrgg...
>
> I wish people maintainers would *ask* upstream when they don't
> understand, rather than just breaking things.
Neil, with all my respect, this is a bit over-reaction :)
It was long ago when we disabled incremental assembly when
you turned it on by default, and kept old static way to
assemble arrays, because neither our initrd nor regular
userpsace weren't ready for that. At the time jessie come
out, we didn't have enough time to sort it out, so we kept
it for jessie too.
After restoring your (upstream) rules, things WILL break in
other place. Someone already tried that and had to revert
it back to what we have now. Initrd should have some
initial infrastructure for event-based work before this will
be possible.
I wanted to fix it all for stretch. But once I had a conflict
with the d-i team I don't work on mdadm (or any other package
touching d-i) anymore.
Thanks,
/mjt
^ permalink raw reply
* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Guoqing Jiang @ 2016-11-23 9:05 UTC (permalink / raw)
To: Shaohua Li, Coly Li
Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn
In-Reply-To: <20161122213541.btgw4cpoly5j4jpc@kernel.org>
On 11/23/2016 05:35 AM, Shaohua Li wrote:
> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> I/O barriers to happen only inside a slidingresync window, for regular
>> I/Os out of this resync window they don't need to wait for barrier any
>> more. On large raid1 device, it helps a lot to improve parallel writing
>> I/O throughput when there are background resync I/Os performing at
>> same time.
>>
>> The idea of sliding resync widow is awesome, but there are several
>> challenges are very difficult to solve,
>> - code complexity
>> Sliding resync window requires several veriables to work collectively,
>> this is complexed and very hard to make it work correctly. Just grep
>> "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>> the original resync window patch. This is not the end, any further
>> related modification may easily introduce more regreassion.
>> - multiple sliding resync windows
>> Currently raid1 code only has a single sliding resync window, we cannot
>> do parallel resync with current I/O barrier implementation.
>> Implementing multiple resync windows are much more complexed, and very
>> hard to make it correctly.
>>
>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> removing resync window code, I believe life will be much easier.
>>
>> The brief idea of the simpler barrier is,
>> - Do not maintain a logbal unique resync window
>> - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>> I/O only has to wait for a resync I/O when both them have same barrier
>> bucket index, vice versa.
>> - I/O barrier can be recuded to an acceptable number if there are enought
>> barrier buckets
>>
>> Here I explain how the barrier buckets are designed,
>> - BARRIER_UNIT_SECTOR_SIZE
>> The whole LBA address space of a raid1 device is divided into multiple
>> barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>> Bio request won't go across border of barrier unit size, that means
>> maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>> - BARRIER_BUCKETS_NR
>> There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
>> hit different barrier units, they only need to compete I/O barrier with
>> other I/Os which hit the same barrier bucket index with each other. The
>> index of a barrier bucket which a bio should look for is calculated by
>> get_barrier_bucket_idx(),
>> (sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>> sector is the start sector number of a bio. align_to_barrier_unit_end()
>> will make sure the finall bio sent into generic_make_request() won't
>> exceed border of the barrier unit size.
>> - RRIER_BUCKETS_NR
>> Number of barrier buckets is defined by,
>> #define BARRIER_BUCKETS_NR (PAGE_SIZE/sizeof(long))
>> For 4KB page size, there are 512 buckets for each raid1 device. That
>> means the propobility of full random I/O barrier confliction may be
>> reduced down to 1/512.
> Thanks! The idea is awesome and does makes the code easier to understand.
Fully agree!
> Open question:
> - Need review from md clustring developer, I don't touch related code now.
> Don't think it matters, but please open eyes, Guoqing!
Thanks for reminding, I agree.
Anyway, I will try to comment it though I am sticking with lvm2 bugs
now and
run some tests with the two patches applied.
Thanks,
Guoqing
^ permalink raw reply
* Re: Bug#784070: Newly-created arrays don't auto-assemble - related to hostname change?
From: SOUBEYRAND Yann - externe @ 2016-11-23 9:09 UTC (permalink / raw)
To: neilb@suse.com, 784070@bugs.debian.org
Cc: andy@strugglers.net, linux-raid@vger.kernel.org
In-Reply-To: <87wpfuc4ab.fsf@notabene.neil.brown.name>
Le mercredi 23 novembre 2016 à 13:34 +1100, neilb@suse.com a écrit :
> On Tue, Nov 22 2016, Andy Smith wrote:
>
> > Hi Neil,
> >
> > On Tue, Nov 22, 2016 at 09:56:28AM +1100, NeilBrown wrote:
> >> Thanks. Sorry this is taking a lot of back-and-forth...
> >
> > No worries. This is very interesting to me and I'd also like to know
> > what is going wrong even if I have a work-around.
>
> Thanks.
> I tried this on a scratch Debian VM I had lying around, and found I
> could exactly reproduce your symptoms.
> I found that, unlike on the first Debian system I looked at, there is a
> line in /lib/udev/rules.d/64-md-raid-assembly.rules
>
> # Disable incremental assembly to fix Debian bug #784070
> GOTO="md_inc_end"
>
> Remove that and the problem goes away. Arrgg...
>
> I wish people maintainers would *ask* upstream when they don't
> understand, rather than just breaking things.
>
> NeilBrown
Hi Neil,
I encourage you to have a look at the full history of this bug to see
why this line was introduced. You will then see that it has been removed
in Stretch.
Regards
Yann
Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à l'intention exclusive des destinataires et les informations qui y figurent sont strictement confidentielles. Toute utilisation de ce Message non conforme à sa destination, toute diffusion ou toute publication totale ou partielle, est interdite sauf autorisation expresse.
Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si vous avez reçu ce Message par erreur, merci de le supprimer de votre système, ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support que ce soit. Nous vous remercions également d'en avertir immédiatement l'expéditeur par retour du message.
Il est impossible de garantir que les communications par messagerie électronique arrivent en temps utile, sont sécurisées ou dénuées de toute erreur ou virus.
____________________________________________________
This message and any attachments (the 'Message') are intended solely for the addressees. The information contained in this Message is confidential. Any use of information contained in this Message not in accord with its purpose, any dissemination or disclosure, either whole or partial, is prohibited except formal approval.
If you are not the addressee, you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete it and all copies from your system and notify the sender immediately by return message.
E-mail communication cannot be guaranteed to be timely secure, error or virus-free.
^ permalink raw reply
* Re: MD Remnants After --stop
From: Marc Smith @ 2016-11-23 15:21 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
In-Reply-To: <878tsbcbub.fsf@notabene.neil.brown.name>
On Tue, Nov 22, 2016 at 6:51 PM, NeilBrown <neilb@suse.com> wrote:
> On Wed, Nov 23 2016, Marc Smith wrote:
>
>> Hi,
>>
>> Sorry, I'm not trying to beat a dead horse here, but I do feel
>> something has changed... I just tested with Linux 4.5.2 and when
>> stopping an md array (with mdadm --stop) the entry in /sys/block/ is
>> removed, and even the /dev/mdXXX and /dev/md/name link are removed
>> properly.
>>
>> When testing with Linux 4.9-rc3, the entries in /sys/block/ still
>> remain (array_state attribute value is "clear") after using mdadm
>> --stop and the /dev/mdXXX device exists (the /dev/md/name link is
>> removed, by udev I assume).
>
> With the latest (git) mdadm, when events are reported by "udevadm monitor"??
>
> I only see remove events, and the entries from /dev and /sys are
> removed.
>
> If I could reproduce your problem, I would fix it...
On one set of hosts I can reliably reproduce this issue, and on
another system I could previously reproduce this, but now seems to be
working fine... I haven't found the connection (same distro / kernel
versions on all hosts).
# mdadm --version
mdadm - v3.4-100-g52a9408 - 26th October 2016
From 'mdadm --stop /dev/md/blah1' (non-clustered RAID0 array):
--snip--
# udevadm monitor -pku
calling: monitor
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent
KERNEL[32930.834312] change /devices/virtual/block/md126 (block)
ACTION=change
DEVNAME=/dev/md126
DEVPATH=/devices/virtual/block/md126
DEVTYPE=disk
MAJOR=9
MINOR=126
SEQNUM=3678
SUBSYSTEM=block
UDEV [32930.836032] change /devices/virtual/block/md126 (block)
ACTION=change
DEVNAME=/dev/md126
DEVPATH=/devices/virtual/block/md126
DEVTYPE=disk
MAJOR=9
MINOR=126
SEQNUM=3678
SUBSYSTEM=block
SYSTEMD_READY=0
USEC_INITIALIZED=843336612
--snip--
Kernel logs from that:
--snip--
[32928.465695] md126: detected capacity change from 146681102336 to 0
[32928.465699] md: md126 stopped.
[32928.465702] md: unbind<dm-3>
[32928.465798] udevd[499]: inotify event: 8 for /dev/md126
[32928.465964] udevd[499]: device /dev/md126 closed, synthesising 'change'
[32928.466029] udevd[499]: seq 3678 queued, 'change' 'block'
[32928.466129] udevd[499]: seq 3678 forked new worker [27035]
[32928.466357] udevd[27035]: seq 3678 running
[32928.466423] udevd[27035]: removing watch on '/dev/md126'
[32928.466492] udevd[27035]: IMPORT 'probe-bcache -o udev /dev/md126'
/usr/lib/udev/rules.d/69-bcache.rules:16
[32928.466712] udevd[27036]: starting 'probe-bcache -o udev /dev/md126'
[32928.467540] udevd[27035]: 'probe-bcache -o udev /dev/md126' [27036]
exit with return code 0
[32928.467564] udevd[27035]: update old name,
'/dev/disk/by-id/md-name-tgtnode1.parodyne.com:blah1' no longer
belonging to '/devices/virtual/block/md126'
[32928.470851] md: export_rdev(dm-3)
[32928.470920] md: unbind<dm-2>
[32928.478843] md: export_rdev(dm-2)
--snip--
From 'mdadm --stop /dev/md/asdf1' (clustered RAID1 array):
--snip--
# udevadm monitor -pku
calling: monitor
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent
KERNEL[34402.247229] change /devices/virtual/block/md127 (block)
ACTION=change
DEVNAME=/dev/md127
DEVPATH=/devices/virtual/block/md127
DEVTYPE=disk
MAJOR=9
MINOR=127
SEQNUM=3679
SUBSYSTEM=block
KERNEL[34402.247885] offline
/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=offline
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3680
SUBSYSTEM=dlm
UDEV [34402.248269] offline
/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=offline
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3680
SUBSYSTEM=dlm
USEC_INITIALIZED=402248230
KERNEL[34402.248841] remove
/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=remove
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3681
SUBSYSTEM=dlm
UDEV [34402.248899] change /devices/virtual/block/md127 (block)
ACTION=change
DEVNAME=/dev/md127
DEVPATH=/devices/virtual/block/md127
DEVTYPE=disk
MAJOR=9
MINOR=127
SEQNUM=3679
SUBSYSTEM=block
SYSTEMD_READY=0
USEC_INITIALIZED=1273670
UDEV [34402.248990] remove
/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
ACTION=remove
DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
SEQNUM=3681
SUBSYSTEM=dlm
USEC_INITIALIZED=2248905
--snip--
Kernel logs from that:
--snip--
[34399.753876] udevd[499]: inotify event: 8 for /dev/md127
[34399.765389] md127: detected capacity change from 73340747776 to 0
[34399.765393] md: md127 stopped.
[34399.765579] udevd[499]: device /dev/md127 closed, synthesising 'change'
[34399.765656] udevd[499]: seq 3679 queued, 'change' 'block'
[34399.765751] udevd[499]: seq 3679 forked new worker [6317]
[34399.765878] udevd[6317]: seq 3679 running
[34399.765943] udevd[6317]: removing watch on '/dev/md127'
[34399.766012] udevd[6317]: IMPORT 'probe-bcache -o udev /dev/md127'
/usr/lib/udev/rules.d/69-bcache.rules:16
[34399.766259] udevd[6318]: starting 'probe-bcache -o udev /dev/md127'
[34399.766295] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: leaving the
lockspace group...
[34399.766421] udevd[499]: seq 3680 queued, 'offline' 'dlm'
[34399.766549] udevd[499]: seq 3680 forked new worker [6319]
[34399.767080] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: group event done 0 0
[34399.767297] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e:
release_lockspace final free
[34399.767320] md: unbind<dm-1>
[34399.795574] md: export_rdev(dm-1)
[34399.795640] md: unbind<dm-0>
[34399.803565] md: export_rdev(dm-0)
--snip--
On the other machines where the md array stopped correctly (removing
the entries from /dev and /sys) I do see the 'remove' events with
"udevadm monitor". What produces those remove events? Is that
something directly from the mdadm tool, or indirectly as part of the
stop/tear-down that mdadm initiates?
--Marc
>
> NeilBrown
>
>>
>> Looks like Linux 4.9 is at rc6 now -- have there been any changes that
>> would correct this behavior? Or is this expected behavior with the
>> latest version? Not sure when this changed, but I did go back to 4.5.2
>> and confirmed everything is removed correctly in that version, not
>> sure if this is different starting in 4.9, or something between 4.5
>> and 4.9.
>>
>> Can anyone else confirm with Linux 4.9 that the /sys/block/mdXXX entry
>> lingers after using mdadm --stop? I suppose it could be some other
>> system that is causing this on my machines. I tested using the latest
>> from the master branch of mdadm and get the same result.
>>
>>
>> Thanks for your time and help.
>>
>> --Marc
>>
>>
>> On Mon, Nov 21, 2016 at 9:08 AM, Marc Smith <marc.smith@mcc.edu> wrote:
>>> On Sun, Nov 20, 2016 at 10:42 PM, NeilBrown <neilb@suse.com> wrote:
>>>> On Sat, Nov 19 2016, Marc Smith wrote:
>>>>
>>>>> On Mon, Nov 7, 2016 at 12:44 AM, NeilBrown <neilb@suse.com> wrote:
>>>>>> On Sat, Nov 05 2016, Marc Smith wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It may be that I've never noticed this before, so maybe its not a
>>>>>>> problem... after using '--stop' to deactivate/stop an MD array, there
>>>>>>> are remnants of it lingering, namely an entry in /sys/block (eg,
>>>>>>> /sys/block/md127) and the device node in /dev remains (eg,
>>>>>>> /dev/md127).
>>>>>>>
>>>>>>> Is this normal? Like I said, it probably is, and I've just never
>>>>>>> noticed it before. I assume its not going to hurt anything, but is
>>>>>>> there a way to clean it up, without rebooting? Obviously I could
>>>>>>> remove the /dev entry, but what about /sys/block?
>>>>>>>
>>>>>>
>>>>>> You can remove them both by running
>>>>>> mdadm -S /dev/md127
>>>>>>
>>>>>> but they'll probably just reappear again.
>>>>>>
>>>>>> This seems to be an on-going battle between md and udev. I've "fixed"
>>>>>> it at least once, but it keeps coming back.
>>>>>>
>>>>>> When md removes the md127 device, a message is sent to udev.
>>>>>> As part of its response to this message, udev tries to open /dev/md127.
>>>>>> Because of the rather unusual way that md devices are created (it made
>>>>>> sense nearly 20 years ago when it was designed), opening /dev/md127
>>>>>> causes md to create device md127 again.
>>>>>>
>>>>>> You could
>>>>>> mv /dev/md127 /dev/md127X
>>>>>> mdadm -S /dev/md127X
>>>>>> rm /dev/md127X
>>>>>> that stop udev from opening /dev/md127. It seems to work reliably.
>>>>>>
>>>>>> md used to generate a CHANGE event before the REMOVE event, and only the
>>>>>> CHANGE event caused udev to open the device file. I removed that and
>>>>>> the problem went away. Apparently some change has happened to udev and
>>>>>> now it opens the file in response to REMOVE as well.
>>>>>
>>>>> I used "udevadm monitor -pku" to watch the events when running "mdadm
>>>>> --stop /dev/md127" and this is what I see:
>>>>>
>>>>> --snip--
>>>>> KERNEL[163074.119778] change /devices/virtual/block/md127 (block)
>>>>> ACTION=change
>>>>> DEVNAME=/dev/md127
>>>>> DEVPATH=/devices/virtual/block/md127
>>>>> DEVTYPE=disk
>>>>> MAJOR=9
>>>>> MINOR=127
>>>>> SEQNUM=3701
>>>>> SUBSYSTEM=block
>>>>>
>>>>> UDEV [163074.121569] change /devices/virtual/block/md127 (block)
>>>>> ACTION=change
>>>>> DEVNAME=/dev/md127
>>>>> DEVPATH=/devices/virtual/block/md127
>>>>> DEVTYPE=disk
>>>>> MAJOR=9
>>>>> MINOR=127
>>>>> SEQNUM=3701
>>>>> SUBSYSTEM=block
>>>>> SYSTEMD_READY=0
>>>>> USEC_INITIALIZED=370470
>>>>> --snip--
>>>>>
>>>>> I don't see any 'remove' event generated. I should mention if I hadn't
>>>>> already that I'm testing md-cluster (--bitmap=clustered), and
>>>>> currently using Linux 4.9-rc3.
>>>>
>>>> What version of mdadm are you using?
>>>
>>> v3.4
>>>
>>>
>>>> You need one which contains
>>>> Commit: 229e66cb9689 ("Manage.c: Only issue change events for kernels older than 2.6.28")
>>>>
>>>> which hasn't made it into a release yet. But if you are playing with
>>>> md-cluster, I would guess you are using the latest from git...
>>>
>>> Wasn't, but I will now. Thanks.
>>>
>>> --Marc
>>>
>>>
>>>>
>>>> NeilBrown
^ permalink raw reply
* [PATCH] md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0
From: Song Liu @ 2016-11-23 20:17 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
With writeback cache, we define log space critical as
free_space < 2 * reclaim_required_space
So the deassert of R5C_LOG_CRITICAL could happen when
1. free_space increases
2. reclaim_required_space decreases
Currently, run_no_space_stripes() is called when 1 happens, but
not (always) when 2 happens.
With this patch, run_no_space_stripes() is call when
R5C_LOG_CRITICAL is cleared.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5-cache.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8cb79fc..0e24aab 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -370,6 +370,7 @@ static inline void r5c_update_log_state(struct r5l_log *log)
struct r5conf *conf = log->rdev->mddev->private;
sector_t free_space;
sector_t reclaim_space;
+ bool wake_reclaim = false;
if (!r5c_is_writeback(log))
return;
@@ -379,12 +380,18 @@ static inline void r5c_update_log_state(struct r5l_log *log)
reclaim_space = r5c_log_required_to_flush_cache(conf);
if (free_space < 2 * reclaim_space)
set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
- else
+ else {
+ if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state))
+ wake_reclaim = true;
clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+ }
if (free_space < 3 * reclaim_space)
set_bit(R5C_LOG_TIGHT, &conf->cache_state);
else
clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
+
+ if (wake_reclaim)
+ r5l_wake_reclaim(log, 0);
}
/*
@@ -1348,6 +1355,10 @@ static void r5c_do_reclaim(struct r5conf *conf)
spin_unlock(&conf->device_lock);
spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags);
}
+
+ if (!test_bit(R5C_LOG_CRITICAL, &conf->cache_state))
+ r5l_run_no_space_stripes(log);
+
md_wakeup_thread(conf->mddev->thread);
}
--
2.9.3
^ permalink raw reply related
* Re: Please help RAID1 complete fail no superblock
From: Wols Lists @ 2016-11-23 20:25 UTC (permalink / raw)
To: y k; +Cc: George Rapp, linux-raid@vger.kernel.org
In-Reply-To: <CAOLyU3WMJc+En=DbOW-n2uAXe7b1Gvo6jGCr0zLv5LumrxKJNQ@mail.gmail.com>
On 22/11/16 16:15, y k wrote:
> Thank you for the quick response,
>
> Attached are the outputs of fdisk -l for both drives.
>
> I read all the correspondence in the archived email thread "RAID10 with
> 2 drives
> auto-assembled as RAID1",
> but I don't understand what I should do.
>
> Could you please guide me?
> I'm afraid to execute any unfamiliar commands that might lower my
> chances of saving my data.
Basically, you need to run hexdump, and look for evidence of a damaged
superblock, and/or a filesystem superblock.
Hexdump will only read a disk, not do anything to it. That thread should
tell you what strings to look for to locate any superblocks, which are
probably on a 4K boundary.
If you find them, the other experts will be able, hopefully, to tell you
how to get the array back. What should happen is mdadm looks at the disk
for the partition table, then looks at the partitions for an array
superblock, then passes the array to linux to be mounted. Obviously,
something in that chain has been broken, and unfortunately you now have
to do that manually. Hexdump is your tool to examine the disk and look
for the information that linux should have found for you automatically.
In the next few days, I'm planning to add a page to the wiki explaining
how to use hexdump and find this information.
>
> Thank you so much!
>
> Yaniv
>
> On Tue, Nov 22, 2016 at 5:05 PM, Wols Lists <antlists@youngman.org.uk
> <mailto:antlists@youngman.org.uk>> wrote:
>
> On 22/11/16 14:49, George Rapp wrote:
> > On Tue, Nov 22, 2016 at 9:31 AM, theelectricengineer@gmail.com <mailto:theelectricengineer@gmail.com>
> > <theelectricengineer@gmail.com
> <mailto:theelectricengineer@gmail.com>> wrote:
> >> Hello good people of the linux-raid group,
> >>
> >> I really need your help with my RAID1 that has completely failed, on which I have photos of my beloved deceased grandparents.
> >>
> >> I had the RAID1 array for about 3 years, and I had replaced one of the drives when it failed about one year ago. All worked fine.
> >>
> >> A few weeks ago, I bought a new computer and wanted to move the array to the new computer.
> >> I read several guides, and thought that all I had to do was turn both computers off, move the drives, and in the new computer execute: mdadm --assemble --scan
> >> which I did.
> >>
> >> The output was: mdadm: /dev/sdb has no superblock - assembly aborted
> >>
> >> I panicked, moved the drives back, but the old RAID wouldn't start!
> >> The old computer too says that one of the drives has no superblock
> >> and, even worse, that the other is UNALLOCATED SPACE!!!
> >>
> >> I don't understand why the partition on the second drive disappeared, and I'm so worried.
> >>
> >> PLEASE, PLEASE help me...
> >>
> >> I read the wiki pages (raid wiki kernel), but I'm afraid to run any commands that might make things worse.
> >>
> >> I would be very happy if I could restore the data from either one of the drives,
> >> and copy it to the new drives in my new computer (as I should have done before moving the drives).
> >>
> >
> > Yadiv -
> >
> > One more piece of information that would be helpful is the partition
> > tables for each drive:
> >
> > # fdisk -l /dev/sdb /dev/sdc
> >
>
> And search the archive for the following thread - "RAID10 with 2 drives
> auto-assembled as RAID1". It's pretty recent.
>
> You've got a mirror, which means both of your drives should have a valid
> filesystem on them, so things look optimistic. It's just a case of
> finding it - that thread should help you look. The experts will chime
> in, but if you can find the string sequences they're looking for, it'll
> help you recover the partition(s). What filesystem were you using?
>
> Cheers,
> Wol
>
>
^ permalink raw reply
* [PATCH] RFC: dm: avoid the mutex lock in dm_bufio_shrink_count()
From: Mikulas Patocka @ 2016-11-23 20:42 UTC (permalink / raw)
To: David Rientjes
Cc: Douglas Anderson, Mike Snitzer, shli, Dmitry Torokhov,
linux-kernel, linux-raid, dm-devel, linux, Sonny Rao,
Alasdair Kergon
In-Reply-To: <alpine.DEB.2.10.1611171413450.99747@chino.kir.corp.google.com>
On Thu, 17 Nov 2016, David Rientjes wrote:
> On Thu, 17 Nov 2016, Douglas Anderson wrote:
>
> > diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> > index b3ba142e59a4..885ba5482d9f 100644
> > --- a/drivers/md/dm-bufio.c
> > +++ b/drivers/md/dm-bufio.c
> > @@ -89,6 +89,7 @@ struct dm_bufio_client {
> >
> > struct list_head lru[LIST_SIZE];
> > unsigned long n_buffers[LIST_SIZE];
> > + unsigned long n_all_buffers;
> >
> > struct block_device *bdev;
> > unsigned block_size;
> > @@ -485,6 +486,7 @@ static void __link_buffer(struct dm_buffer *b, sector_t block, int dirty)
> > struct dm_bufio_client *c = b->c;
> >
> > c->n_buffers[dirty]++;
> > + c->n_all_buffers++;
> > b->block = block;
> > b->list_mode = dirty;
> > list_add(&b->lru_list, &c->lru[dirty]);
> > @@ -502,6 +504,7 @@ static void __unlink_buffer(struct dm_buffer *b)
> > BUG_ON(!c->n_buffers[b->list_mode]);
> >
> > c->n_buffers[b->list_mode]--;
> > + c->n_all_buffers--;
> > __remove(b->c, b);
> > list_del(&b->lru_list);
> > }
> > @@ -515,6 +518,7 @@ static void __relink_lru(struct dm_buffer *b, int dirty)
> >
> > BUG_ON(!c->n_buffers[b->list_mode]);
> >
> > + /* NOTE: don't update n_all_buffers: -1 + 1 = 0 */
> > c->n_buffers[b->list_mode]--;
> > c->n_buffers[dirty]++;
> > b->list_mode = dirty;
> > @@ -1588,17 +1592,10 @@ static unsigned long
> > dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > struct dm_bufio_client *c;
> > - unsigned long count;
> >
> > c = container_of(shrink, struct dm_bufio_client, shrinker);
> > - if (sc->gfp_mask & __GFP_FS)
> > - dm_bufio_lock(c);
> > - else if (!dm_bufio_trylock(c))
> > - return 0;
> >
> > - count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
> > - dm_bufio_unlock(c);
> > - return count;
> > + return c->n_all_buffers;
> > }
> >
> > /*
>
> Would be better to just avoid taking the mutex at all and returning
> c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY] with a comment that
> the estimate might be wrong, but the actual count may vary between
> ->count_objects() and ->scan_objects() anyway, so we don't actually care?
Yes - here I'm sending a patch that reads c->n_buffers without the lock.
From: Mikulas Patocka <mpatocka@redhat.com>
dm-bufio: don't take the lock in dm_bufio_shrink_count
dm_bufio_shrink_count is called from do_shrink_slab to find out how many
freeable objects are there. The reported value doesn't have to be precise,
so we don't need to take the dm-bufio lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/md/dm-bufio.c | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c
+++ linux-2.6/drivers/md/dm-bufio.c
@@ -1587,18 +1587,9 @@ dm_bufio_shrink_scan(struct shrinker *sh
static unsigned long
dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
- struct dm_bufio_client *c;
- unsigned long count;
+ struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
- c = container_of(shrink, struct dm_bufio_client, shrinker);
- if (sc->gfp_mask & __GFP_FS)
- dm_bufio_lock(c);
- else if (!dm_bufio_trylock(c))
- return 0;
-
- count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
- dm_bufio_unlock(c);
- return count;
+ return ACCESS_ONCE(c->n_buffers[LIST_CLEAN]) + ACCESS_ONCE(c->n_buffers[LIST_DIRTY]);
}
/*
^ permalink raw reply
* [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Mikulas Patocka @ 2016-11-23 20:57 UTC (permalink / raw)
To: Douglas Anderson
Cc: Alasdair Kergon, Mike Snitzer, shli, Dmitry Torokhov,
linux-kernel, linux-raid, dm-devel, David Rientjes, Sonny Rao,
linux
In-Reply-To: <1479410660-31408-1-git-send-email-dianders@chromium.org>
Hi
The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
allocation doesn't. Your patch would incorrectly reuse buffers in a
situation when the memory is filled with clean cached pages.
Here I'm proposing an alternate patch that first tries GFP_NOWAIT
allocation, then drops the lock and tries GFP_NOIO allocation.
Note that the root cause why you are seeing this stacktrace is, that your
block device is congested - i.e. there are too many requests in the
device's queue - and note that fixing this wait won't fix the root cause
(congested device).
The congestion limits are set in blk_queue_congestion_threshold to 7/8 to
13/16 size of the nr_requests value.
If you don't want your device to report the congested status, you can
increase /sys/block/<device>/queue/nr_requests - you should test if your
chromebook is faster of slower with this setting increased. But note that
this setting won't increase the IO-per-second of the device.
Mikulas
On Thu, 17 Nov 2016, Douglas Anderson wrote:
> We've seen in-field reports showing _lots_ (18 in one case, 41 in
> another) of tasks all sitting there blocked on:
>
> mutex_lock+0x4c/0x68
> dm_bufio_shrink_count+0x38/0x78
> shrink_slab.part.54.constprop.65+0x100/0x464
> shrink_zone+0xa8/0x198
>
> In the two cases analyzed, we see one task that looks like this:
>
> Workqueue: kverityd verity_prefetch_io
>
> __switch_to+0x9c/0xa8
> __schedule+0x440/0x6d8
> schedule+0x94/0xb4
> schedule_timeout+0x204/0x27c
> schedule_timeout_uninterruptible+0x44/0x50
> wait_iff_congested+0x9c/0x1f0
> shrink_inactive_list+0x3a0/0x4cc
> shrink_lruvec+0x418/0x5cc
> shrink_zone+0x88/0x198
> try_to_free_pages+0x51c/0x588
> __alloc_pages_nodemask+0x648/0xa88
> __get_free_pages+0x34/0x7c
> alloc_buffer+0xa4/0x144
> __bufio_new+0x84/0x278
> dm_bufio_prefetch+0x9c/0x154
> verity_prefetch_io+0xe8/0x10c
> process_one_work+0x240/0x424
> worker_thread+0x2fc/0x424
> kthread+0x10c/0x114
>
> ...and that looks to be the one holding the mutex.
>
> The problem has been reproduced on fairly easily:
> 0. Be running Chrome OS w/ verity enabled on the root filesystem
> 1. Pick test patch: http://crosreview.com/412360
> 2. Install launchBalloons.sh and balloon.arm from
> http://crbug.com/468342
> ...that's just a memory stress test app.
> 3. On a 4GB rk3399 machine, run
> nice ./launchBalloons.sh 4 900 100000
> ...that tries to eat 4 * 900 MB of memory and keep accessing.
> 4. Login to the Chrome web browser and restore many tabs
>
> With that, I've seen printouts like:
> DOUG: long bufio 90758 ms
> ...and stack trace always show's we're in dm_bufio_prefetch().
>
> The problem is that we try to allocate memory with GFP_NOIO while
> we're holding the dm_bufio lock. Instead we should be using
> GFP_NOWAIT. Using GFP_NOIO can cause us to sleep while holding the
> lock and that causes the above problems.
>
> The current behavior explained by David Rientjes:
>
> It will still try reclaim initially because __GFP_WAIT (or
> __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO. This is the cause of
> contention on dm_bufio_lock() that the thread holds. You want to
> pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
> mutex that can be contended by a concurrent slab shrinker (if
> count_objects didn't use a trylock, this pattern would trivially
> deadlock).
>
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> Note that this change was developed and tested against the Chrome OS
> 4.4 kernel tree, not mainline. Due to slight differences in verity
> between mainline and Chrome OS it became too difficult to reproduce my
> testing setup on mainline. This patch still seems correct and
> relevant to upstream, so I'm posting it. If this is not acceptible to
> you then please ignore this patch.
>
> Also note that when I tested the Chrome OS 3.14 kernel tree I couldn't
> reproduce the long delays described in the patch. Presumably
> something changed in either the kernel config or the memory management
> code between the two kernel versions that made this crop up. In a
> similar vein, it is possible that problems described in this patch are
> no longer reproducible upstream. However, the arguments made in this
> patch (that we don't want to block while holding the mutex) still
> apply so I think the patch may still have merit.
>
> drivers/md/dm-bufio.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index b3ba142e59a4..3c767399cc59 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> * dm-bufio is resistant to allocation failures (it just keeps
> * one buffer reserved in cases all the allocations fail).
> * So set flags to not try too hard:
> - * GFP_NOIO: don't recurse into the I/O layer
> + * GFP_NOWAIT: don't wait; if we need to sleep we'll release our
> + * mutex and wait ourselves.
> * __GFP_NORETRY: don't retry and rather return failure
> * __GFP_NOMEMALLOC: don't use emergency reserves
> * __GFP_NOWARN: don't print a warning in case of failure
> @@ -837,7 +838,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> */
> while (1) {
> if (dm_bufio_cache_size_latch != 1) {
> - b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> + b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY |
> + __GFP_NOMEMALLOC | __GFP_NOWARN);
> if (b)
> return b;
> }
> --
> 2.8.0.rc3.226.g39d4020
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>
From: Mikulas Patocka <mpatocka@redhat.com>
Subject: dm-bufio: drop the lock when doing GFP_NOIO alloaction
Drop the lock when doing GFP_NOIO alloaction beacuse the allocation can
take some time.
Note that we won't do GFP_NOIO allocation when we loop for the second
time, because the lock shouldn't be dropped between __wait_for_free_buffer
and __get_unclaimed_buffer.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/md/dm-bufio.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c
+++ linux-2.6/drivers/md/dm-bufio.c
@@ -822,11 +822,13 @@ enum new_flag {
static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
{
struct dm_buffer *b;
+ bool tried_noio_alloc = false;
/*
* dm-bufio is resistant to allocation failures (it just keeps
* one buffer reserved in cases all the allocations fail).
* So set flags to not try too hard:
+ * GFP_NOWAIT: don't sleep and don't release cache
* GFP_NOIO: don't recurse into the I/O layer
* __GFP_NORETRY: don't retry and rather return failure
* __GFP_NOMEMALLOC: don't use emergency reserves
@@ -837,7 +839,7 @@ static struct dm_buffer *__alloc_buffer_
*/
while (1) {
if (dm_bufio_cache_size_latch != 1) {
- b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (b)
return b;
}
@@ -845,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_
if (nf == NF_PREFETCH)
return NULL;
+ if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
+ dm_bufio_unlock(c);
+ b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ dm_bufio_lock(c);
+ if (b)
+ return b;
+ tried_noio_alloc = true;
+ }
+
if (!list_empty(&c->reserved_buffers)) {
b = list_entry(c->reserved_buffers.next,
struct dm_buffer, lru_list);
^ permalink raw reply
* [PATCH v3] md/r5cache: handle alloc_page failure
From: Song Liu @ 2016-11-23 21:20 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.
To handle alloc_page() failures, this patch adds an extra page to
disk_info. When alloc_page fails, handle_stripe() trys to use these
pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
the stripe is added to delayed_list.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5-cache.c | 28 ++++++++++++++++++++++
drivers/md/raid5.c | 61 +++++++++++++++++++++++++++++++++++++++++-------
drivers/md/raid5.h | 6 +++++
3 files changed, 86 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8cb79fc..3817d2b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2334,15 +2334,43 @@ int r5c_try_caching_write(struct r5conf *conf,
*/
void r5c_release_extra_page(struct stripe_head *sh)
{
+ struct r5conf *conf = sh->raid_conf;
int i;
+ bool using_disk_info_extra_page;
+
+ using_disk_info_extra_page =
+ sh->dev[0].orig_page == conf->disks[0].extra_page;
for (i = sh->disks; i--; )
if (sh->dev[i].page != sh->dev[i].orig_page) {
struct page *p = sh->dev[i].orig_page;
sh->dev[i].orig_page = sh->dev[i].page;
+ if (!using_disk_info_extra_page)
+ put_page(p);
+ }
+
+ if (using_disk_info_extra_page)
+ clear_bit(R5C_EXTRA_PAGE_IN_USE, &conf->cache_state);
+}
+
+void r5c_use_extra_page(struct stripe_head *sh)
+{
+ struct r5conf *conf = sh->raid_conf;
+ int i;
+ struct r5dev *dev;
+ struct page *p;
+
+ for (i = sh->disks; i--; ) {
+ dev = &sh->dev[i];
+ if (dev->orig_page != dev->page) {
+ p = dev->orig_page;
+ dev->orig_page = dev->page;
put_page(p);
}
+ dev->orig_page = conf->disks[i].extra_page;
+ BUG_ON(!dev->orig_page);
+ }
}
/*
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index dbab8c7..7656530 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -876,6 +876,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
/* writing out phase */
+ if (s->waiting_extra_page)
+ return;
if (r5l_write_stripe(conf->log, sh) == 0)
return;
} else { /* caching phase */
@@ -2007,6 +2009,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
INIT_LIST_HEAD(&sh->batch_list);
INIT_LIST_HEAD(&sh->lru);
INIT_LIST_HEAD(&sh->r5c);
+ INIT_LIST_HEAD(&sh->log_list);
atomic_set(&sh->count, 1);
sh->log_start = MaxSector;
for (i = 0; i < disks; i++) {
@@ -3580,10 +3583,10 @@ static void handle_stripe_clean_event(struct r5conf *conf,
break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
}
-static void handle_stripe_dirtying(struct r5conf *conf,
- struct stripe_head *sh,
- struct stripe_head_state *s,
- int disks)
+static int handle_stripe_dirtying(struct r5conf *conf,
+ struct stripe_head *sh,
+ struct stripe_head_state *s,
+ int disks)
{
int rmw = 0, rcw = 0, i;
sector_t recovery_cp = conf->mddev->recovery_cp;
@@ -3649,12 +3652,32 @@ static void handle_stripe_dirtying(struct r5conf *conf,
dev->page == dev->orig_page &&
!test_bit(R5_LOCKED, &sh->dev[sh->pd_idx].flags)) {
/* alloc page for prexor */
- dev->orig_page = alloc_page(GFP_NOIO);
+ struct page *p = alloc_page(GFP_NOIO);
+
+ if (p) {
+ dev->orig_page = p;
+ continue;
+ }
- /* will handle failure in a later patch*/
- BUG_ON(!dev->orig_page);
+ /*
+ * alloc_page() failed, try use
+ * disk_info->extra_page
+ */
+ if (!test_and_set_bit(R5C_EXTRA_PAGE_IN_USE,
+ &conf->cache_state)) {
+ r5c_use_extra_page(sh);
+ break;
+ }
+
+ /* extra_page in use, add to delayed_list */
+ set_bit(STRIPE_DELAYED, &sh->state);
+ s->waiting_extra_page = 1;
+ return -EAGAIN;
}
+ }
+ for (i = disks; i--; ) {
+ struct r5dev *dev = &sh->dev[i];
if ((dev->towrite ||
i == sh->pd_idx || i == sh->qd_idx ||
test_bit(R5_InJournal, &dev->flags)) &&
@@ -3730,6 +3753,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
(s->locked == 0 && (rcw == 0 || rmw == 0) &&
!test_bit(STRIPE_BIT_DELAY, &sh->state)))
schedule_reconstruction(sh, s, rcw == 0, 0);
+ return 0;
}
static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
@@ -4545,8 +4569,12 @@ static void handle_stripe(struct stripe_head *sh)
if (ret == -EAGAIN ||
/* stripe under reclaim: !caching && injournal */
(!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
- s.injournal > 0))
- handle_stripe_dirtying(conf, sh, &s, disks);
+ s.injournal > 0)) {
+ ret = handle_stripe_dirtying(conf, sh, &s,
+ disks);
+ if (ret == -EAGAIN)
+ goto finish;
+ }
}
}
@@ -6458,6 +6486,8 @@ static void raid5_free_percpu(struct r5conf *conf)
static void free_conf(struct r5conf *conf)
{
+ int i;
+
if (conf->log)
r5l_exit_log(conf->log);
if (conf->shrinker.nr_deferred)
@@ -6466,6 +6496,12 @@ static void free_conf(struct r5conf *conf)
free_thread_groups(conf);
shrink_stripes(conf);
raid5_free_percpu(conf);
+ for (i = 0; i < conf->raid_disks; i++)
+ if (conf->disks[i].extra_page) {
+ put_page(conf->disks[i].extra_page);
+ conf->disks[i].extra_page = NULL;
+ }
+
kfree(conf->disks);
kfree(conf->stripe_hashtbl);
kfree(conf);
@@ -6612,9 +6648,16 @@ static struct r5conf *setup_conf(struct mddev *mddev)
conf->disks = kzalloc(max_disks * sizeof(struct disk_info),
GFP_KERNEL);
+
if (!conf->disks)
goto abort;
+ for (i = 0; i < conf->raid_disks; i++) {
+ conf->disks[i].extra_page = alloc_page(GFP_KERNEL);
+ if (!conf->disks[i].extra_page)
+ goto abort;
+ }
+
conf->mddev = mddev;
if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d13fe45..ed8e136 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -276,6 +276,7 @@ struct stripe_head_state {
struct md_rdev *blocked_rdev;
int handle_bad_blocks;
int log_failed;
+ int waiting_extra_page;
};
/* Flags for struct r5dev.flags */
@@ -439,6 +440,7 @@ enum {
struct disk_info {
struct md_rdev *rdev, *replacement;
+ struct page *extra_page; /* extra page to use in prexor */
};
/*
@@ -559,6 +561,9 @@ enum r5_cache_state {
* only process stripes that are already
* occupying the log
*/
+ R5C_EXTRA_PAGE_IN_USE, /* a stripe is using disk_info.extra_page
+ * for prexor
+ */
};
struct r5conf {
@@ -765,6 +770,7 @@ extern void
r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_release_extra_page(struct stripe_head *sh);
+extern void r5c_use_extra_page(struct stripe_head *sh);
extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
extern void r5c_handle_cached_data_endio(struct r5conf *conf,
struct stripe_head *sh, int disks, struct bio_list *return_bi);
--
2.9.3
^ permalink raw reply related
* Re: Please help RAID1 complete fail no superblock
From: YK @ 2016-11-23 21:47 UTC (permalink / raw)
To: Wols Lists; +Cc: George Rapp, linux-raid@vger.kernel.org, Phil Turmel
In-Reply-To: <5835FB44.2080102@youngman.org.uk>
On 11/23/2016 10:25 PM, Wols Lists wrote:
> Basically, you need to run hexdump, and look for evidence of a damaged
> superblock, and/or a filesystem superblock.
>
> That thread should tell you what strings to look for to locate any superblocks, which are
> probably on a 4K boundary.
Thank you for the explanation.
How do I tell the difference between a filesystem superblock and a
damanged one in the hexdump output?
I found in stackexchange instructions to look for "magic number 0x53EF",
which I then realized was searched by the egrep command in the other
thread you mentioned.
> If you find them, the other experts will be able, hopefully, to tell you
> how to get the array back
I do see "53 ef" in the fifth line of the output for my sdb1 hexdump !
I'll greatly appreciate it if someone could instruct me in how to mount
my sdb1 partition.
I see in some guide online, that one could mount a lost partition with a
command like:
mount /dev/sdd -o ro,offset=2097152 /mnt/test/
How did they calculate that offset number?
And is there anything else I should do to recover the partition before
or after mount?
I have a new HDD to which I could copy all the data, if I manage to
mount the partition.
So, I could create a new array on the old drives, and then copy the data
back.
^ permalink raw reply
* Re: MD Remnants After --stop
From: NeilBrown @ 2016-11-23 23:38 UTC (permalink / raw)
To: Marc Smith; +Cc: linux-raid
In-Reply-To: <CAHkw+Ldv+-uXGCQ+tKfkraZ3VF4V8sk203+Oopm7RvygGQQYGQ@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 7582 bytes --]
On Thu, Nov 24 2016, Marc Smith wrote:
> On Tue, Nov 22, 2016 at 6:51 PM, NeilBrown <neilb@suse.com> wrote:
>> On Wed, Nov 23 2016, Marc Smith wrote:
>>
>>> Hi,
>>>
>>> Sorry, I'm not trying to beat a dead horse here, but I do feel
>>> something has changed... I just tested with Linux 4.5.2 and when
>>> stopping an md array (with mdadm --stop) the entry in /sys/block/ is
>>> removed, and even the /dev/mdXXX and /dev/md/name link are removed
>>> properly.
>>>
>>> When testing with Linux 4.9-rc3, the entries in /sys/block/ still
>>> remain (array_state attribute value is "clear") after using mdadm
>>> --stop and the /dev/mdXXX device exists (the /dev/md/name link is
>>> removed, by udev I assume).
>>
>> With the latest (git) mdadm, when events are reported by "udevadm monitor"??
>>
>> I only see remove events, and the entries from /dev and /sys are
>> removed.
>>
>> If I could reproduce your problem, I would fix it...
>
> On one set of hosts I can reliably reproduce this issue, and on
> another system I could previously reproduce this, but now seems to be
> working fine... I haven't found the connection (same distro / kernel
> versions on all hosts).
>
> # mdadm --version
> mdadm - v3.4-100-g52a9408 - 26th October 2016
>
>
> From 'mdadm --stop /dev/md/blah1' (non-clustered RAID0 array):
>
> --snip--
> # udevadm monitor -pku
> calling: monitor
> monitor will print the received events for:
> UDEV - the event which udev sends out after rule processing
> KERNEL - the kernel uevent
>
> KERNEL[32930.834312] change /devices/virtual/block/md126 (block)
> ACTION=change
> DEVNAME=/dev/md126
> DEVPATH=/devices/virtual/block/md126
> DEVTYPE=disk
> MAJOR=9
> MINOR=126
> SEQNUM=3678
> SUBSYSTEM=block
>
> UDEV [32930.836032] change /devices/virtual/block/md126 (block)
> ACTION=change
> DEVNAME=/dev/md126
> DEVPATH=/devices/virtual/block/md126
> DEVTYPE=disk
> MAJOR=9
> MINOR=126
> SEQNUM=3678
> SUBSYSTEM=block
> SYSTEMD_READY=0
> USEC_INITIALIZED=843336612
Using the recent version of mdadm, and a kernel newer than 4.6 (commit
399146b8) a CHANGE event should only be generated when the array is
started. To see a CHANGE of stop, you must have an older mdadm or and
older kernel.... unless something else is synthesizing a CHANGE event.
When using a clustered array, you can also see a CHANGE event when a new
disk is being added to an array.
> --snip--
>
> Kernel logs from that:
> --snip--
> [32928.465695] md126: detected capacity change from 146681102336 to 0
> [32928.465699] md: md126 stopped.
> [32928.465702] md: unbind<dm-3>
> [32928.465798] udevd[499]: inotify event: 8 for /dev/md126
> [32928.465964] udevd[499]: device /dev/md126 closed, synthesising 'change'
> [32928.466029] udevd[499]: seq 3678 queued, 'change' 'block'
> [32928.466129] udevd[499]: seq 3678 forked new worker [27035]
> [32928.466357] udevd[27035]: seq 3678 running
> [32928.466423] udevd[27035]: removing watch on '/dev/md126'
> [32928.466492] udevd[27035]: IMPORT 'probe-bcache -o udev /dev/md126'
> /usr/lib/udev/rules.d/69-bcache.rules:16
> [32928.466712] udevd[27036]: starting 'probe-bcache -o udev /dev/md126'
> [32928.467540] udevd[27035]: 'probe-bcache -o udev /dev/md126' [27036]
> exit with return code 0
> [32928.467564] udevd[27035]: update old name,
> '/dev/disk/by-id/md-name-tgtnode1.parodyne.com:blah1' no longer
> belonging to '/devices/virtual/block/md126'
> [32928.470851] md: export_rdev(dm-3)
> [32928.470920] md: unbind<dm-2>
> [32928.478843] md: export_rdev(dm-2)
> --snip--
>
>
> From 'mdadm --stop /dev/md/asdf1' (clustered RAID1 array):
>
> --snip--
> # udevadm monitor -pku
> calling: monitor
> monitor will print the received events for:
> UDEV - the event which udev sends out after rule processing
> KERNEL - the kernel uevent
>
> KERNEL[34402.247229] change /devices/virtual/block/md127 (block)
> ACTION=change
> DEVNAME=/dev/md127
> DEVPATH=/devices/virtual/block/md127
> DEVTYPE=disk
> MAJOR=9
> MINOR=127
> SEQNUM=3679
> SUBSYSTEM=block
>
> KERNEL[34402.247885] offline
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=offline
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3680
> SUBSYSTEM=dlm
>
> UDEV [34402.248269] offline
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=offline
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3680
> SUBSYSTEM=dlm
> USEC_INITIALIZED=402248230
>
> KERNEL[34402.248841] remove
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=remove
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3681
> SUBSYSTEM=dlm
>
> UDEV [34402.248899] change /devices/virtual/block/md127 (block)
> ACTION=change
> DEVNAME=/dev/md127
> DEVPATH=/devices/virtual/block/md127
> DEVTYPE=disk
> MAJOR=9
> MINOR=127
> SEQNUM=3679
> SUBSYSTEM=block
> SYSTEMD_READY=0
> USEC_INITIALIZED=1273670
>
> UDEV [34402.248990] remove
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=remove
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3681
> SUBSYSTEM=dlm
> USEC_INITIALIZED=2248905
> --snip--
>
>
> Kernel logs from that:
> --snip--
> [34399.753876] udevd[499]: inotify event: 8 for /dev/md127
> [34399.765389] md127: detected capacity change from 73340747776 to 0
> [34399.765393] md: md127 stopped.
> [34399.765579] udevd[499]: device /dev/md127 closed, synthesising 'change'
That is weird. udev is synthesizing a CHANGE event. How nice of it
(not!).
It doesn't do this for dm, only other devices.
This happens when an open-for-write is closed.
If you edit mdopen.c and change the O_RDWR in open_mddev() to O_RDONLY,
this change will go away.
> [34399.765656] udevd[499]: seq 3679 queued, 'change' 'block'
> [34399.765751] udevd[499]: seq 3679 forked new worker [6317]
> [34399.765878] udevd[6317]: seq 3679 running
> [34399.765943] udevd[6317]: removing watch on '/dev/md127'
> [34399.766012] udevd[6317]: IMPORT 'probe-bcache -o udev /dev/md127'
> /usr/lib/udev/rules.d/69-bcache.rules:16
> [34399.766259] udevd[6318]: starting 'probe-bcache -o udev /dev/md127'
> [34399.766295] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: leaving the
> lockspace group...
> [34399.766421] udevd[499]: seq 3680 queued, 'offline' 'dlm'
> [34399.766549] udevd[499]: seq 3680 forked new worker [6319]
> [34399.767080] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: group event done 0 0
> [34399.767297] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e:
> release_lockspace final free
> [34399.767320] md: unbind<dm-1>
> [34399.795574] md: export_rdev(dm-1)
> [34399.795640] md: unbind<dm-0>
> [34399.803565] md: export_rdev(dm-0)
> --snip--
>
>
> On the other machines where the md array stopped correctly (removing
> the entries from /dev and /sys) I do see the 'remove' events with
> "udevadm monitor". What produces those remove events? Is that
> something directly from the mdadm tool, or indirectly as part of the
> stop/tear-down that mdadm initiates?
REMOVE events are generated by
md_free() -> del_gendisk() -> blk_unregister_queue()
which should happen when the md object is finally released.
If there is no CHANGE event, then nothing should stop this from
happening.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
* Re: [PATCH 1/2] raid5-cache: suspend reclaim thread instead of shutdown
From: NeilBrown @ 2016-11-24 0:00 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving
In-Reply-To: <20161123063017.GA41488@shli-mbp.local>
[-- Attachment #1: Type: text/plain, Size: 3821 bytes --]
On Wed, Nov 23 2016, Shaohua Li wrote:
> On Wed, Nov 23, 2016 at 01:41:45PM +1100, NeilBrown wrote:
>> On Tue, Nov 22 2016, Shaohua Li wrote:
>>
>> > There is mechanism to suspend a kernel thread. Use it instead of playing
>> > create/destroy game.
>>
>> Good idea!
>>
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> > drivers/md/md.c | 4 +++-
>> > drivers/md/raid5-cache.c | 18 +++++-------------
>> > 2 files changed, 8 insertions(+), 14 deletions(-)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index d3cef77..f548469 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -7136,10 +7136,12 @@ static int md_thread(void *arg)
>> > wait_event_interruptible_timeout
>> > (thread->wqueue,
>> > test_bit(THREAD_WAKEUP, &thread->flags)
>> > - || kthread_should_stop(),
>> > + || kthread_should_stop() || kthread_should_park(),
>> > thread->timeout);
>> >
>> > clear_bit(THREAD_WAKEUP, &thread->flags);
>> > + if (kthread_should_park())
>> > + kthread_parkme();
>> > if (!kthread_should_stop())
>> > thread->run(thread);
>> > }
>> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> > index 8cb79fc..5f817bd 100644
>> > --- a/drivers/md/raid5-cache.c
>> > +++ b/drivers/md/raid5-cache.c
>> > @@ -19,6 +19,7 @@
>> > #include <linux/raid/md_p.h>
>> > #include <linux/crc32c.h>
>> > #include <linux/random.h>
>> > +#include <linux/kthread.h>
>> > #include "md.h"
>> > #include "raid5.h"
>> > #include "bitmap.h"
>> > @@ -1437,23 +1438,14 @@ void r5l_quiesce(struct r5l_log *log, int state)
>> > struct mddev *mddev;
>> > if (!log || state == 2)
>> > return;
>> > - if (state == 0) {
>> > - /*
>> > - * This is a special case for hotadd. In suspend, the array has
>> > - * no journal. In resume, journal is initialized as well as the
>> > - * reclaim thread.
>> > - */
>> > - if (log->reclaim_thread)
>> > - return;
>> > - log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
>> > - log->rdev->mddev, "reclaim");
>> > - log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
>> > - } else if (state == 1) {
>> > + if (state == 0)
>> > + kthread_unpark(log->reclaim_thread->tsk);
>>
>> The old code tested for log->reclaim_thread being NULL. This new
>> version will just crash.
>
> But the reclaim_thread couldn't be NULL if log != NULL. Am I missing anything?
Yes, you are right. The old code had a test that the new code didn't,
which rang warning bells for me.
Both now that we don't de-register the thread in r5l_quiesce(),
log->reclaim_thread will never be NULL, so the test isn't needed.
>
>> > + else if (state == 1) {
>> > /* make sure r5l_write_super_and_discard_space exits */
>> > mddev = log->rdev->mddev;
>> > wake_up(&mddev->sb_wait);
>> > + kthread_park(log->reclaim_thread->tsk);
>>
>> r5l_do_reclaim has a wait loop internally. I think you need that to
>> abort when kthread_should_park(), else this will block indefinitely.
>
> Sounds not harmful to me. The loop in r5l_do_reclaim will eventually end if all
> data is reclaimed. Then the thread will be in the md_thread() loop. In that
> loop, the thread will not sleep because the wait checks kthread_should_park().
> Then the thread will get parked.
Maybe ... it just looks odd.
what is that while(1) {} loop really waiting for? It waits for there to
be more than reclaim_target work to do, or for all the _ios lists to be
empty. By the time r5l_quiesce() is called, all active stripes should
have drained, so I guess that will abort quickly.
Why is it waiting there, rather than in md_thread()? Why do we need
that loop?
Thanks,
NeilBrown
>
> Thanks,
> Shaohua
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
* Re: raid1 bitmap and multiple removed disks
From: NeilBrown @ 2016-11-24 0:26 UTC (permalink / raw)
To: Diego Guella; +Cc: linux-raid
In-Reply-To: <ce2ba842-22f0-eebf-856c-f07d38574489@deviltechnologies.com>
[-- Attachment #1: Type: text/plain, Size: 2801 bytes --]
On Wed, Nov 23 2016, Diego Guella wrote:
> (2nd attempt: the previous one didn't make it)
> Hi,
>
> I am using linux raid1 for a double-purpose: redundancy and backup.
>
> I have a raid1 array of 5 disks, 3 of which are kept for backup purposes.
> Let's call disks A, B, C, D, E.
> Disks A and B are _always_ connected to the system.
> Disks C, D, E are backup disks.
> Here follows a description of how I use the backup disks.
> This morning I connect disk C, and let it resync.
> Tomorrow morning, I shut down the system, remove disk C and keep it away
> as a daily backup.
> I connect the next disk (D), then start up the system.
> Linux raid1 recognizes the "old" disk and does not allow it to enter the
> array (this is evidenced by system logs).
> I then add disk D to the array, and let it resync.
So this would be a full resync - right?
> The next day, I connect the next disk (E), and so on, rotating them.
> The "connect and disconnect" is always performed when the system is
> powered off, although sometimes I hot-connect the disk with the system
> already powered up.
> The purpose of this is to have an emergency backup: I can disconnect ALL
> disks from the system and connect only one of the daily backups, going
> "back to the past"(TM).
>
> This array has a write-intent bitmap, in order to speed up the resync
> (it is a 4TB array, and sometimes it needs nearly 20 hours to resync
> without bitmaps due to system load).
>
> This worked flawlessly (for some years) until some days ago, when the
> array suffered a strange inconsistency, and the filesystem nearly gone
> nuts in about 20 minutes of uptime. I will elaborate more on this
> later.
Did you ever test your backups?
>
> Since that problem happened, some questions come to my mind:
> What raid1 bitmaps allow me to do?
- accelerate resync after a crash.
- accelerate recovery when you remove a drive and re-add it.
> Can they record _correctly_ the state of multiple removed disks, in
> order to overwrite only out-of-sync chunks of multiple removed disks?
All that is recorded is the set of regions which have been written to
since the array was last in a non-degraded state.
> In other words, am I allowed to do what I described above?
If the recovery that happened when you swapped drives was not a full
recovery, then probably not.
> If not, can I change something in my actions in order to have a daily
> backup using raid1?
I wrote something about this a few years ago...
http://permalink.gmane.org/gmane.linux.raid/35074
or this thread
http://www.spinics.net/lists/raid/msg35532.html
NeilBrown
>
>
> System details:
> # cat /etc/debian_version
> 6.0.10
> # mdadm --version
> mdadm - v3.1.4 - 31st August 2010
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: NeilBrown @ 2016-11-24 0:31 UTC (permalink / raw)
Cc: Shaohua Li, Christoph Hellwig, linux-raid
In-Reply-To: <20161123084540.GE16966@lst.de>
[-- Attachment #1: Type: text/plain, Size: 1565 bytes --]
On Wed, Nov 23 2016, Christoph Hellwig wrote:
> On Wed, Nov 23, 2016 at 01:08:38PM +1100, NeilBrown wrote:
>> For raid1/raid10 we could do a very similar thing. There is an
>> awkwardness in raid1 w.r.t waiting for bi_phys_segments to reach 1, but
>> that might disappear if Coly's resync changes go through.
>> Alternately it might make sense to use bio_split so there is one r1_bio
>> per bio.
>> I might try the raid10 version and see what it looks like.
>
> For RAID1 reads there already is one r1_bio per bio
unless the bad-block-log records some bad/missing blocks in the middle
of the range being read.
> - in fact I have
> a hack that doesn't allocate a r1_bio at all, but that one currently
> does not handle reads from degraded arrays at all. Due you remember
> why we only mark a leg fail for a given bio instead of on a per-device
> or at least per sector-range?
When we detect a read error, we try to "correct" it by reading the data
From elsewhere and over-writing the "bad" block. If that works, there
was no need to mark the whole device as faulty. If it does fail, that
is when the device is failed.
We could have a "soft-write-mostly" flag to discourage reads from a
questionable device, until it has been properly tested.
>
> For writes it would make sense to allocate the new bio for each mirror
> using a bio_set with front_pad for raid1-specific data, but I haven't
> really looked into the details yet.
Yes, you could. I don't know if it would be an improvement, or just a
change.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
* Re: [PATCH] md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0
From: Shaohua Li @ 2016-11-24 0:42 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161123201745.972168-1-songliubraving@fb.com>
On Wed, Nov 23, 2016 at 12:17:45PM -0800, Song Liu wrote:
> With writeback cache, we define log space critical as
>
> free_space < 2 * reclaim_required_space
>
> So the deassert of R5C_LOG_CRITICAL could happen when
> 1. free_space increases
> 2. reclaim_required_space decreases
>
> Currently, run_no_space_stripes() is called when 1 happens, but
> not (always) when 2 happens.
>
> With this patch, run_no_space_stripes() is call when
> R5C_LOG_CRITICAL is cleared.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/raid5-cache.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 8cb79fc..0e24aab 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -370,6 +370,7 @@ static inline void r5c_update_log_state(struct r5l_log *log)
> struct r5conf *conf = log->rdev->mddev->private;
> sector_t free_space;
> sector_t reclaim_space;
> + bool wake_reclaim = false;
>
> if (!r5c_is_writeback(log))
> return;
> @@ -379,12 +380,18 @@ static inline void r5c_update_log_state(struct r5l_log *log)
> reclaim_space = r5c_log_required_to_flush_cache(conf);
> if (free_space < 2 * reclaim_space)
> set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> - else
> + else {
> + if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state))
> + wake_reclaim = true;
> clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> + }
This doesn't look correct. shouldn't we clear the bit and do the wakeup right
after stripe_in_journal_count is decreased? Actually all these clear_bit looks
wrong. They should be cleared right after the condition met. In current
implementation, if no new stripe comes, r5c_update_log_state will not get
called.
> if (free_space < 3 * reclaim_space)
> set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> else
> clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +
> + if (wake_reclaim)
> + r5l_wake_reclaim(log, 0);
> }
>
> /*
> @@ -1348,6 +1355,10 @@ static void r5c_do_reclaim(struct r5conf *conf)
> spin_unlock(&conf->device_lock);
> spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags);
> }
> +
> + if (!test_bit(R5C_LOG_CRITICAL, &conf->cache_state))
> + r5l_run_no_space_stripes(log);
> +
Why this? r5l_do_reclaim will call r5l_run_no_space_stripes.
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH v3] md/r5cache: handle alloc_page failure
From: Shaohua Li @ 2016-11-24 0:51 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161123212018.1099843-1-songliubraving@fb.com>
On Wed, Nov 23, 2016 at 01:20:18PM -0800, Song Liu wrote:
> RMW of r5c write back cache uses an extra page to store old data for
> prexor. handle_stripe_dirtying() allocates this page by calling
> alloc_page(). However, alloc_page() may fail.
>
> To handle alloc_page() failures, this patch adds an extra page to
> disk_info. When alloc_page fails, handle_stripe() trys to use these
> pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
> the stripe is added to delayed_list.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/raid5-cache.c | 28 ++++++++++++++++++++++
> drivers/md/raid5.c | 61 +++++++++++++++++++++++++++++++++++++++++-------
> drivers/md/raid5.h | 6 +++++
> 3 files changed, 86 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 8cb79fc..3817d2b 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2334,15 +2334,43 @@ int r5c_try_caching_write(struct r5conf *conf,
> */
> void r5c_release_extra_page(struct stripe_head *sh)
> {
> + struct r5conf *conf = sh->raid_conf;
> int i;
> + bool using_disk_info_extra_page;
> +
> + using_disk_info_extra_page =
> + sh->dev[0].orig_page == conf->disks[0].extra_page;
>
> for (i = sh->disks; i--; )
> if (sh->dev[i].page != sh->dev[i].orig_page) {
> struct page *p = sh->dev[i].orig_page;
>
> sh->dev[i].orig_page = sh->dev[i].page;
> + if (!using_disk_info_extra_page)
> + put_page(p);
> + }
> +
> + if (using_disk_info_extra_page)
> + clear_bit(R5C_EXTRA_PAGE_IN_USE, &conf->cache_state);
Here we need to wakeup delay stripes (eg, wakeup raid5d)
> +}
> +
> +void r5c_use_extra_page(struct stripe_head *sh)
> +{
> + struct r5conf *conf = sh->raid_conf;
> + int i;
> + struct r5dev *dev;
> + struct page *p;
> +
> + for (i = sh->disks; i--; ) {
> + dev = &sh->dev[i];
> + if (dev->orig_page != dev->page) {
> + p = dev->orig_page;
> + dev->orig_page = dev->page;
> put_page(p);
> }
> + dev->orig_page = conf->disks[i].extra_page;
> + BUG_ON(!dev->orig_page);
> + }
> }
>
> /*
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index dbab8c7..7656530 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -876,6 +876,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>
> if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
> /* writing out phase */
> + if (s->waiting_extra_page)
> + return;
> if (r5l_write_stripe(conf->log, sh) == 0)
> return;
> } else { /* caching phase */
> @@ -2007,6 +2009,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
> INIT_LIST_HEAD(&sh->batch_list);
> INIT_LIST_HEAD(&sh->lru);
> INIT_LIST_HEAD(&sh->r5c);
> + INIT_LIST_HEAD(&sh->log_list);
> atomic_set(&sh->count, 1);
> sh->log_start = MaxSector;
> for (i = 0; i < disks; i++) {
> @@ -3580,10 +3583,10 @@ static void handle_stripe_clean_event(struct r5conf *conf,
> break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
> }
>
> -static void handle_stripe_dirtying(struct r5conf *conf,
> - struct stripe_head *sh,
> - struct stripe_head_state *s,
> - int disks)
> +static int handle_stripe_dirtying(struct r5conf *conf,
> + struct stripe_head *sh,
> + struct stripe_head_state *s,
> + int disks)
> {
> int rmw = 0, rcw = 0, i;
> sector_t recovery_cp = conf->mddev->recovery_cp;
> @@ -3649,12 +3652,32 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> dev->page == dev->orig_page &&
> !test_bit(R5_LOCKED, &sh->dev[sh->pd_idx].flags)) {
> /* alloc page for prexor */
> - dev->orig_page = alloc_page(GFP_NOIO);
> + struct page *p = alloc_page(GFP_NOIO);
> +
> + if (p) {
> + dev->orig_page = p;
> + continue;
> + }
>
> - /* will handle failure in a later patch*/
> - BUG_ON(!dev->orig_page);
> + /*
> + * alloc_page() failed, try use
> + * disk_info->extra_page
> + */
> + if (!test_and_set_bit(R5C_EXTRA_PAGE_IN_USE,
> + &conf->cache_state)) {
> + r5c_use_extra_page(sh);
> + break;
> + }
> +
> + /* extra_page in use, add to delayed_list */
> + set_bit(STRIPE_DELAYED, &sh->state);
> + s->waiting_extra_page = 1;
> + return -EAGAIN;
> }
> + }
>
> + for (i = disks; i--; ) {
> + struct r5dev *dev = &sh->dev[i];
> if ((dev->towrite ||
> i == sh->pd_idx || i == sh->qd_idx ||
> test_bit(R5_InJournal, &dev->flags)) &&
> @@ -3730,6 +3753,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> (s->locked == 0 && (rcw == 0 || rmw == 0) &&
> !test_bit(STRIPE_BIT_DELAY, &sh->state)))
> schedule_reconstruction(sh, s, rcw == 0, 0);
> + return 0;
> }
>
> static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
> @@ -4545,8 +4569,12 @@ static void handle_stripe(struct stripe_head *sh)
> if (ret == -EAGAIN ||
> /* stripe under reclaim: !caching && injournal */
> (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
> - s.injournal > 0))
> - handle_stripe_dirtying(conf, sh, &s, disks);
> + s.injournal > 0)) {
> + ret = handle_stripe_dirtying(conf, sh, &s,
> + disks);
> + if (ret == -EAGAIN)
> + goto finish;
> + }
> }
> }
>
> @@ -6458,6 +6486,8 @@ static void raid5_free_percpu(struct r5conf *conf)
>
> static void free_conf(struct r5conf *conf)
> {
> + int i;
> +
> if (conf->log)
> r5l_exit_log(conf->log);
> if (conf->shrinker.nr_deferred)
> @@ -6466,6 +6496,12 @@ static void free_conf(struct r5conf *conf)
> free_thread_groups(conf);
> shrink_stripes(conf);
> raid5_free_percpu(conf);
> + for (i = 0; i < conf->raid_disks; i++)
> + if (conf->disks[i].extra_page) {
> + put_page(conf->disks[i].extra_page);
> + conf->disks[i].extra_page = NULL;
> + }
> +
> kfree(conf->disks);
> kfree(conf->stripe_hashtbl);
> kfree(conf);
> @@ -6612,9 +6648,16 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>
> conf->disks = kzalloc(max_disks * sizeof(struct disk_info),
> GFP_KERNEL);
> +
> if (!conf->disks)
> goto abort;
>
> + for (i = 0; i < conf->raid_disks; i++) {
> + conf->disks[i].extra_page = alloc_page(GFP_KERNEL);
> + if (!conf->disks[i].extra_page)
> + goto abort;
> + }
resize can change raid_disks, so the extra_page should be reallocated at
resize.
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH 1/2] raid5-cache: suspend reclaim thread instead of shutdown
From: Shaohua Li @ 2016-11-24 0:56 UTC (permalink / raw)
To: NeilBrown; +Cc: Shaohua Li, linux-raid, Kernel-team, songliubraving
In-Reply-To: <87oa15bvcg.fsf@notabene.neil.brown.name>
On Thu, Nov 24, 2016 at 11:00:15AM +1100, Neil Brown wrote:
> On Wed, Nov 23 2016, Shaohua Li wrote:
>
> > On Wed, Nov 23, 2016 at 01:41:45PM +1100, NeilBrown wrote:
> >> On Tue, Nov 22 2016, Shaohua Li wrote:
> >>
> >> > There is mechanism to suspend a kernel thread. Use it instead of playing
> >> > create/destroy game.
> >>
> >> Good idea!
> >>
> >> >
> >> > Signed-off-by: Shaohua Li <shli@fb.com>
> >> > ---
> >> > drivers/md/md.c | 4 +++-
> >> > drivers/md/raid5-cache.c | 18 +++++-------------
> >> > 2 files changed, 8 insertions(+), 14 deletions(-)
> >> >
> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> > index d3cef77..f548469 100644
> >> > --- a/drivers/md/md.c
> >> > +++ b/drivers/md/md.c
> >> > @@ -7136,10 +7136,12 @@ static int md_thread(void *arg)
> >> > wait_event_interruptible_timeout
> >> > (thread->wqueue,
> >> > test_bit(THREAD_WAKEUP, &thread->flags)
> >> > - || kthread_should_stop(),
> >> > + || kthread_should_stop() || kthread_should_park(),
> >> > thread->timeout);
> >> >
> >> > clear_bit(THREAD_WAKEUP, &thread->flags);
> >> > + if (kthread_should_park())
> >> > + kthread_parkme();
> >> > if (!kthread_should_stop())
> >> > thread->run(thread);
> >> > }
> >> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> >> > index 8cb79fc..5f817bd 100644
> >> > --- a/drivers/md/raid5-cache.c
> >> > +++ b/drivers/md/raid5-cache.c
> >> > @@ -19,6 +19,7 @@
> >> > #include <linux/raid/md_p.h>
> >> > #include <linux/crc32c.h>
> >> > #include <linux/random.h>
> >> > +#include <linux/kthread.h>
> >> > #include "md.h"
> >> > #include "raid5.h"
> >> > #include "bitmap.h"
> >> > @@ -1437,23 +1438,14 @@ void r5l_quiesce(struct r5l_log *log, int state)
> >> > struct mddev *mddev;
> >> > if (!log || state == 2)
> >> > return;
> >> > - if (state == 0) {
> >> > - /*
> >> > - * This is a special case for hotadd. In suspend, the array has
> >> > - * no journal. In resume, journal is initialized as well as the
> >> > - * reclaim thread.
> >> > - */
> >> > - if (log->reclaim_thread)
> >> > - return;
> >> > - log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
> >> > - log->rdev->mddev, "reclaim");
> >> > - log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
> >> > - } else if (state == 1) {
> >> > + if (state == 0)
> >> > + kthread_unpark(log->reclaim_thread->tsk);
> >>
> >> The old code tested for log->reclaim_thread being NULL. This new
> >> version will just crash.
> >
> > But the reclaim_thread couldn't be NULL if log != NULL. Am I missing anything?
>
> Yes, you are right. The old code had a test that the new code didn't,
> which rang warning bells for me.
> Both now that we don't de-register the thread in r5l_quiesce(),
> log->reclaim_thread will never be NULL, so the test isn't needed.
>
>
> >
> >> > + else if (state == 1) {
> >> > /* make sure r5l_write_super_and_discard_space exits */
> >> > mddev = log->rdev->mddev;
> >> > wake_up(&mddev->sb_wait);
> >> > + kthread_park(log->reclaim_thread->tsk);
> >>
> >> r5l_do_reclaim has a wait loop internally. I think you need that to
> >> abort when kthread_should_park(), else this will block indefinitely.
> >
> > Sounds not harmful to me. The loop in r5l_do_reclaim will eventually end if all
> > data is reclaimed. Then the thread will be in the md_thread() loop. In that
> > loop, the thread will not sleep because the wait checks kthread_should_park().
> > Then the thread will get parked.
>
> Maybe ... it just looks odd.
> what is that while(1) {} loop really waiting for? It waits for there to
> be more than reclaim_target work to do, or for all the _ios lists to be
> empty. By the time r5l_quiesce() is called, all active stripes should
> have drained, so I guess that will abort quickly.
> Why is it waiting there, rather than in md_thread()? Why do we need
> that loop?
The r5c_do_reclaim is called in normal reclaim thread too, eg, not just
quiesce. At that time the wait is necessary, because some stripes are waiting
for free space, who can only be waken up after there is enough free space. You
are correct, the loop will abort quickly in quiesce.
Thanks,
Shaohua
^ permalink raw reply
* Re: Bug#784070: Newly-created arrays don't auto-assemble - related to hostname change?
From: Andy Smith @ 2016-11-24 1:24 UTC (permalink / raw)
To: Michael Tokarev; +Cc: NeilBrown, 784070, linux-raid
In-Reply-To: <c7f7e076-0fed-eb51-3291-f93cf66a49a3@msgid.tls.msk.ru>
Hi,
On Wed, Nov 23, 2016 at 12:03:49PM +0300, Michael Tokarev wrote:
> It was long ago when we disabled incremental assembly when
> you turned it on by default, and kept old static way to
> assemble arrays, because neither our initrd nor regular
> userpsace weren't ready for that.
Okay, so, on Debian jessie then, it is expected that md arrays on
devices that are only present after the initramfs is done working
will not be automatically (incrementally) started?
I saw Yann mentioned that in stretch the GOTO="md_inc_end" has been
removed again. Does that mean that incremental assembly on device
change is expected to work again in stretch (I have not tested it,
and most likely will not have time to do so with this hardware).
Thanks,
Andy
^ permalink raw reply
* Re: [PATCH 1/2] raid5-cache: suspend reclaim thread instead of shutdown
From: NeilBrown @ 2016-11-24 3:13 UTC (permalink / raw)
To: Shaohua Li; +Cc: Shaohua Li, linux-raid, Kernel-team, songliubraving
In-Reply-To: <20161124005609.d7mrshcgxl4kvlas@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 4587 bytes --]
On Thu, Nov 24 2016, Shaohua Li wrote:
> On Thu, Nov 24, 2016 at 11:00:15AM +1100, Neil Brown wrote:
>> On Wed, Nov 23 2016, Shaohua Li wrote:
>>
>> > On Wed, Nov 23, 2016 at 01:41:45PM +1100, NeilBrown wrote:
>> >> On Tue, Nov 22 2016, Shaohua Li wrote:
>> >>
>> >> > There is mechanism to suspend a kernel thread. Use it instead of playing
>> >> > create/destroy game.
>> >>
>> >> Good idea!
>> >>
>> >> >
>> >> > Signed-off-by: Shaohua Li <shli@fb.com>
>> >> > ---
>> >> > drivers/md/md.c | 4 +++-
>> >> > drivers/md/raid5-cache.c | 18 +++++-------------
>> >> > 2 files changed, 8 insertions(+), 14 deletions(-)
>> >> >
>> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> >> > index d3cef77..f548469 100644
>> >> > --- a/drivers/md/md.c
>> >> > +++ b/drivers/md/md.c
>> >> > @@ -7136,10 +7136,12 @@ static int md_thread(void *arg)
>> >> > wait_event_interruptible_timeout
>> >> > (thread->wqueue,
>> >> > test_bit(THREAD_WAKEUP, &thread->flags)
>> >> > - || kthread_should_stop(),
>> >> > + || kthread_should_stop() || kthread_should_park(),
>> >> > thread->timeout);
>> >> >
>> >> > clear_bit(THREAD_WAKEUP, &thread->flags);
>> >> > + if (kthread_should_park())
>> >> > + kthread_parkme();
>> >> > if (!kthread_should_stop())
>> >> > thread->run(thread);
>> >> > }
>> >> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> >> > index 8cb79fc..5f817bd 100644
>> >> > --- a/drivers/md/raid5-cache.c
>> >> > +++ b/drivers/md/raid5-cache.c
>> >> > @@ -19,6 +19,7 @@
>> >> > #include <linux/raid/md_p.h>
>> >> > #include <linux/crc32c.h>
>> >> > #include <linux/random.h>
>> >> > +#include <linux/kthread.h>
>> >> > #include "md.h"
>> >> > #include "raid5.h"
>> >> > #include "bitmap.h"
>> >> > @@ -1437,23 +1438,14 @@ void r5l_quiesce(struct r5l_log *log, int state)
>> >> > struct mddev *mddev;
>> >> > if (!log || state == 2)
>> >> > return;
>> >> > - if (state == 0) {
>> >> > - /*
>> >> > - * This is a special case for hotadd. In suspend, the array has
>> >> > - * no journal. In resume, journal is initialized as well as the
>> >> > - * reclaim thread.
>> >> > - */
>> >> > - if (log->reclaim_thread)
>> >> > - return;
>> >> > - log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
>> >> > - log->rdev->mddev, "reclaim");
>> >> > - log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
>> >> > - } else if (state == 1) {
>> >> > + if (state == 0)
>> >> > + kthread_unpark(log->reclaim_thread->tsk);
>> >>
>> >> The old code tested for log->reclaim_thread being NULL. This new
>> >> version will just crash.
>> >
>> > But the reclaim_thread couldn't be NULL if log != NULL. Am I missing anything?
>>
>> Yes, you are right. The old code had a test that the new code didn't,
>> which rang warning bells for me.
>> Both now that we don't de-register the thread in r5l_quiesce(),
>> log->reclaim_thread will never be NULL, so the test isn't needed.
>>
>>
>> >
>> >> > + else if (state == 1) {
>> >> > /* make sure r5l_write_super_and_discard_space exits */
>> >> > mddev = log->rdev->mddev;
>> >> > wake_up(&mddev->sb_wait);
>> >> > + kthread_park(log->reclaim_thread->tsk);
>> >>
>> >> r5l_do_reclaim has a wait loop internally. I think you need that to
>> >> abort when kthread_should_park(), else this will block indefinitely.
>> >
>> > Sounds not harmful to me. The loop in r5l_do_reclaim will eventually end if all
>> > data is reclaimed. Then the thread will be in the md_thread() loop. In that
>> > loop, the thread will not sleep because the wait checks kthread_should_park().
>> > Then the thread will get parked.
>>
>> Maybe ... it just looks odd.
>> what is that while(1) {} loop really waiting for? It waits for there to
>> be more than reclaim_target work to do, or for all the _ios lists to be
>> empty. By the time r5l_quiesce() is called, all active stripes should
>> have drained, so I guess that will abort quickly.
>> Why is it waiting there, rather than in md_thread()? Why do we need
>> that loop?
>
> The r5c_do_reclaim is called in normal reclaim thread too, eg, not just
> quiesce. At that time the wait is necessary, because some stripes are waiting
> for free space, who can only be waken up after there is enough free space. You
> are correct, the loop will abort quickly in quiesce.
OK, that makes sense. Thanks.
Reviewed-by: NeilBrown <neilb@suse.de>
for both these patches.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
* Re: [PATCH/RFC] add "failfast" support for raid1/raid10.
From: NeilBrown @ 2016-11-24 4:47 UTC (permalink / raw)
To: Jack Wang
Cc: Shaohua Li, linux-raid, linux-block, Christoph Hellwig,
linux-kernel, hare
In-Reply-To: <CA+res+TZD2Lx4CaGPx6SjTc6VKGLUc=qyYBgi15oSYU1bBuHew@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2463 bytes --]
On Sat, Nov 19 2016, Jack Wang wrote:
> 2016-11-18 6:16 GMT+01:00 NeilBrown <neilb@suse.com>:
>> Hi,
>>
>> I've been sitting on these patches for a while because although they
>> solve a real problem, it is a fairly limited use-case, and I don't
>> really like some of the details.
>>
>> So I'm posting them as RFC in the hope that a different perspective
>> might help me like them better, or find a better approach.
>>
>> The core idea is that when you have multiple copies of data
>> (i.e. mirrored drives) it doesn't make sense to wait for a read from
>> a drive that seems to be having problems. It will probably be faster
>> to just cancel that read, and read from the other device.
>> Similarly, in some circumstances, it might be better to fail a drive
>> that is being slow to respond to writes, rather than cause all writes
>> to be very slow.
>>
>> The particular context where this comes up is when mirroring across
>> storage arrays, where the storage arrays can temporarily take an
>> unusually long time to respond to requests (firmware updates have
>> been mentioned). As the array will have redundancy internally, there
>> is little risk to the data. The mirrored pair is really only for
>> disaster recovery, and it is deemed better to lose the last few
>> minutes of updates in the case of a serious disaster, rather than
>> occasionally having latency issues because one array needs to do some
>> maintenance for a few minutes. The particular storage arrays in
>> question are DASD devices which are part of the s390 ecosystem.
>
> Hi Neil,
>
> Thanks for pushing this feature also to mainline.
> We at Profitbricks use raid1 across IB network, one pserver with
> raid1, both legs on 2 remote storages.
> We've noticed if one remote storage crash , and raid1 still keep
> sending IO to the faulty leg, even after 5 minutes,
> md still redirect I/Os, and md refuse to remove active disks, eg:
That make sense. It cannot remove the active disk until all pending IO
completes, either with an error or with success.
If the target has a long timeout, that can delay progress a lot.
>
> I tried to port you patch from SLES[1], with the patchset, it reduce
> the time to ~30 seconds.
>
> I'm happy to see this feature upstream :)
> I will test again this new patchset.
Thanks for your confirmation that this is more generally useful than I
thought, and I'm always happy to hear for more testing :-)
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
* [PATCH v4] md/r5cache: handle alloc_page failure
From: Song Liu @ 2016-11-24 5:14 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.
To handle alloc_page() failures, this patch adds an extra page to
disk_info. When alloc_page fails, handle_stripe() trys to use these
pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
the stripe is added to delayed_list.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5-cache.c | 30 +++++++++++++++++
drivers/md/raid5.c | 84 +++++++++++++++++++++++++++++++++++++++++-------
drivers/md/raid5.h | 6 ++++
3 files changed, 109 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8cb79fc..c395787 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2334,15 +2334,45 @@ int r5c_try_caching_write(struct r5conf *conf,
*/
void r5c_release_extra_page(struct stripe_head *sh)
{
+ struct r5conf *conf = sh->raid_conf;
int i;
+ bool using_disk_info_extra_page;
+
+ using_disk_info_extra_page =
+ sh->dev[0].orig_page == conf->disks[0].extra_page;
for (i = sh->disks; i--; )
if (sh->dev[i].page != sh->dev[i].orig_page) {
struct page *p = sh->dev[i].orig_page;
sh->dev[i].orig_page = sh->dev[i].page;
+ if (!using_disk_info_extra_page)
+ put_page(p);
+ }
+
+ if (using_disk_info_extra_page) {
+ clear_bit(R5C_EXTRA_PAGE_IN_USE, &conf->cache_state);
+ md_wakeup_thread(conf->mddev->thread);
+ }
+}
+
+void r5c_use_extra_page(struct stripe_head *sh)
+{
+ struct r5conf *conf = sh->raid_conf;
+ int i;
+ struct r5dev *dev;
+ struct page *p;
+
+ for (i = sh->disks; i--; ) {
+ dev = &sh->dev[i];
+ if (dev->orig_page != dev->page) {
+ p = dev->orig_page;
+ dev->orig_page = dev->page;
put_page(p);
}
+ dev->orig_page = conf->disks[i].extra_page;
+ BUG_ON(!dev->orig_page);
+ }
}
/*
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index dbab8c7..5454c3b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -876,6 +876,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
/* writing out phase */
+ if (s->waiting_extra_page)
+ return;
if (r5l_write_stripe(conf->log, sh) == 0)
return;
} else { /* caching phase */
@@ -2007,6 +2009,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
INIT_LIST_HEAD(&sh->batch_list);
INIT_LIST_HEAD(&sh->lru);
INIT_LIST_HEAD(&sh->r5c);
+ INIT_LIST_HEAD(&sh->log_list);
atomic_set(&sh->count, 1);
sh->log_start = MaxSector;
for (i = 0; i < disks; i++) {
@@ -2255,8 +2258,27 @@ static int resize_stripes(struct r5conf *conf, int newsize)
if (ndisks) {
for (i=0; i<conf->raid_disks; i++)
ndisks[i] = conf->disks[i];
- kfree(conf->disks);
- conf->disks = ndisks;
+
+ /* allocate extra_page for ndisks */
+ for (i = 0; i < newsize; i++) {
+ ndisks[i].extra_page = alloc_page(GFP_NOIO);
+ if (!ndisks[i].extra_page)
+ err = -ENOMEM;
+ }
+
+ if (err) {
+ /* if any error, free extra_page for ndisks */
+ for (i = 0; i < newsize; i++)
+ if (ndisks[i].extra_page)
+ put_page(ndisks[i].extra_page);
+ kfree(ndisks);
+ } else {
+ /* if no error, free extra_page for old disks */
+ for (i = 0; i < conf->previous_raid_disks; i++)
+ put_page(ndisks[i].extra_page);
+ kfree(conf->disks);
+ conf->disks = ndisks;
+ }
} else
err = -ENOMEM;
@@ -3580,10 +3602,10 @@ static void handle_stripe_clean_event(struct r5conf *conf,
break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
}
-static void handle_stripe_dirtying(struct r5conf *conf,
- struct stripe_head *sh,
- struct stripe_head_state *s,
- int disks)
+static int handle_stripe_dirtying(struct r5conf *conf,
+ struct stripe_head *sh,
+ struct stripe_head_state *s,
+ int disks)
{
int rmw = 0, rcw = 0, i;
sector_t recovery_cp = conf->mddev->recovery_cp;
@@ -3649,12 +3671,32 @@ static void handle_stripe_dirtying(struct r5conf *conf,
dev->page == dev->orig_page &&
!test_bit(R5_LOCKED, &sh->dev[sh->pd_idx].flags)) {
/* alloc page for prexor */
- dev->orig_page = alloc_page(GFP_NOIO);
+ struct page *p = alloc_page(GFP_NOIO);
+
+ if (p) {
+ dev->orig_page = p;
+ continue;
+ }
+
+ /*
+ * alloc_page() failed, try use
+ * disk_info->extra_page
+ */
+ if (!test_and_set_bit(R5C_EXTRA_PAGE_IN_USE,
+ &conf->cache_state)) {
+ r5c_use_extra_page(sh);
+ break;
+ }
- /* will handle failure in a later patch*/
- BUG_ON(!dev->orig_page);
+ /* extra_page in use, add to delayed_list */
+ set_bit(STRIPE_DELAYED, &sh->state);
+ s->waiting_extra_page = 1;
+ return -EAGAIN;
}
+ }
+ for (i = disks; i--; ) {
+ struct r5dev *dev = &sh->dev[i];
if ((dev->towrite ||
i == sh->pd_idx || i == sh->qd_idx ||
test_bit(R5_InJournal, &dev->flags)) &&
@@ -3730,6 +3772,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
(s->locked == 0 && (rcw == 0 || rmw == 0) &&
!test_bit(STRIPE_BIT_DELAY, &sh->state)))
schedule_reconstruction(sh, s, rcw == 0, 0);
+ return 0;
}
static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
@@ -4545,8 +4588,12 @@ static void handle_stripe(struct stripe_head *sh)
if (ret == -EAGAIN ||
/* stripe under reclaim: !caching && injournal */
(!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
- s.injournal > 0))
- handle_stripe_dirtying(conf, sh, &s, disks);
+ s.injournal > 0)) {
+ ret = handle_stripe_dirtying(conf, sh, &s,
+ disks);
+ if (ret == -EAGAIN)
+ goto finish;
+ }
}
}
@@ -6458,6 +6505,8 @@ static void raid5_free_percpu(struct r5conf *conf)
static void free_conf(struct r5conf *conf)
{
+ int i;
+
if (conf->log)
r5l_exit_log(conf->log);
if (conf->shrinker.nr_deferred)
@@ -6466,6 +6515,12 @@ static void free_conf(struct r5conf *conf)
free_thread_groups(conf);
shrink_stripes(conf);
raid5_free_percpu(conf);
+ for (i = 0; i < conf->raid_disks; i++)
+ if (conf->disks[i].extra_page) {
+ put_page(conf->disks[i].extra_page);
+ conf->disks[i].extra_page = NULL;
+ }
+
kfree(conf->disks);
kfree(conf->stripe_hashtbl);
kfree(conf);
@@ -6612,9 +6667,16 @@ static struct r5conf *setup_conf(struct mddev *mddev)
conf->disks = kzalloc(max_disks * sizeof(struct disk_info),
GFP_KERNEL);
+
if (!conf->disks)
goto abort;
+ for (i = 0; i < conf->raid_disks; i++) {
+ conf->disks[i].extra_page = alloc_page(GFP_KERNEL);
+ if (!conf->disks[i].extra_page)
+ goto abort;
+ }
+
conf->mddev = mddev;
if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d13fe45..ed8e136 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -276,6 +276,7 @@ struct stripe_head_state {
struct md_rdev *blocked_rdev;
int handle_bad_blocks;
int log_failed;
+ int waiting_extra_page;
};
/* Flags for struct r5dev.flags */
@@ -439,6 +440,7 @@ enum {
struct disk_info {
struct md_rdev *rdev, *replacement;
+ struct page *extra_page; /* extra page to use in prexor */
};
/*
@@ -559,6 +561,9 @@ enum r5_cache_state {
* only process stripes that are already
* occupying the log
*/
+ R5C_EXTRA_PAGE_IN_USE, /* a stripe is using disk_info.extra_page
+ * for prexor
+ */
};
struct r5conf {
@@ -765,6 +770,7 @@ extern void
r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_release_extra_page(struct stripe_head *sh);
+extern void r5c_use_extra_page(struct stripe_head *sh);
extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
extern void r5c_handle_cached_data_endio(struct r5conf *conf,
struct stripe_head *sh, int disks, struct bio_list *return_bi);
--
2.9.3
^ permalink raw reply related
* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: NeilBrown @ 2016-11-24 5:45 UTC (permalink / raw)
To: Shaohua Li, Coly Li
Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
Guoqing Jiang
In-Reply-To: <20161122213541.btgw4cpoly5j4jpc@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 7494 bytes --]
On Wed, Nov 23 2016, Shaohua Li wrote:
>
> Thanks! The idea is awesome and does makes the code easier to understand.
This is a key point - simpler code!!
>
> For hash conflict, I'm worrying about one case. resync does sequential access.
> Say we have a sequential normal IO. If the first normal IO and resync IO have
> conflict, it's possible they will have conflict subsequently, since both are
> doing sequential access. We can change the hash to something like this:
>
> for the first 64M * 512, the hash is 0, 1, ... 511
> For the second 64M * 512, the hash is 1, 3, 5 ...
> The third 64M * 512, the hash is 2, 5, 8 ...
This would happen when the write and the resync are at different
addresses which happen to collide in the hash.
They would only remain in sync if the two proceeded at the same speed.
Normally resync is throttled when there is competing IO, so that is
fairly unlikely.
If this really was important, it might make sense to just use
hash_long() to make the relevant bits of the sector number into and
index. (i.e. keep the code simple)
>>
>> If user has a (realy) large raid1 device, for example 10PB size, we may
>> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
>> it is possible to be a raid1-created-time-defined variable in future.
I don't think the size of the array is very relevant. No matter how
big the array is, resync can be expected to directly interfere with 0.2%
of all writes. So, assuming a fairly random distribution over times,
most writes will most often not be blocked by resync at all.
I wouldn't give any thought to any problems like this until they are
actually measured.
>> static void reschedule_retry(struct r1bio *r1_bio)
>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>> unsigned long flags;
>> struct mddev *mddev = r1_bio->mddev;
>> struct r1conf *conf = mddev->private;
>> + sector_t sector_nr;
>> + long idx;
I wonder about the use of "long" for "idx".
As that is an array index, it would have to be a REALLY big array before
"long" is better than "int".
>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>> if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>> bio->bi_error = -EIO;
>>
>> - if (done) {
>> + if (done)
>> bio_endio(bio);
>> - /*
>> - * Wake up any possible resync thread that waits for the device
>> - * to go idle.
>> - */
>> - allow_barrier(conf, start_next_window, bi_sector);
>> - }
>> }
>>
>> static void raid_end_bio_io(struct r1bio *r1_bio)
>> {
>> struct bio *bio = r1_bio->master_bio;
>> + struct r1conf *conf = r1_bio->mddev->private;
>>
>> /* if nobody has done the final endio yet, do it now */
>> if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>
>> call_bio_endio(r1_bio);
>> }
>> +
>> + /*
>> + * Wake up any possible resync thread that waits for the device
>> + * to go idle.
>> + */
>> + allow_barrier(conf, r1_bio->sector);
> Why this change?
I wondered too. I think it may be correct, but it should be in a
separate patch.
When you have a write-mostly device, I think the current code will
allow_barrier() before the writes to the write-mostly devices have
completed.
>>
>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>> + sector_t sectors)
>> +{
>> + sector_t len;
>> +
>> + WARN_ON(sectors == 0);
>> + /* len is the number of sectors from start_sector to end of the
>> + * barrier unit which start_sector belongs to.
>> + */
>> + len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>> + (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>> + start_sector;
>> +
>> + if (len > sectors)
>> + len = sectors;
>> +
>> + return len;
>> +}
>
> I'd prefer calling bio_split at the early of raid1_make_request and split the
> bio if it crosses the bucket boundary. please see raid10_make_request for
> example. resync will not cross the boundary. So the code will not need to worry
> about the boundary. I believe this will make the code simpiler (all the
> align_to_barrier_unit_end calling can removed) and easy to understand.
align_to_barrier_unit_end() is only called twice, once for writes and
once for resync/recovery. If bio_split() were used, you would still
need the same number of calls.
Changing raid1.c to use bio_split() more may well be a valuable
simplification, but I think it is a separate issue.
>> wait_event_lock_irq(conf->wait_barrier,
>> - !conf->array_frozen &&
>> - conf->barrier < RESYNC_DEPTH &&
>> - conf->current_window_requests == 0 &&
>> - (conf->start_next_window >=
>> - conf->next_resync + RESYNC_SECTORS),
>> + !conf->array_frozen && !conf->nr_pending[idx] &&
>> + conf->barrier[idx] < RESYNC_DEPTH,
>> conf->resync_lock);
>
> So there is a slight behavior change. Originally we guarautee no more than
> RESYNC_DEPTH sync. Now this only checks 'conf->barrier[idx] < RESYNC_DEPTH'.
> How about barrier[idx-1], barrier[idx-2]...? It's possible conf->barrier[idx] <
> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how
> important this is, but at least we can check barrier[idx] + barrier[idx-1].
I confess that I'm not entirely sure the point of RESYNC_DEPTH - it was
already in the code when I started working on it.
My best guess is that it limits the number of requests we need to wait
for before regular writes can be permitted in the resync region.
In that case, the current code is good enough.
Your "barrier[idx] + barrier[idx-1]" idea is interesting, but would only
make a difference 1/1024 of the time (bucket is 64Meg, the
RESYNC_BLOCK_SIZE is 64K).
>> +
>> +static void wait_all_barriers(struct r1conf *conf)
>> +{
>> + long idx;
>> +
>> + for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>> + _wait_barrier(conf, idx);
>
> This is racy. Since resync is sequential, idealy we only need to check the
> bucket sequentially. The compiler could do _wait_barrier(conf, 1) and then
> _wait_barrier(conf, 0). It's possible the bucket 1 has barrier right after the
> check of bucket 0. Even the compiler doesn't reorder, there is case the bucket
> 511 should be check before bucket 0 if the resync is just crossing 512 buckets.
> It would be better we check the resync position and adjust the bucket wait
> order according to the position.
I cannot see how it is racy. No new sync requests will be issued at
this time, so once ->barrier[idx] reaches 0, it will stay at zero.
>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>> if (!conf)
>> goto abort;
>>
>> + conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> + if (!conf->nr_pending)
>> + goto abort;
>
> This allocation is a little wierd. I'd define a count and uses
> sizeof(nr_pending) * count to do the allocation. There is nothing related to
> PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
The thing about PAGE_SIZE is that the allocation will succeed and won't
waste space.
If you make all the arrays simple members of r1conf, r1conf will be
about 4pages larger, so will require an 8-page allocation, which is more
likely to fail.
I agree that it seems a little weird, but I think it results in the best
use of memory.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Guoqing Jiang @ 2016-11-24 6:05 UTC (permalink / raw)
To: NeilBrown, Shaohua Li, Coly Li
Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn
In-Reply-To: <871sy1bfd9.fsf@notabene.neil.brown.name>
On 11/24/2016 01:45 PM, NeilBrown wrote:
>
>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>> if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>> bio->bi_error = -EIO;
>>>
>>> - if (done) {
>>> + if (done)
>>> bio_endio(bio);
>>> - /*
>>> - * Wake up any possible resync thread that waits for the device
>>> - * to go idle.
>>> - */
>>> - allow_barrier(conf, start_next_window, bi_sector);
>>> - }
>>> }
>>>
>>> static void raid_end_bio_io(struct r1bio *r1_bio)
>>> {
>>> struct bio *bio = r1_bio->master_bio;
>>> + struct r1conf *conf = r1_bio->mddev->private;
>>>
>>> /* if nobody has done the final endio yet, do it now */
>>> if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>
>>> call_bio_endio(r1_bio);
>>> }
>>> +
>>> + /*
>>> + * Wake up any possible resync thread that waits for the device
>>> + * to go idle.
>>> + */
>>> + allow_barrier(conf, r1_bio->sector);
>> Why this change?
> I wondered too. I think it may be correct, but it should be in a
> separate patch.
> When you have a write-mostly device, I think the current code will
> allow_barrier() before the writes to the write-mostly devices have
> completed.
>
Seems the change is moved from call_bio_endio, but call_bio_endio is
also called
from raid1_end_write_request, I think it is better to keep the original
code.
Thanks,
Guoqing
^ permalink raw reply
* Re: [PATCH v4] md/r5cache: handle alloc_page failure
From: NeilBrown @ 2016-11-24 6:18 UTC (permalink / raw)
To: linux-raid
Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
In-Reply-To: <20161124051404.1969242-1-songliubraving@fb.com>
[-- Attachment #1: Type: text/plain, Size: 2517 bytes --]
On Thu, Nov 24 2016, Song Liu wrote:
> +void r5c_use_extra_page(struct stripe_head *sh)
> +{
> + struct r5conf *conf = sh->raid_conf;
> + int i;
> + struct r5dev *dev;
> + struct page *p;
> +
> + for (i = sh->disks; i--; ) {
> + dev = &sh->dev[i];
> + if (dev->orig_page != dev->page) {
> + p = dev->orig_page;
> + dev->orig_page = dev->page;
> put_page(p);
> }
> + dev->orig_page = conf->disks[i].extra_page;
It seems a bit pointless to assign to dev->orig_page twice.
Why not:
if (dev->orig_page != dev->page)
put_page(dev->orig_page);
dev->orig_page = conf->......
??
> @@ -2255,8 +2258,27 @@ static int resize_stripes(struct r5conf *conf, int newsize)
> if (ndisks) {
> for (i=0; i<conf->raid_disks; i++)
> ndisks[i] = conf->disks[i];
> - kfree(conf->disks);
> - conf->disks = ndisks;
> +
> + /* allocate extra_page for ndisks */
> + for (i = 0; i < newsize; i++) {
> + ndisks[i].extra_page = alloc_page(GFP_NOIO);
> + if (!ndisks[i].extra_page)
> + err = -ENOMEM;
> + }
> +
> + if (err) {
> + /* if any error, free extra_page for ndisks */
> + for (i = 0; i < newsize; i++)
> + if (ndisks[i].extra_page)
> + put_page(ndisks[i].extra_page);
> + kfree(ndisks);
> + } else {
> + /* if no error, free extra_page for old disks */
> + for (i = 0; i < conf->previous_raid_disks; i++)
> + put_page(ndisks[i].extra_page);
> + kfree(conf->disks);
> + conf->disks = ndisks;
> + }
This looks a bit odd too. We never reduce conf->pool_size, so we never
need to free anything.
for (i = conf->pool_size; i < newsize; i++)
if ((ndisks[i].extra_page = alloc_page(GFP_NOIO)) == NULL)
err = -ENOMEM;
for (i = conf->pool_size; err == -ENOMEM && i < newsize; i++)
if (ndisks[i].extra_page)
put_page(ndisks[i].extra_page);
Maybe that it a little terse, but something like that would be better I
think.
Certainly you don't need to free ndisks. If the allocation succeeds,
just use the new array whether other allocations succeed or fail.
> @@ -6466,6 +6515,12 @@ static void free_conf(struct r5conf *conf)
> free_thread_groups(conf);
> shrink_stripes(conf);
> raid5_free_percpu(conf);
> + for (i = 0; i < conf->raid_disks; i++)
> + if (conf->disks[i].extra_page) {
> + put_page(conf->disks[i].extra_page);
> + conf->disks[i].extra_page = NULL;
There is no point setting extra_page to NULL, as the whole array is
freed on the next line.
Apart from those few little things, it looks good. Thanks.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox