Linux RAID subsystem development


* Re: Can't mount Old RHEL 6 Raid with new install of CentOS 7, now can't mount with original RHEL 6
From: Chris Murphy @ 2016-08-16 17:58 UTC (permalink / raw)
  To: John Dawson; +Cc: Chris Murphy, Linux-RAID
In-Reply-To: <57d23adc6db12d6a365dd2c6f50d4bf5@celticblues.com>

On Tue, Aug 16, 2016 at 11:35 AM, John Dawson <linux@celticblues.com> wrote:

> ===============================================
> "sudo mdadm -D /dev/md127" results
> ===============================================
>
> /dev/md127:
>         Version : 1.2
>   Creation Time : Fri Aug  5 16:46:10 2016

Did you really create this array 10 days ago with CentOS 6 and then
upgrade to CentOS 7 and it didn't work? Or did you happen to try to
fix the problem by doing mdadm --create?

-- 
Chris Murphy


* Re: read errors with md RAID5 array
From: Chris Murphy @ 2016-08-16 18:25 UTC (permalink / raw)
  To: Tim Small; +Cc: Andreas Klauer, linux-raid@vger.kernel.org
In-Reply-To: <d1ccb93d-6856-4b07-04ea-3c4ff9541d60@buttersideup.com>

On Tue, Aug 16, 2016 at 5:40 AM, Tim Small <tim@buttersideup.com> wrote:

> # for i in a c d ; do mdadm --examine-badblocks  /dev/sd${i}2 ; done
> Bad-blocks on /dev/sda2:
>           2321554488 for 512 sectors
>           2321555000 for 512 sectors
>           2321555512 for 152 sectors
> Bad-blocks on /dev/sdc2:
>              1656848 for 128 sectors
>             28490768 for 512 sectors
>             28491280 for 392 sectors
>             28572344 for 120 sectors
>             32760864 for 128 sectors
>           2321554488 for 512 sectors
>           2321555000 for 512 sectors
>           2321555512 for 152 sectors
> Bad-blocks on /dev/sdd2:
>              1656848 for 128 sectors
>             28490768 for 512 sectors
>             28491280 for 392 sectors
>             28572344 for 120 sectors
>             32760864 for 128 sectors
>           2321554488 for 512 sectors
>           2321555000 for 512 sectors
>           2321555512 for 152 sectors

Does this actually jibe with what the drive is reporting? I would only
expect bad blocks to get populated if there's a write error, and I'd
only expect a user to opt in to a bad blocks list if they're basically
saying that they will not or cannot (mainly on economic grounds)
replace a drive that has no reserve sectors remaining for remapping.

I'm with Andreas on this aspect that silently accumulating a list of
bad sectors is specious. But I also can't tell if that's a factor in
this. But this is suggesting f'n ass tons of bad sectors. The sdd2
partition alone has over 2000 bad sectors? What? If that were true
it's a disqualified drive.
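
For reference, the per-device totals can be tallied straight from the
--examine-badblocks output. This is only a sketch: sum_bad_sectors is a
made-up helper name, and it assumes every entry has the
"<start> for <count> sectors" shape shown in the listing above.

```shell
# Sum the sector counts reported by `mdadm --examine-badblocks`.
# Assumption: each entry reads "<start> for <count> sectors"; header lines
# such as "Bad-blocks on /dev/sdd2:" do not match and are skipped.
sum_bad_sectors() {
    awk '$2 == "for" && $4 == "sectors" { total += $3 } END { print total + 0 }'
}
```

Piping `mdadm --examine-badblocks /dev/sdd2` into sum_bad_sectors should
give 2456 for the sdd2 list quoted above, and 1176 for sda2, provided the
output format matches that assumption.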

>
> I didn't know about the bad block functionality in md.  The mdadm manual
> page doesn't say much, so is this the canonical document?
>
> http://neil.brown.name/blog/20100519043730
>
> Until recently, two of the drives (sda, sdc) were running a firmware
> version which (as far as I can work out) made them occasionally lock up
> and disappear from the OS (requiring a power cycle), this firmware has
> now been updated, so hopefully they'll now behave.

There are user reports of firmware updates causing latent problems
that persist until data is overwritten. I personally always do ATA
Secure Erase, or Enhanced Secure Erase, using hdparm, anytime a drive
gets a firmware update. Unless I simply don't care about the data on
the drive.

> Degraded array reporting was also broken on this machine for a couple of
> weeks due to an email misconfiguration (now fixed), so last week I found
> it with sda (ML0220F30ZE35D) apparently missing from the machine, and
> also with pending sectors on sdb (ML0220F31085KD).  The array rebuilt
> quite quickly from the bitmap, and then I turned to trying to resolve
> the pending sectors...

It's worth checking alignment to make sure writes really are
happening on 4KiB boundaries. It's sorta old news, but older
tools did not always use aligned values (parted and fdisk frequently
started the first partition at LBA 34 for example, which is not
aligned). This causes internal RMW in the drive, where any write is
actually treated by the drive first as a read, and can produce a
persistent read error that never gets fixed. To avoid the write being
treated as a RMW, it must be a complete 4KiB write that starts on a
512 byte LBA that is a multiple of 8. Kinda annoying...
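
A quick way to check a partition start, as a sketch only. is_4k_aligned
is a hypothetical helper; it assumes the start is given in 512-byte
LBAs, which is the unit /sys/block/<disk>/<partition>/start reports.

```shell
# Succeed if a partition start, given in 512-byte LBAs, lands on a 4 KiB
# boundary. Assumption: the drive's physical sector size is 4096 bytes.
is_4k_aligned() {
    start_lba=$1
    [ $(( start_lba * 512 % 4096 )) -eq 0 ]
}
```

For example, `is_4k_aligned "$(cat /sys/block/sda/sda2/start)" && echo aligned`.
A start of LBA 34 fails the check and a start of LBA 2048 passes it.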

> I'm not really sure from the blog post, under what circumstances a bad
> block entry would end up being written to multiple devices in the array,
> and under what circumstances it might be written to all devices in an
> array?  There are no entries on these array members which appear on only
> one array member, and some are present on all three drives - which seems
> strange to me.

Yes, strange. But even if it's cross-referencing bad sectors across all
drives, 2000+ bad sectors across 3 drives is too many.

> FWIW, what I'd like to do in the future with this array, is to reshape
> it into a 4 drive RAID6, and then grow it to a 5 drive RAID6, and
> possibly replace one or both of sda (ML0220F30ZE35D) and sdc
> (ML0220F31085KD).

Well I would defer to most anyone on this list, but given the state of
things, I have serious doubts about the array. I personally would
qualify five new drives with badblocks -w, which by default runs four
destructive write-read-verify patterns, making sure to change -b to
4096, and then make a new raid6 array and migrate the data over.

I consider the current state of the array sufficiently fragile that a
reshape risks all the data. So I would only do a reshape then grow if
you have a current backup that you're prepared to restore from (that's
true anyway even if healthy, but in particular with an array that's in
a weak state).

> In the meantime I'm trying to work out what data (if any) is now
> inaccessible.  This is made slightly more interesting because this array
> has 'bcache' sitting in front of it, so I might have good data in the
> cache on the SSD which is marked bad/inaccessible on the raid5 md device.

OK that's a whole separate ball of wax now.  Do you realize that
bcache is orphaned? For all we know you're running into bcache related
bugs at this point.

If your use case really benefits from an SSD cache, you should look at
lvmcache. And in that case you probably ought to evaluate if it makes
more sense to manage the RAID using LVM entirely rather than mdadm.
It's the same md kernel code, but it's created, managed, monitored all
by LVM tools and metadata. So you get things like per logical volume
RAID levels. The feature set of LVM is really incredibly vast and
often overwhelming, and yet on the RAID front mdadm still has more to
offer and I think is easier to use. But for your use case it might be
easier in the long run to consolidate on one set of tools.

-- 
Chris Murphy


* Re: Can't mount Old RHEL 6 Raid with new install of CentOS 7, now can't mount with original RHEL 6
From: Chris Murphy @ 2016-08-16 18:44 UTC (permalink / raw)
  To: John Dawson; +Cc: Chris Murphy, Linux-RAID
In-Reply-To: <fd2b2aab-a0c1-4e20-89e2-f859b488e4e9@Spark>

On Tue, Aug 16, 2016 at 12:33 PM,  <linux@celticblues.com> wrote:
> The array was originally created well over a year ago, perhaps 2 or 3 years
> ago. It was working great. When the machine was updated to CentOS 7, the
> raid assembled manually fine and was mountable only once. After rebooting
> and trying to mount in /etc/fstab it could no longer be assembled and
> mounted...

How do you explain your mdadm -E and -D output, which shows the
creation time 10 days ago?

> ===============================================
> "sudo mdadm -D /dev/md127" results
> ===============================================
>
> /dev/md127:
> Version : 1.2
> Creation Time : Fri Aug 5 16:46:10 2016

Further the -E output shows Events: 0 which means the array you're
showing us has never been used. The only explanation I can think of is
you used mdadm --create, but something significant is missing from
your explanation because there is no possible way a normal mount or
assemble changes the Creation Time of the array, or the Event count.
But --create will do that.

-- 
Chris Murphy


* Re: Can't mount Old RHEL 6 Raid with new install of CentOS 7, now can't mount with original RHEL 6
From: Chris Murphy @ 2016-08-16 19:09 UTC (permalink / raw)
  To: John Dawson; +Cc: Chris Murphy, Linux-RAID
In-Reply-To: <02fd3a31-b107-464d-aa58-2ccd5660d974@Spark>

On Tue, Aug 16, 2016 at 12:59 PM,  <linux@celticblues.com> wrote:
> I tried to assemble the raid array again. If that doesn't explain it, then I
> don't know.

Why would trying to assemble the raid change anything? The array is
assembling correctly per the metadata on those two drives. The fact
that there's no ext4 signature on that array and that its creation
date is 10 days ago suggests that the original array used a
metadata version other than 1.2, which put the ext4 superblock in the
location where, 10 days ago, mdadm --create overwrote it with an
mdadm 1.2 superblock.

There are a lot of "fix me" guides out there that suggest doing mdadm
--create, and 99% of the time it's really horrifically bad advice that
causes data loss that looks an awful lot like what's going on here.

> The two drives that make up the array haven't been
> touched/modified.

Obviously that's not true.

The only possible chance to recover this is if you have an exact
sequence of events modifying the drive, so that they can maybe be
reversed. Without that, any change will do more damage and decrease
the chance of recovery. But seeing as this is a raid0 I suspect you
have a complete backup of the drive anyway so you're probably better
off just starting over with it.

-- 
Chris Murphy


* Re: kernel checksumming performance vs actual raid device performance
From: Matt Garman @ 2016-08-16 19:44 UTC (permalink / raw)
  To: Doug Dumitru, Mdadm
In-Reply-To: <CAFx4rwQj3_JTNiS0zsQjp_sPXWkrp0ggjg_UiR7oJ8u0X9PQVA@mail.gmail.com>

Hi Doug & linux-raid list,

On Tue, Jul 12, 2016 at 9:10 PM, Doug Dumitru <doug@easyco.com> wrote:
> You might want to try running "perf" on your system while it is degraded and
> see where the thread is churning.  I would love to see those results.  I
> would not be surprised to see that the thread is literally "spinning".  If
> so, then the 100% cpu is probably fixable, but it won't actually help
> performance.

I sat on your email for a while, as the machine in question was (is)
production, and we don't have any useful downtime windows to
experiment.  But now we have a second, identical machine.  It
eventually needs to go into production as well, but for now we have
some time to test.

My understanding of "perf" is that it analyzes an individual process.
Would you be willing to elaborate on how I might use it while the
rebuild is taking place?

> In terms of single drive missing performance with short reads, you are mostly
> at the mercy of short read IOPS.  If your array is reading 8K blocks at
> 2GB/sec, this is at 250,000 IOPS and you kill off a drive, it will jump to
> 500,000 IOPS.  Reading from the good drives remains as single reads, but
> read from the missing drives require reads from all of the others (with
> raid-5, all but one).  I am not sure how the recovery thread issues these
> recovery read.  Hopefully, it blasts them at the array with abandon (ie,
> submit all 22 requests concurrently), but the code might be less aggressive
> in deference to hard disks.  SSDs love deep queue depths.
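
The figures quoted above can be sanity-checked with a little arithmetic.
This is a rough sketch under stated assumptions, not a model of what md
actually does: decimal units (2 GB/s = 2,000,000,000 bytes/s, 8K = 8000
bytes), reads spread evenly over all members, and a 24-member array in
which rebuilding one missing chunk takes 22 reads (taken from the "22
requests" remark). Both function names are made up for illustration.

```shell
# Healthy read IOPS: throughput in bytes/s divided by request size in bytes.
healthy_read_iops() {
    echo $(( $1 / $2 ))
}

# Back-end read IOPS with one member missing.
# $1 = healthy read IOPS, $2 = member count,
# $3 = reads needed to reconstruct one chunk of the missing member.
degraded_read_iops() {
    base=$1 members=$2 rebuild_reads=$3
    # (members - 1)/members of the reads hit a good drive and cost one I/O;
    # 1/members hit the missing drive and cost rebuild_reads I/Os each.
    echo $(( base * (members - 1 + rebuild_reads) / members ))
}
```

`healthy_read_iops 2000000000 8000` prints 250000, and
`degraded_read_iops 250000 24 22` prints 468750, which is close to the
doubling described above.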

I may be jumping ahead a little, but I wonder if there are tuning
parameters that make sense for an array such as this, given the
read-dominant (effectively WORM) workload?  In particular, things like
block-level read-ahead, IO scheduler, queue depth, etc.  I know the
standard answer for these is "test and see" but we don't have a second
100-machine compute farm to test with.  It's quite hard to simulate
such a workload.

> 1)  Consider a single CPU socket solution, like an E5-1650 v3.  Multi-socket
> CPUs introduce NUMA and a whole slew of "interesting" system contention
> issues.

I think that's a good idea, but I wanted to have two identical systems.

> 2)  Use good HBAs that are direct connected to the disks.  I like LSI 3008
> and the newer 16-port version, although you need to use only 12 ports with
> 6GBit SATA/SAS to keep from over-running the PCI-e slot bandwidth.

We have three LSI MegaRAID SAS-3 3108 9361-8i controllers per system.
8 ports per card.  Drives are indeed direct-connected.  (Technically
there is a backplane, but it's not an expander, just a pass-through
backplane for neat cabling.)

> 3)  Do everything you can to hammer deep queue depths.

Can you elaborate on that?

> 4)  Setup IRQ affinity so that the HBAs spread their IRQ requests across
> cores.

We have spent a lot of time tuning the NIC IRQs, but have not yet
spent any time on the HBA IRQs.  Will do.
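
For the HBA side, spreading interrupts could look something like the
sketch below. The helper names are invented, the IRQ numbers in the usage
note are hypothetical (the real ones come from /proc/interrupts), and the
mask helper only handles CPU numbers below 32. It prints a plan rather
than writing anything, so it is safe to run anywhere.

```shell
# Hex mask with only the given CPU's bit set, in the form that
# /proc/irq/<irq>/smp_affinity accepts. Sketch limit: CPU numbers 0-31.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Assign a list of IRQ numbers round-robin across the first N CPUs.
# $1 = number of CPUs to use, remaining arguments = IRQ numbers.
# Prints one "irq mask" pair per line instead of applying it.
plan_irq_affinity() {
    ncpus=$1; shift
    i=0
    for irq in "$@"; do
        echo "$irq $(cpu_mask $(( i % ncpus )))"
        i=$(( i + 1 ))
    done
}
```

`plan_irq_affinity 2 40 41 42` prints "40 1", "41 2" and "42 1". Applying
a pair means writing the mask to /proc/irq/<irq>/smp_affinity as root; a
running irqbalance daemon may rewrite it afterwards.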

> You can probably mitigate the amount of degradation by lowering the rebuild
> speed, but this will make the rebuild take longer, so you are messed up
> either way.  If the server has "down time" at night, you might lower the
> rebuild to a really small value during the day, and up it at night.

I'll have to discuss with my colleagues, but we have the impression
that the max rebuild speed parameter is more of a hint than an actual
"hard" setting.  That is, we tried to do exactly what you suggest:
defer most rebuild work to after-hours when the load was lighter (and
no one would notice).  But we were unable to stop the rebuild from
basically completely crippling the NFS performance during the day.

"Messed up either way" is indeed the right conclusion here.  But I
think we have some bottleneck somewhere that is artificially hurting,
making things worse than they could/should be.

Thanks again for the thoughtful feedback!

-Matt


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Dumitru @ 2016-08-16 22:51 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm
In-Reply-To: <CAJvUf-C7CMqUBe+kPaxxTtmvuCXSRes73kMcAnZw7x8=6DNCdw@mail.gmail.com>

Matt,

One last thing I would highly recommend is:

Secure erase the replacement disk before rebuilding onto it.

If the replacement disk is "pre conditioned" with random writes, even
if very slowly, this will lower the write performance of the disk
during the rebuild.



-- 
Doug Dumitru
EasyCo LLC


* Re: kernel checksumming performance vs actual raid device performance
From: Adam Goryachev @ 2016-08-17  0:27 UTC (permalink / raw)
  To: doug, Matt Garman; +Cc: Mdadm
In-Reply-To: <CAFx4rwQ4FGpR_X+rz2b0pFUVzOR_OL0UCRAiYZkRMAi0n7+rGA@mail.gmail.com>

On 17/08/16 08:51, Doug Dumitru wrote:
> Matt,
>
> One last thing I would highly recommend is:
>
> Secure erase the replacement disk before rebuilding onto it.
>
> If the replacement disk is "pre conditioned" with random writes, even
> if very slowly, this will lower the write performance of the disk
> during the rebuild.
>
> On Tue, Aug 16, 2016 at 12:44 PM, Matt Garman <matthew.garman@gmail.com> wrote:
>> Hi Doug & linux-raid list,
>>
>> On Tue, Jul 12, 2016 at 9:10 PM, Doug Dumitru <doug@easyco.com> wrote:
>>
>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>> speed, but this will make the rebuild take longer, so you are messed up
>>> either way.  If the server has "down time" at night, you might lower the
>>> rebuild to a really small value during the day, and up it at night.
>> I'll have to discuss with my colleagues, but we have the impression
>> that the max rebuild speed parameter is more of a hint than an actual
>> "hard" setting.  That is, we tried to do exactly what you suggest:
>> defer most rebuild work to after-hours when the load was lighter (and
>> no one would notice).  But we were unable to stop the rebuild from
>> basically completely crippling the NFS performance during the day.
>>
>> "Messed up either way" is indeed the right conclusion here.  But I
>> think we have some bottleneck somewhere that is artificially hurting,
>> making things worse than they could/should be.
>>
>> Thanks again for the thoughtful feedback!
>>
>> -Matt
Sorry, probably messed up the quoting/attribution, but I don't think 
that is too important here.
You should find that the max value is in fact an upper bound, so the 
re-sync will *try* to limit the speed to this value, and the minimum is 
a lower bound. However, if you set the minimum too high, then the system 
(as a whole) may not be able to achieve that, and so the resync speed 
might be lower.

I don't think I've seen a case where the resync speed is much higher 
than the max value. Of course, even a small resync speed could have a 
big impact on performance (due to extra seeks on the disks moving to the 
resync area away from the active read/write workload area)...

I think there is also an option to completely stop the resync from 
progressing (changes it to "pending" status), but maybe someone else 
on-list can comment about that. You might be able to totally stop the 
resync during the day, and then set an outage period to stop your 
workload and allow the system to run the resync at maximum speed (just 
set the max value to a really large number).
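
A sketch of the day/night approach, assuming the per-array md sysfs
files sync_speed_max and sync_action behave as I understand them
(writing "frozen" to sync_action holds off sync activity and "idle"
releases it). Please check that against the kernel's md documentation
before relying on it. The function names and the SYSFS_ROOT override are
made up for illustration; the override just lets the helpers be tried
against a scratch directory.

```shell
# Root of sysfs; overridable so the helpers can be pointed at a scratch tree.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Cap the resync speed of one array. $1 = array name (e.g. md0),
# $2 = limit in the same units as /proc/sys/dev/raid/speed_limit_max.
set_resync_max() {
    echo "$2" > "$SYSFS_ROOT/block/$1/md/sync_speed_max"
}

# Hold off or release sync activity. $1 = array name,
# $2 = "frozen" to pause, "idle" to let md carry on (assumed semantics).
set_sync_action() {
    echo "$2" > "$SYSFS_ROOT/block/$1/md/sync_action"
}
```

During the day that would be `set_sync_action md0 frozen`, and in the
outage window `set_sync_action md0 idle` followed by
`set_resync_max md0 <a really large number>`, all as root.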

Sorry, but I'm not an mdadm expert, just sharing my experiences with 
dealing with similar issues.

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: Shaohua Li @ 2016-08-17  1:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li
In-Reply-To: <87h9aqluoa.fsf@notabene.neil.brown.name>

On Fri, Aug 12, 2016 at 10:04:05AM +1000, Neil Brown wrote:
> On Sat, Aug 06 2016, Shaohua Li wrote:
> 
> > On Thu, Aug 04, 2016 at 01:16:49PM +1000, Neil Brown wrote:
> >> On Wed, Aug 03 2016, NeilBrown wrote:
> >> 
> >> > [ Unknown signature status ]
> >> > On Sun, Jul 31 2016, shli@kernel.org wrote:
> >> >
> >> >> From: Shaohua Li <shli@fb.com>
> >> >>
> >> >> .quiesce is called with the mddev lock held in most places. There are a few
> >> >> exceptions. Calling .quiesce without the lock held could create races. For
> >> >> example, the .quiesce of raid1 can't be called recursively. The purpose of the patches
> >> >> is to fix a race in raid5-cache. The raid5-cache .quiesce will write the md
> >> >> superblock and should be called with the mddev lock held.
> >> >>
> >> >> Cc: NeilBrown <neilb@suse.com>
> >> >> Signed-off-by: Shaohua Li <shli@fb.com>
> >> >
> >> > Acked-by: NeilBrown <neilb@suse.com>
> >> >
> >> > This should be safe but I'm not sure I really like it.
> >> > The raid1 quiesce could be changed so that it can be called recursively.
> >> > The raid5-cache situation would be harder to get right and maybe this is
> >> > the best solution... It's just that 'quiesce' should be a fairly
> >> > light-weight operation, just waiting for pending requests to flush.  It
> >> > shouldn't really *need* a lock.
> >> 
> >> Actually, the more I think about this, the less I like it.
> >> 
> >> I would much rather make .quiesce lighter weight so that no locking was
> >> needed.
> >> 
> >> For r5l_quiesce, that probably means removing the "r5l_do_reclaim()".
> >> Stopping and restarting the reclaim thread seems reasonable, but calling
> >> r5l_do_reclaim() should not be needed.  It should be done periodically
> >> by the thread, and at 'stop' time, but otherwise isn't needed.
> >> You would need to hold some mutex while calling md_register_thread, but
> >> that could probably be log->io_mutex, or maybe even some other new
> >> mutex
> >
> > We will have the same deadlock issue with just stopping/restarting the reclaim
> > thread, because stopping the thread will wait for the thread, which is probably
> > doing r5l_do_reclaim and writing the superblock. Since we are writing the
> > superblock, we must hold the reconfig_mutex.
> 
> When you say "writing the superblock" you presumably mean "blocked in
> r5l_write_super_and_discard_space(), waiting for  MD_CHANGE_PENDING to
> be cleared" ??
right
> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
> ->quiesce to be set, and then exit gracefully.

Can you give details about this please? .quiesce is called with reconfig_mutex
held, so MD_CHANGE_PENDING will never get cleared.

> >
> > Letting raid5_quiesce call r5l_do_reclaim gives us a clean log. Just
> > stopping/restarting the reclaim thread can't guarantee this, as it's possible
> > some space isn't reclaimed yet. A clean log will simplify a lot of things, for
> > example when we change the layout of the array. The log doesn't need to remember
> > which part is for the old layout and which part is for the new layout.
> 
> I really think you are putting too much functionality into quiesce.
> When we change the shape of the array, we do much more than just
> quiesce it.  We also call check_reshape and start_reshape etc.
> They are called with reconfig_mutex held and it would be perfectly
> appropriate to finish off the r5l_do_reclaim() work in there.

This makes sense. But I don't think we need to worry that finishing off
r5l_do_reclaim() there does too much work. In most cases, stopping the reclaim
thread will already have finished all the reclaim.

> >
> > I think we can add a new parameter for .quiesce to indicate if reconfig_mutex
> > is held. raid5_quiesce can check the parameter and take reconfig_mutex if
> > necessary.
> 
> Adding a new parameter because it happens to be convenient in one case
> is not necessarily a good idea.  It is often a sign that the interface
> isn't well designed, or isn't well understood, or is being used poorly.
> 
> I really really don't think ->quiesce() should care about whether
> reconfig_mutex is held.  All it should do is drain all IO and stop new
> IO so that other threads can do unusual things in race-free ways.

I agree this isn't a good interface, but I don't have a better solution for
this issue. Ignore reshape for now. It's possible for .quiesce and the reclaim
thread to deadlock. One thread holds reconfig_mutex and calls raid5_quiesce(),
which waits for IO to finish. The reclaim thread has to write the superblock
(which waits for reconfig_mutex) and free log space before the IO writes can
finish. So the first thread holds reconfig_mutex and waits for the reclaim
thread to finish the IO, while the reclaim thread waits for reconfig_mutex.

Thanks,
Shaohua


* Re: [PATCH] md: do not count journal as spare in GET_ARRAY_INFO
From: Shaohua Li @ 2016-08-17  1:33 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, yizhan, Shaohua Li
In-Reply-To: <1470960885-2860586-1-git-send-email-songliubraving@fb.com>

On Thu, Aug 11, 2016 at 05:14:45PM -0700, Song Liu wrote:
> GET_ARRAY_INFO counts journal as spare (spare_disks), which is not
> accurate. This patch fixes this.
> 
> Reported-by: Yi Zhang <yizhan@redhat.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

Applied, thanks.


* Re: [PATCH V2 00/10] The latest changes for md-cluster
From: Shaohua Li @ 2016-08-17  1:49 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli
In-Reply-To: <1470980563-26062-1-git-send-email-gqjiang@suse.com>

On Fri, Aug 12, 2016 at 01:42:33PM +0800, Guoqing Jiang wrote:
> This version addresses comments from Shaohua, and the 7th
> patch is added as well.

Applied this series, thanks!



* How to get md to attempt to scrub blocks in its bad blocks list
From: Tim Small @ 2016-08-17 14:59 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Hello,

I have had a look around at the other md arrays I manage following the
problems I've recently written about...

I have a machine with a single local SATA disk (an old laptop which only
has capacity for a single disk, and is used for backups), and a remote
iSCSI disk.  Both devices are members of a RAID1, with the remote drive
having the write-mostly flag set.

The iSCSI device had a write error during a resync, due to a transient
memory allocation failure on the remote machine.

I've cleared the write_error and want_replacement flags for the device
according to the kernel md docs, but can't see how to provoke md into
attempting to scrub the bad blocks?

Also in general, are there any mechanisms to get md to retry for longer?
I've done this so far:

ISCSIDISK=$(basename "$(readlink \
    /dev/disk/by-id/wwn-0x60014059ba55d40d7fc416d928211f5b)")
echo 600 > /sys/block/${ISCSIDISK}/device/timeout

in an attempt to work around transient write errors due to memory
allocation failures, target machine reboots, network errors etc.

Should I be doing anything else (e.g. can I configure retries for failed
writes)?

Cheers,

Tim.


* Re: How to get md to attempt to scrub blocks in its bad blocks list
From: Adam Goryachev @ 2016-08-17 16:05 UTC (permalink / raw)
  To: Tim Small, linux-raid@vger.kernel.org
In-Reply-To: <b6c01e91-9786-9705-a452-253794f65bc6@buttersideup.com>



On 18/08/2016 00:59, Tim Small wrote:
> Hello,
>
> I have had a look around at the other md arrays I manage following the
> problems I've recently written about...
>
> I have a machine with a single local SATA disk (an old laptop which only
> has capacity for a single disk, and is used for backups), and a remote
> iSCSI disk.  Both devices are members of a RAID1, with the remote drive
> having the write-mostly flag set.
>
> The iSCSI device had a write error during a resync, due to a transient
> memory allocation failure on the remote machine.
>
> I've cleared the write_error and want_replacement flags for the device
> according to the kernel md docs, but can't see how to provoke md into
> attempting to scrub the bad blocks?
>
> Also in general, are there any mechanisms to get md to retry for longer?
> I've done this so far:
>
> ISCSIDISK=$(basename "$(readlink \
>     /dev/disk/by-id/wwn-0x60014059ba55d40d7fc416d928211f5b)")
> echo 600 > /sys/block/${ISCSIDISK}/device/timeout
>
> in an attempt to work-around transient write errors due to memory
> allocation failures, target machine reboots, network errors etc.
>
> Should I be doing anything else (e.g. can I configure retries for failed
> writes)?
>
In the past I've done this with NBD and eNBD (Network Block Device and 
Enhanced...), both worked really well. Any remote error/problem meant 
that the remote disk was failed. Fixing the remote issue and then 
re-adding it to the array (with the bitmap enabled) meant a quick resync.

More recently, I use DRBD for the same thing, it seems to be a lot more 
reliable, automatic, etc (8.4.x version, the 9.0.x versions are still a 
bit experimental in my experience).

So, I can't directly answer your questions, other than to suggest you 
consider looking at other technologies that are more directly designed 
to do that.

Regards,
Adam


* Re: [PATCH] raid10: record correct address of bad block
From: Shaohua Li @ 2016-08-17 16:52 UTC (permalink / raw)
  To: Tomasz Majchrzak
  Cc: linux-raid, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz
In-Reply-To: <1470992633-8903-1-git-send-email-tomasz.majchrzak@intel.com>

On Fri, Aug 12, 2016 at 11:03:53AM +0200, Tomasz Majchrzak wrote:
> For failed write request record block address on a device, not block
> address in an array.
> 
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  drivers/md/raid10.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index cfa96b5..d18b26d 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2449,6 +2449,7 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
>  
>  	int block_sectors;
>  	sector_t sector;
> +	sector_t data_offset;
>  	int sectors;
>  	int sect_to_write = r10_bio->sectors;
>  	int ok = 1;
> @@ -2462,6 +2463,7 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
>  	sectors = ((r10_bio->sector + block_sectors)
>  		   & ~(sector_t)(block_sectors - 1))
>  		- sector;
> +	data_offset = choose_data_offset(r10_bio, rdev);
>  
>  	while (sect_to_write) {
>  		struct bio *wbio;
> @@ -2471,13 +2473,12 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
>  		wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
>  		bio_trim(wbio, sector - bio->bi_iter.bi_sector, sectors);
>  		wbio->bi_iter.bi_sector = (r10_bio->devs[i].addr+
> -				   choose_data_offset(r10_bio, rdev) +
> -				   (sector - r10_bio->sector));
> +				   data_offset + (sector - r10_bio->sector));
>  		wbio->bi_bdev = rdev->bdev;
>  		if (submit_bio_wait(WRITE, wbio) < 0)
>  			/* Failure! */
> -			ok = rdev_set_badblocks(rdev, sector,
> -						sectors, 0)
> +			ok = rdev_set_badblocks(rdev, wbio->bi_iter.bi_sector -
> +						data_offset, sectors, 0)

bi_iter is not guaranteed to stay unchanged after submit_bio, so please
don't use it here.
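
To illustrate the arithmetic, here is a minimal sketch in Python (a toy model,
not kernel code). The function names and the numbers are hypothetical; only the
formula is taken from the hunk above. The point is that the value handed to
rdev_set_badblocks() can be computed before the bio is submitted, so nothing
has to be read back from wbio afterwards.

```python
# Toy model of the address arithmetic in narrow_write_error().
# All names and numbers are illustrative assumptions, not kernel API.

def device_sector(dev_addr: int, r10_sector: int, sector: int) -> int:
    """Sector on the member device, before data_offset is applied."""
    return dev_addr + (sector - r10_sector)


def on_disk_sector(dev_addr: int, r10_sector: int, sector: int,
                   data_offset: int) -> int:
    """Sector the write bio is actually aimed at (wbio->bi_iter.bi_sector)."""
    return device_sector(dev_addr, r10_sector, sector) + data_offset


# Hypothetical values for one failed chunk of a write.
dev_addr, r10_sector, sector, data_offset = 4096, 1_000_000, 1_000_008, 2048

# Compute the device address up front, before "submitting" anything.
wsector = device_sector(dev_addr, r10_sector, sector)
target = on_disk_sector(dev_addr, r10_sector, sector, data_offset)

# The patch records (bi_sector - data_offset); that is the same number.
assert target - data_offset == wsector
print("record bad block at device sector", wsector)
```

Keeping the pre-computed value in a local variable gives the same result as the
subtraction in the patch without touching the bio after submission.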

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH] md: don't print the same repeated messages about delayed sync operation
From: Shaohua Li @ 2016-08-17 17:19 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <20160816122608.14435-1-artur.paszkiewicz@intel.com>

On Tue, Aug 16, 2016 at 02:26:08PM +0200, Artur Paszkiewicz wrote:
> This fixes a long-standing bug that caused a flood of messages like:
> "md: delaying data-check of md1 until md2 has finished (they share one
> or more physical units)"
> 
> It can be reproduced like this:
> 1. Create at least 3 raid1 arrays on a pair of disks, each on different
>    partitions.
> 2. Request a sync operation like 'check' or 'repair' on 2 arrays by
>    writing to their md/sync_action attribute files. One operation should
>    start and one should be delayed and a message like the above will be
>    printed.
> 3. Issue a write to the third array. Each write will cause 2 copies of
>    the message to be printed.
> 
> This happens when wake_up(&resync_wait) is called, usually by
> md_check_recovery(). Then the delayed sync thread again prints the
> message and is put to sleep. This patch adds a check in md_do_sync() to
> prevent printing this message more than once for the same pair of
> devices.
> 
> Reported-by: Sven Koehler <sven.koehler@gmail.com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/md.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2c3ab6f..5096b48 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7862,6 +7862,7 @@ void md_do_sync(struct md_thread *thread)
>  	 */
>  
>  	do {
> +		int mddev2_minor = -1;
>  		mddev->curr_resync = 2;
>  
>  	try_again:
> @@ -7891,10 +7892,14 @@ void md_do_sync(struct md_thread *thread)
>  				prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
>  				if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>  				    mddev2->curr_resync >= mddev->curr_resync) {
> -					printk(KERN_INFO "md: delaying %s of %s"
> -					       " until %s has finished (they"
> -					       " share one or more physical units)\n",
> -					       desc, mdname(mddev), mdname(mddev2));
> +					if (mddev2_minor != mddev2->md_minor) {
> +						mddev2_minor = mddev2->md_minor;
> +						printk(KERN_INFO "md: delaying %s of %s"
> +						       " until %s has finished (they"
> +						       " share one or more physical units)\n",
> +						       desc, mdname(mddev),
> +						       mdname(mddev2));
> +					}
>  					mddev_put(mddev2);
>  					if (signal_pending(current))
>  						flush_signals(current);

applied, thanks!
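
For anyone skimming the thread, the check the patch adds can be modelled
roughly as below. This is a Python sketch, not the kernel code: the function
name and the list of wake-ups are made up for illustration, and only the
"remember the last blocking minor" idea comes from the diff.

```python
# Toy model: one entry per wake-up of the delayed sync thread, holding the
# minor number of the array that is currently blocking it (hypothetical input).

def delayed_messages(blocking_minors):
    """Return the messages that would be printed, one per change of blocker."""
    last_minor = -1  # mirrors "int mddev2_minor = -1" in the patch
    printed = []
    for minor in blocking_minors:
        if minor != last_minor:
            last_minor = minor
            printed.append(f"md: delaying sync until md{minor} has finished")
    return printed


# Three wake-ups while md2 is still running produce a single message.
print(delayed_messages([2, 2, 2]))
```

A different blocking array (a different minor) still triggers a new message, so
no information is lost compared to the old behaviour.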

^ permalink raw reply

* Adding journal to existing raid5 array
From: Maarten van Malland @ 2016-08-17 20:20 UTC (permalink / raw)
  To: linux-raid

I was trying to add a journal to a previously created raid5 array with:

mdadm --add-journal /dev/md3 /dev/sdc

That gave an error though: "/dev/md3 does not support journal device
/dev/sdc". I'm guessing that happens because the initial array was created
without journal support? Is there anything I can do to get this to work
without recreating the entire array?

^ permalink raw reply

* Re: Adding journal to existing raid5 array
From: Song Liu @ 2016-08-17 21:46 UTC (permalink / raw)
  To: maartenvanmalland; +Cc: linux-raid, Song Liu
In-Reply-To: <CAC8wJ3FYKvdoLNmOerTgK+yd7p8ZhPXDLW0Z6V6Q+QpJb9MDMA@mail.gmail.com>

Currently, --add-journal does not support adding a journal to an existing
array. We plan to add that soon.

Song


^ permalink raw reply

* Rewrite md raid1 member
From: Chris Dunlop @ 2016-08-18  3:04 UTC (permalink / raw)
  To: linux-raid

G'day all,

What options are there to safely rewrite a disk that's part of a live MD
raid1?

Specifically, I have smartctl reporting a Current_Pending_Sector of 360 on a
member of a raid1 set.

A 'check' of the raid comes up clean. I'd like to see if I can clear the
pending sector count by rewriting the sectors. Whilst rewriting just those
sectors would be ideal, I don't know which they are, so it looks like a
whole disk write is the way to go.

I realise the safest way to fix this is using a spare disk and doing a
replace, allowing me to play with the "pending sector" disk to my heart's
content, but I'm also interested to see if it can be done safely on a live
system...

If the system had a spare hot swap disk bay, and I had a spare disk, I could
add another disk to the system and do the replace.

If I were happy to lose redundancy during the process, I could remove the
disk from the raid, wipe the superblock, add it again, and let it rebuild
the whole raid.

If it weren't the root filesystem, the filesystem could be taken offline
whilst doing the rebuild above to reduce the chance of the lost redundancy
producing undesirable results, but there's still the risk of problems
cropping up on the "good" disk during the rebuild.

If I were happy to wear the down time, I could boot into a rescue disk to do
it.

Another option might be to "dd" from the "good" disk:

dd if=/dev/sda of=/dev/sdb

...except that will put the wrong superblock on there. Using the same disk
for the src and dst might be an option:

dd if=/dev/sdb of=/dev/sdb

...but the seeking would kill the throughput. Perhaps a large blocksize
might help, e.g. bs=64K. Or, there could be some dance of 'dd'ing from the
same disk for the superblock, and 'dd'ing from the other disk for the bulk
data, using the Super Offset and Data Offset from "mdadm -E".

However using 'dd' allows for a window where dd reads data A from sda:X
(sector X), then the system writes data B to md0:X (i.e. to both sda:X and
sdb:X), then dd writes data A to sdb:X, putting the raid out of sync.
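
To make that window concrete, here is a toy model of the interleaving in
Python. It is only a sketch: the two dictionaries stand in for the disks, and
nothing here reflects how md or dd are actually implemented.

```python
# Toy model of the race: each "disk" is a dict mapping sector -> data.

def simulate_race(sector="X"):
    sda = {sector: "A"}
    sdb = {sector: "A"}

    buf = sda[sector]      # 1. dd reads data A from sda:X
    sda[sector] = "B"      # 2. the system writes data B to md0:X,
    sdb[sector] = "B"      #    which lands on both sda:X and sdb:X
    sdb[sector] = buf      # 3. dd writes its stale copy of A to sdb:X

    return sda[sector], sdb[sector]


sda_data, sdb_data = simulate_race()
print("sda:X =", sda_data, " sdb:X =", sdb_data)
```

The mirrors end up holding B and A respectively, and both halves read back
without error, which is why a later 'repair' has no obvious way to tell which
copy is the current one.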

This could potentially be fixed by doing a 'repair' of the raid, except
that, as both sda and sdb are returning data but not the same data, it's
possible this will preserve the wrong data (i.e. write the old data A from
sdb:X to sda:X instead of writing the new data B from sda:X to sdb:X).

In this circumstance, how does md decide which is the "good" data? Is there
a way of specifying "in the case of discrepancies, trust sda"?

Perhaps, before writing to sdb, setting it to "blocked" is the right thing to
do? I.e.:

echo "blocked" > /sys/block/md0/md/dev-sdb1/state
[ dd stuff per above ]
echo "-blocked" > /sys/block/md0/md/dev-sdb1/state

Per linux/Documentation/md.txt:
----
    Writing "blocked" sets the "blocked" flag.
    Writing "-blocked" clears the "blocked" flags and allows writes
            to complete and possibly simulates an error.
----

I can't find anything that tells me what this actually does in practice. I'm
guessing setting it to "blocked" will stop md writing to that device but
otherwise allow the md device to function normally, and setting it to
"-blocked" will allow writes to proceed and the md device will then use the
write-intent bitmap to copy over any writes that were blocked.

And what does "...and possibly simulates an error" imply?

Or is this 'dd' stuff just nuts, a case of "well that's a novel way of
trashing your data..." and/or "you're welcome to try, but you get to keep
all the pieces and don't come crying to us for help!"?

Thanks for any insights into this!

Cheers,

Chris

^ permalink raw reply

* Re: Rewrite md raid1 member
From: Brad Campbell @ 2016-08-18  3:27 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20160818030451.GA17225@onthe.net.au>

On 18/08/16 11:04, Chris Dunlop wrote:
> G'day all,
>
> What options are there to safely rewrite a disk that's part of a live MD
> raid1?
>
> Specifically, I have smartctl reporting a Current_Pending_Sector of 360 on a
> member of a raid1 set.
>
> A 'check' of the raid comes up clean. I'd like to see if I can clear the
> pending sector count by rewriting the sectors. Whilst rewriting just those
> sectors would be ideal, I don't know which they are, so it looks like a
> whole disk write is the way to go.
>

A smartctl -t long on the drive will error out at the first problematic 
sector and put that LBA in the SMART log, so there's a start.

Another way to determine it is to run dd from the drive; it will abort
on the first error, telling you how many records it managed to copy. With
the default bs of 512, that gives you a sector number.
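
As a rough sketch of that arithmetic (Python, hypothetical helper name), and
assuming dd reports the number of full records it copied before the error and
that bs is a multiple of the sector size:

```python
# Map dd's "N+0 records in" count to the range of 512-byte sectors that the
# failing record covers. Helper name and defaults are illustrative.

def failing_sector_range(records_copied: int, bs: int = 512,
                         sector_size: int = 512) -> tuple[int, int]:
    if bs % sector_size:
        raise ValueError("bs must be a multiple of the sector size")
    sectors_per_record = bs // sector_size
    first = records_copied * sectors_per_record
    last = first + sectors_per_record - 1
    return first, last


# With the default bs=512 the record count is the sector number itself.
print(failing_sector_range(1000))

# With bs=64K the bad sector is somewhere inside a 128-sector window.
print(failing_sector_range(10, bs=65536))
```

So a larger bs reads faster but only narrows the problem down to a window;
re-reading that window with bs=512 pins down the exact sector.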

> Or is this 'dd' stuff just nuts, a case of "well that's a novel way of
> trashing your data..." and/or "you're welcome to try, but you get to keep
> all the pieces and don't come crying to us for help!"?

Pretty much. If a RAID check is not touching them, then they are likely 
in the vacant area around the superblock. Nothing touches that, and 
playing with it can lead to tears if you misfire and hit the superblock 
or the data.

If the superblock is ok, and the errors are outside of the data area,
I've taken a drive out of the array, used dd_rescue to clone the area of
the drive in question, written that back to the disk and then re-added
the drive to the array. That just re-writes the good data, with zeros
where the bad sectors were.

That is a horrible, horrible procedure that I did on an array I use for
testing and that has no valuable data on it. I would not recommend it if
you care about your array or data.

Brad

^ permalink raw reply

* Re: Rewrite md raid1 member
From: Chris Dunlop @ 2016-08-18  4:01 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid
In-Reply-To: <2dca8e1f-8e80-408f-900e-36f9b1dd6f95@fnarfbargle.com>

On Thu, Aug 18, 2016 at 11:27:55AM +0800, Brad Campbell wrote:
> On 18/08/16 11:04, Chris Dunlop wrote:
>> G'day all,
>>
>> What options are there to safely rewrite a disk that's part of a live MD
>> raid1?
>>
>> Specifically, I have smartctl reporting a Current_Pending_Sector of 360 on a
>> member of a raid1 set.
>>
>> A 'check' of the raid comes up clean. I'd like to see if I can clear the
>> pending sector count by rewriting the sectors. Whilst rewriting just those
>> sectors would be ideal, I don't know which they are, so it looks like a
>> whole disk write is the way to go.
> 
> A smartctl -t long on the drive will error out at the first problematic
> sector and put that LBA in the SMART log, so there's a start.

I should have mentioned: a 'smartctl -t long' on the drive came up clean.

> Another way to determine it is run dd from the drive, and it will abort on
> the first error telling you how many records it managed to copy. With the
> default bs of 512, that gives you a sector number.

A 'dd' read of the whole disk also came up clean.

From what I can gather, a "pending sector" is one that's a bit suspect, but
may actually be ok. It seems mine are ok (at least for reading), but the
pending count won't clear until a write succeeds (or fails, and the sector
is remapped).

>> Or is this 'dd' stuff just nuts, a case of "well that's a novel way of
>> trashing your data..." and/or "you're welcome to try, but you get to keep
>> all the pieces and don't come crying to us for help!"?
> 
> Pretty much. If a RAID check is not touching them, then they are likely in
> the vacant area around the superblock. Nothing touches that, and playing
> with it can lead to tears if you misfire and hit the superblock or the data.

Sure - I understand the risks.

> If the superblock is ok, and the errors are outside of the data area I've
> taken a drive out of the array, used dd_rescue to clone the area of the
> drive in question and then written that back to the disk and re-added to the
> array. That just re-writes the good data and with zeros where the bad
> sectors were.
> 
> That is a horrible, horrible procedure that I did on an array I use for
> testing and has no valuable data on. I would not recommend it if you care
> about your array or data.

I'm interested to see if there's a way of essentially doing the above on a
live system, assuming there's appropriate care taken to not trash any
existing data (including superblocks).

I.e. is it *theoretically* possible to write the same data back to the whole
disk safely? E.g. using 'dd' from/to the same disk is almost there, but, as
described, there's a window of opportunity where you could get stale data on
the disk and a raid repair could then copy that stale data to the good disk.

> Brad

Thanks,

Chris

^ permalink raw reply

* Re: Adding journal to existing raid5 array
From: Maarten van Malland @ 2016-08-18  7:51 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid
In-Reply-To: <1471470369-397690-1-git-send-email-songliubraving@fb.com>

Well that explains that then :-). Is it also planned to support
removal of the journal device? I can imagine that there are cases
(such as performance issues) where you would like to revert to the old
situation...

On Wed, Aug 17, 2016 at 11:46 PM, Song Liu <songliubraving@fb.com> wrote:
> Currently, --add-journal does not support adding journal to an existing array,
> We plan to add that soon.
>
> Song
>

^ permalink raw reply

* raid6 algorithm issues with 64K page_size
From: Zhengyuan Liu @ 2016-08-18  8:06 UTC (permalink / raw)
  To: linux-raid
  Cc: Shaohua Li, linux-kernel, ravi.v.shankar, Gayatri Kammela,
	H . Peter Anvin, Jim Kukunas, Fenghua Yu, Megha Dey,
	刘云, 胡海

G'day all,

At boot time the kernel tries to pick the best raid6 algorithm for computing
the two syndromes, generally referred to as P and Q. Part of the selection
code from lib/raid6/algos.c is shown below:

   int __init raid6_select_algo(void)
   {
        const int disks = (65536/PAGE_SIZE)+2;

        const struct raid6_calls *gen_best;
        const struct raid6_recov_calls *rec_best;
        char *syndromes;
        void *dptrs[(65536/PAGE_SIZE)+2];
        int i;

        for (i = 0; i < disks-2; i++)
                dptrs[i] = ((char *)raid6_gfmul) + PAGE_SIZE*i;

        /* Normal code - use a 2-page allocation to avoid D$ conflict */
        syndromes = (void *) __get_free_pages(GFP_KERNEL, 1);

        if (!syndromes) {
                pr_err("raid6: Yikes!  No memory available.\n");
                return -ENOMEM;
        }

        dptrs[disks-2] = syndromes;
        dptrs[disks-1] = syndromes + PAGE_SIZE;

        /* select raid gen_syndrome function */
        gen_best = raid6_choose_gen(&dptrs, disks);

        /* select raid recover functions */
        rec_best = raid6_choose_recov();

The data set used for computing the syndromes is the gfmul table, which is
defined as "u8 raid6_gfmul[256][256]" and is therefore 65536 bytes (64KB) in
size. From the code we can see that the gfmul table size and PAGE_SIZE
determine the number of disks: 18 for a 4K PAGE_SIZE, 10 for 8K, and only 3
for 64K. As we all know, raid6 needs at least 4 disks.

Could we just define a constant macro for the number of disks, as the test
program in lib/raid6/test/test.c does, so that it neither depends on the page
size nor uses the gfmul table as the data source?

Going further, a bigger page size such as 128K would hit the same problem.
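
The disk counts above can be reproduced with a few lines of Python. This is
only a sketch of the arithmetic in "(65536/PAGE_SIZE)+2" using integer
division as C does; the helper name is made up for illustration.

```python
# Reproduce "const int disks = (65536/PAGE_SIZE)+2" for several page sizes.

GFMUL_TABLE_BYTES = 256 * 256  # size of u8 raid6_gfmul[256][256]


def benchmark_disks(page_size: int) -> int:
    """Data pages carved out of the gfmul table, plus the P and Q pages."""
    return GFMUL_TABLE_BYTES // page_size + 2


for kib in (4, 8, 64, 128):
    disks = benchmark_disks(kib * 1024)
    note = "" if disks >= 4 else "  <-- fewer than the 4 disks raid6 needs"
    print(f"PAGE_SIZE={kib:>3}K -> disks={disks}{note}")
```

With a 128K page the division yields 0 data pages, so only the two syndrome
pages remain.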

^ permalink raw reply

* OOM/panic observed during creating RAID4 with DIF/DIX enabled disk
From: Yi Zhang @ 2016-08-18  9:44 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, jmoyer, Jes Sorensen, Xiao Ni
In-Reply-To: <2062708428.2492143.1471511559011.JavaMail.zimbra@redhat.com>

Hello,

I observed an OOM/panic while creating a RAID4 array on DIF/DIX-enabled disks. I used scsi_debug to create the DIF/DIX disk.

Full log:
http://pastebin.com/uP3rJcHX

Environment:
4.8.0-rc2
32GB physical memory

Reproduced at Iteration:19

Reproduce steps:

#!/bin/bash
modprobe scsi_debug dev_size_mb=16384  num_parts=4 dif=1 dix=1
sleep 5
num=0
while [ $num -lt 100 ]; do
	echo "********************Iteration:$num****************"
	mdadm --create --run /dev/md0 --level 4 --metadata 1.2 --raid-devices 4  /dev/sdc[1-4]  --bitmap=internal --bitmap-chunk=64M --chunk 512
	mdadm --wait /dev/md0
	cat /proc/mdstat
	mdadm -Ss
	mdadm --zero-superblock /dev/sdc[1-4]
	sleep 1
	((num++))
done

Best Regards,
  Yi Zhang



^ permalink raw reply

* [PATCH 09/16] md: raid5: Convert to hotplug state machine
From: Sebastian Andrzej Siewior @ 2016-08-18 12:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, rt, Sebastian Andrzej Siewior,
	Neil Brown, linux-raid
In-Reply-To: <20160818125731.27256-1-bigeasy@linutronix.de>

Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

Cc: Neil Brown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/md/raid5.c         | 83 ++++++++++++++++------------------------------
 drivers/md/raid5.h         |  4 +--
 include/linux/cpuhotplug.h |  1 +
 3 files changed, 30 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f79b1a57f69a..40ac4c953ff2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6344,22 +6344,20 @@ static int alloc_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu
 	return 0;
 }
 
+static int raid456_cpu_dead(unsigned int cpu, struct hlist_node *node)
+{
+	struct r5conf *conf = hlist_entry_safe(node, struct r5conf, node);
+
+	free_scratch_buffer(conf, per_cpu_ptr(conf->percpu, cpu));
+	return 0;
+}
+
 static void raid5_free_percpu(struct r5conf *conf)
 {
-	unsigned long cpu;
-
 	if (!conf->percpu)
 		return;
 
-#ifdef CONFIG_HOTPLUG_CPU
-	unregister_cpu_notifier(&conf->cpu_notify);
-#endif
-
-	get_online_cpus();
-	for_each_possible_cpu(cpu)
-		free_scratch_buffer(conf, per_cpu_ptr(conf->percpu, cpu));
-	put_online_cpus();
-
+	cpuhp_state_remove_instance(CPUHP_MD_RAID5_PREPARE, &conf->node);
 	free_percpu(conf->percpu);
 }
 
@@ -6378,64 +6376,28 @@ static void free_conf(struct r5conf *conf)
 	kfree(conf);
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
-static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
-			      void *hcpu)
+static int raid456_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
 {
-	struct r5conf *conf = container_of(nfb, struct r5conf, cpu_notify);
-	long cpu = (long)hcpu;
+	struct r5conf *conf = hlist_entry_safe(node, struct r5conf, node);
 	struct raid5_percpu *percpu = per_cpu_ptr(conf->percpu, cpu);
 
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		if (alloc_scratch_buffer(conf, percpu)) {
-			pr_err("%s: failed memory allocation for cpu%ld\n",
-			       __func__, cpu);
-			return notifier_from_errno(-ENOMEM);
-		}
-		break;
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-		free_scratch_buffer(conf, per_cpu_ptr(conf->percpu, cpu));
-		break;
-	default:
-		break;
+	if (alloc_scratch_buffer(conf, percpu)) {
+		pr_err("%s: failed memory allocation for cpu%u\n",
+		       __func__, cpu);
+		return -ENOMEM;
 	}
-	return NOTIFY_OK;
+	return 0;
 }
-#endif
 
 static int raid5_alloc_percpu(struct r5conf *conf)
 {
-	unsigned long cpu;
 	int err = 0;
 
 	conf->percpu = alloc_percpu(struct raid5_percpu);
 	if (!conf->percpu)
 		return -ENOMEM;
 
-#ifdef CONFIG_HOTPLUG_CPU
-	conf->cpu_notify.notifier_call = raid456_cpu_notify;
-	conf->cpu_notify.priority = 0;
-	err = register_cpu_notifier(&conf->cpu_notify);
-	if (err)
-		return err;
-#endif
-
-	get_online_cpus();
-	for_each_present_cpu(cpu) {
-		err = alloc_scratch_buffer(conf, per_cpu_ptr(conf->percpu, cpu));
-		if (err) {
-			pr_err("%s: failed memory allocation for cpu%ld\n",
-			       __func__, cpu);
-			break;
-		}
-	}
-	put_online_cpus();
-
+	err = cpuhp_state_add_instance(CPUHP_MD_RAID5_PREPARE, &conf->node);
 	if (!err) {
 		conf->scribble_disks = max(conf->raid_disks,
 			conf->previous_raid_disks);
@@ -7967,10 +7929,20 @@ static struct md_personality raid4_personality =
 
 static int __init raid5_init(void)
 {
+	int ret;
+
 	raid5_wq = alloc_workqueue("raid5wq",
 		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
 	if (!raid5_wq)
 		return -ENOMEM;
+	ret = cpuhp_setup_state_multi(CPUHP_MD_RAID5_PREPARE,
+				      "MD_RAID5_PREPARE",
+				      raid456_cpu_up_prepare,
+				      raid456_cpu_dead);
+	if (ret) {
+		destroy_workqueue(raid5_wq);
+		return ret;
+	}
 	register_md_personality(&raid6_personality);
 	register_md_personality(&raid5_personality);
 	register_md_personality(&raid4_personality);
@@ -7982,6 +7954,7 @@ static void raid5_exit(void)
 	unregister_md_personality(&raid6_personality);
 	unregister_md_personality(&raid5_personality);
 	unregister_md_personality(&raid4_personality);
+	cpuhp_remove_multi_state(CPUHP_MD_RAID5_PREPARE);
 	destroy_workqueue(raid5_wq);
 }
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 517d4b68a1be..57ec49f0839e 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -512,9 +512,7 @@ struct r5conf {
 	} __percpu *percpu;
 	int scribble_disks;
 	int scribble_sectors;
-#ifdef CONFIG_HOTPLUG_CPU
-	struct notifier_block	cpu_notify;
-#endif
+	struct hlist_node node;
 
 	/*
 	 * Free stripes pool
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 332b39c21d2e..5811954809af 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -28,6 +28,7 @@ enum cpuhp_state {
 	CPUHP_RELAY_PREPARE,
 	CPUHP_SLAB_PREPARE,
 	CPUHP_RCUTREE_PREP,
+	CPUHP_MD_RAID5_PREPARE,
 	CPUHP_NOTIFY_PREPARE,
 	CPUHP_TIMERS_DEAD,
 	CPUHP_BRINGUP_CPU,
-- 
2.9.3

^ permalink raw reply related

* Re: Adding journal to existing raid5 array
From: Song Liu @ 2016-08-18 16:03 UTC (permalink / raw)
  To: Maarten van Malland; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <CAC8wJ3Fjj=ANPpL8MxN_gmcs721TjkMUOooeEh_giAONFx4q-Q@mail.gmail.com>

Yes, we do plan to add removal as well. 
 
Thanks,
Song

>> On 8/18/16, 12:51 AM, "Maarten van Malland" <maartenvanmalland@gmail.com> wrote:

    Well that explains that then :-). Is it also planned to support
    removal of the journal device? I can imagine that there are cases
    (such as performance issues) that you would like to revert to the old
    situation...
    
    On Wed, Aug 17, 2016 at 11:46 PM, Song Liu <songliubraving@fb.com> wrote:
    > Currently, --add-journal does not support adding journal to an existing array,
    > We plan to add that soon.
    >
    > Song
    >
    


^ permalink raw reply
