* Linux Plumbers MD BOF discussion notes
@ 2017-09-15 14:27 Shaohua Li
  2017-09-15 20:42 ` Coly Li
  2017-09-16  0:08 ` NeilBrown
  0 siblings, 2 replies; 22+ messages in thread
From: Shaohua Li @ 2017-09-15 14:27 UTC (permalink / raw)
  To: linux-raid

This is a short note based on Song's record. Please reply to the list if anything is missing.

*IMSM - PPL
Fix the write hole without an extra device; updated status and upcoming mdadm changes to support it. The Intel guys are improving it, e.g. fixing the current 'disable disk cache' problem.

*Hiding member drives
Hiding RAID array member drives from the user, so an MD RAID array looks more like a hardware RAID array. This turns out to be a real customer requirement. We do need to access the member drives for different reasons (create/assembly, mdmon, iostat). Working around this issue might be possible, e.g. deleting the /dev/xxx nodes after array assembly. But we must justify the value and also discuss this with the block guys, since it is a general issue.

*Block-mq
Converting MD to use blk-mq? md is a bio-based (not request-based) driver, so there is no value in going mq. md dispatches bios directly to the low-level disks. blk-mq still benefits md if a low-level disk supports it, but this is transparent to md.

*NVDIMM caching
NVDIMM supports a block interface, so using it as a raid5 cache disk should be straightforward. Directly storing the raid5 stripe cache in NVDIMM, without the current raid5-cache log device? This has problems, for example how to detect/fix data mismatches after a power failure, and would need major changes in the raid5 code.

*stream ID
Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.

*split/merge problem
The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.

*Testing
md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.

The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!

*RAID-1 RAID-10 barrier inconsistency
Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.

*DAX
Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?

*sysfs / ioctl
Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.

Using configfs instead of sysfs?

*Stop nested RAID device
For example, a raid0 on top of a raid5. Userspace must understand the topology to stop the nested raid arrays. mdadm stop is async; a sync option for stopping an array is needed (clarify?).

*More stable in-kernel API
There are race conditions when accessing md_dev data: the data could change because of a resync. dm-raid needs to get a reliable resync status report. Further discussion is needed here; an email/draft patch could be helpful.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-15 14:27 Linux Plumbers MD BOF discussion notes Shaohua Li
@ 2017-09-15 20:42 ` Coly Li
  2017-09-15 21:20   ` Shaohua Li
  2017-09-16  0:08 ` NeilBrown
  1 sibling, 1 reply; 22+ messages in thread
From: Coly Li @ 2017-09-15 20:42 UTC (permalink / raw)
  To: Shaohua Li, linux-raid

On 2017/9/15 4:27 PM, Shaohua Li wrote:
> This is a short note based on Song's record. Please reply to the list if anything is missing.

[snip]

> *stream ID
> Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.

Can anyone give me a hint about what stream ID is in the context of md raid?

> *split/merge problem
> The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.
>
> *Testing
> md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
>
> The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
>
> *RAID-1 RAID-10 barrier inconsistency
> Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.

Copied, I will handle it.

> *DAX
> Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
>
> *sysfs / ioctl
> Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.

Yes, I will handle the kernel part. The change will be done one by one, step by step. The first step is to unify the code path for both the ioctl and sysfs interfaces. Once I finish the urgent tasks on hand, I will start to handle this. Hopefully this work can start by the end of this year.

> Using configfs instead of sysfs?

Currently it is sysfs, and I feel it is OK as a sysfs interface. Do we have a specific reason or benefit for using configfs?

Thanks for the informative notes, and thank you all for the discussion!

Coly Li

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-15 20:42 ` Coly Li
@ 2017-09-15 21:20   ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2017-09-15 21:20 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-raid

On Fri, Sep 15, 2017 at 10:42:23PM +0200, Coly Li wrote:
> On 2017/9/15 4:27 PM, Shaohua Li wrote:
> > This is a short note based on Song's record. Please reply to the list if anything is missing.
>
> [snip]
>
> > *stream ID
> > Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.
>
> Can anyone give me a hint about what stream ID is in the context of md raid?

I think it's called "write hint" right now; see commit c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints").

> > *split/merge problem
> > The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.
> >
> > *Testing
> > md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
> >
> > The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
> >
> > *RAID-1 RAID-10 barrier inconsistency
> > Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.
>
> Copied, I will handle it.
>
> > *DAX
> > Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
> >
> > *sysfs / ioctl
> > Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.
>
> Yes, I will handle the kernel part. The change will be done one by one, step by step. The first step is to unify the code path for both the ioctl and sysfs interfaces. Once I finish the urgent tasks on hand, I will start to handle this. Hopefully this work can start by the end of this year.
>
> > Using configfs instead of sysfs?
>
> Currently it is sysfs, and I feel it is OK as a sysfs interface. Do we have a specific reason or benefit for using configfs?

No strong reason, just because configfs is generally for configuration. If sysfs works here, I think it's OK.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread
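The interface Shaohua cites here is the per-file write-life-time hint API. A minimal consumer sketch, assuming a kernel that includes commit c75b1d9421f8; the #define fallbacks mirror the uapi values and are only needed with libc headers that predate the fcntls:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallbacks for libc headers that predate the write-hint fcntls. */
#ifndef F_SET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

int main(int argc, char **argv)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Tag all subsequent writes through this inode as short-lived. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");
	close(fd);
	return 0;
}

The raid5 problem described in the notes is that a hint attached at this level can be rewritten by md multiple times (read-modify-write) before reaching the member devices, defeating the hint.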
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-15 14:27 Linux Plumbers MD BOF discussion notes Shaohua Li
  2017-09-15 20:42 ` Coly Li
@ 2017-09-16  0:08 ` NeilBrown
  2017-09-18  4:54   ` Shaohua Li
  2017-09-18  7:04   ` Mikael Abrahamsson
  1 sibling, 2 replies; 22+ messages in thread
From: NeilBrown @ 2017-09-16 0:08 UTC (permalink / raw)
  To: Shaohua Li, linux-raid

[-- Attachment #1: Type: text/plain, Size: 6130 bytes --]

Sounds like an interesting, wide-ranging discussion...

On Fri, Sep 15 2017, Shaohua Li wrote:

> This is a short note based on Song's record. Please reply to the list if anything is missing.
>
> *IMSM - PPL
> Fix the write hole without an extra device; updated status and upcoming mdadm changes to support it. The Intel guys are improving it, e.g. fixing the current 'disable disk cache' problem.
>
> *Hiding member drives
> Hiding RAID array member drives from the user, so an MD RAID array looks more like a hardware RAID array. This turns out to be a real customer requirement. We do need to access the member drives for different reasons (create/assembly, mdmon, iostat). Working around this issue might be possible, e.g. deleting the /dev/xxx nodes after array assembly. But we must justify the value and also discuss this with the block guys, since it is a general issue.

"Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
Do we need to:
 - remove from /proc/partitions - possible and possibly sane
 - remove from /dev - easy, given clear justification
 - remove from /sys/block - I don't think this is justifiable
 - make open() impossible - it already is if you use O_EXCL
 ??

Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.

> *Block-mq
> Converting MD to use blk-mq? md is a bio-based (not request-based) driver, so there is no value in going mq. md dispatches bios directly to the low-level disks. blk-mq still benefits md if a low-level disk supports it, but this is transparent to md.
>
> *NVDIMM caching
> NVDIMM supports a block interface, so using it as a raid5 cache disk should be straightforward. Directly storing the raid5 stripe cache in NVDIMM, without the current raid5-cache log device? This has problems, for example how to detect/fix data mismatches after a power failure, and would need major changes in the raid5 code.
>
> *stream ID
> Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.
>
> *split/merge problem
> The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.

For raid5 I can understand this being an issue, as raid5 only submits PAGE_SIZE bios to lower-level devices.
The batching that Shaohua added might be a good starting point. If you have a batch of stripes, you should be able to submit one bio per device for the whole batch.

For RAID0, I don't understand the problem. RAID0 never splits smaller than the chunk size, and that is a firm requirement.
Maybe RAID0 could merge the bios itself rather than passing them down. Maybe that would help. So if a request is properly aligned and covers several stripes, it could be mapped to one-bio-per-device. Is that the goal?

> *Testing
> md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
>
> The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
>
> *RAID-1 RAID-10 barrier inconsistency
> Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.
>
> *DAX
> Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
>
> *sysfs / ioctl
> Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.
>
> Using configfs instead of sysfs?

It seems that no one actually wants this, but I'll just throw in my opinion that this is a nonsensical suggestion. configfs is only for people who don't understand sysfs. It has no real value.

> *Stop nested RAID device
> For example, a raid0 on top of a raid5. Userspace must understand the topology to stop the nested raid arrays. mdadm stop is async; a sync option for stopping an array is needed (clarify?).

I've been thinking that it might be useful for md (and dm and loop and..) to have a setting whereby it automatically shuts down on last close. This would make it easier to orchestrate shutdown. Certainly it would make sense to use such a mode for stacked arrays.

The "mdadm stop is async" comment refers to the fact that mddev_delayed_delete is run in a work queue, possibly after "mdadm -S /dev/mdX" completes. It might also refer to the fact that udev subsequently deletes things from /dev, possibly after a further delay.

It might be possible to remove the need for mddev_delayed_delete if we enhance disk_release (in genhd.c) in some way so that it can drop the reference on the mddev (instead of mddev having to be careful when it drops the reference on the gendisk).

Getting mdadm to wait for udev might be easy if udev provides some sort of API for that.

NeilBrown

> *More stable in-kernel API
> There are race conditions when accessing md_dev data: the data could change because of a resync. dm-raid needs to get a reliable resync status report. Further discussion is needed here; an email/draft patch could be helpful.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
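The arithmetic behind the one-bio-per-device idea: in raid0, consecutive chunks round-robin across the members, so a request that is chunk-aligned and covers a whole number of stripes touches exactly one contiguous range on each member, and could in principle be remapped as one bio per device instead of one bio per chunk that the block layer must re-merge. A simplified sketch of the single-zone case (the kernel's raid0 also handles multiple zones when members differ in size; this helper is illustrative, not md's actual function):

#include <stdint.h>

struct r0_target {
	unsigned int dev;	/* index of the member device */
	uint64_t dev_sector;	/* sector offset on that member */
};

static struct r0_target raid0_map(uint64_t sector,
				  unsigned int chunk_sects,
				  unsigned int nr_devs)
{
	uint64_t chunk = sector / chunk_sects;	/* global chunk number */
	struct r0_target t = {
		.dev = chunk % nr_devs,			/* round-robin */
		.dev_sector = (chunk / nr_devs) * chunk_sects +
			      sector % chunk_sects,
	};
	return t;
}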
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-16  0:08 ` NeilBrown
@ 2017-09-18  4:54   ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2017-09-18 4:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Sat, Sep 16, 2017 at 10:08:06AM +1000, Neil Brown wrote:
>
> Sounds like an interesting, wide-ranging discussion...
>
> On Fri, Sep 15 2017, Shaohua Li wrote:
>
> > This is a short note based on Song's record. Please reply to the list if anything is missing.
> >
> > *IMSM - PPL
> > Fix the write hole without an extra device; updated status and upcoming mdadm changes to support it. The Intel guys are improving it, e.g. fixing the current 'disable disk cache' problem.
> >
> > *Hiding member drives
> > Hiding RAID array member drives from the user, so an MD RAID array looks more like a hardware RAID array. This turns out to be a real customer requirement. We do need to access the member drives for different reasons (create/assembly, mdmon, iostat). Working around this issue might be possible, e.g. deleting the /dev/xxx nodes after array assembly. But we must justify the value and also discuss this with the block guys, since it is a general issue.
>
> "Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
> Do we need to:
>  - remove from /proc/partitions - possible and possibly sane
>  - remove from /dev - easy, given clear justification
>  - remove from /sys/block - I don't think this is justifiable
>  - make open() impossible - it already is if you use O_EXCL
>  ??
>
> Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.

Agree. The requirement isn't very clear right now.

> > *Block-mq
> > Converting MD to use blk-mq? md is a bio-based (not request-based) driver, so there is no value in going mq. md dispatches bios directly to the low-level disks. blk-mq still benefits md if a low-level disk supports it, but this is transparent to md.
> >
> > *NVDIMM caching
> > NVDIMM supports a block interface, so using it as a raid5 cache disk should be straightforward. Directly storing the raid5 stripe cache in NVDIMM, without the current raid5-cache log device? This has problems, for example how to detect/fix data mismatches after a power failure, and would need major changes in the raid5 code.
> >
> > *stream ID
> > Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.
> >
> > *split/merge problem
> > The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.
>
> For raid5 I can understand this being an issue, as raid5 only submits PAGE_SIZE bios to lower-level devices.
> The batching that Shaohua added might be a good starting point. If you have a batch of stripes, you should be able to submit one bio per device for the whole batch.
>
> For RAID0, I don't understand the problem. RAID0 never splits smaller than the chunk size, and that is a firm requirement.
> Maybe RAID0 could merge the bios itself rather than passing them down. Maybe that would help. So if a request is properly aligned and covers several stripes, it could be mapped to one-bio-per-device. Is that the goal?

Yes, I think one-bio-per-device is the goal. Splitting bios according to the chunk size and then merging them again for the low-level disks does take CPU time.

> > *Testing
> > md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
> >
> > The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
> >
> > *RAID-1 RAID-10 barrier inconsistency
> > Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.
> >
> > *DAX
> > Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
> >
> > *sysfs / ioctl
> > Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.
> >
> > Using configfs instead of sysfs?
>
> It seems that no one actually wants this, but I'll just throw in my opinion that this is a nonsensical suggestion. configfs is only for people who don't understand sysfs. It has no real value.
>
> > *Stop nested RAID device
> > For example, a raid0 on top of a raid5. Userspace must understand the topology to stop the nested raid arrays. mdadm stop is async; a sync option for stopping an array is needed (clarify?).
>
> I've been thinking that it might be useful for md (and dm and loop and..) to have a setting whereby it automatically shuts down on last close. This would make it easier to orchestrate shutdown. Certainly it would make sense to use such a mode for stacked arrays.

loop does have the setting to automatically shut down on last close.

Thanks,
Shaohua

> The "mdadm stop is async" comment refers to the fact that mddev_delayed_delete is run in a work queue, possibly after "mdadm -S /dev/mdX" completes. It might also refer to the fact that udev subsequently deletes things from /dev, possibly after a further delay.
>
> It might be possible to remove the need for mddev_delayed_delete if we enhance disk_release (in genhd.c) in some way so that it can drop the reference on the mddev (instead of mddev having to be careful when it drops the reference on the gendisk).
>
> Getting mdadm to wait for udev might be easy if udev provides some sort of API for that.

^ permalink raw reply	[flat|nested] 22+ messages in thread
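The loop setting Shaohua refers to is LO_FLAGS_AUTOCLEAR: once set, the kernel detaches the loop device when its last reference is dropped. A sketch of enabling it through the existing loop ioctls, shown only as the model an equivalent md/dm option could follow:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

/* Ask an already-bound loop device to detach itself on last close. */
int loop_set_autoclear(const char *loopdev)
{
	struct loop_info64 info;
	int fd = open(loopdev, O_RDWR);

	if (fd < 0)
		return -1;
	memset(&info, 0, sizeof(info));
	if (ioctl(fd, LOOP_GET_STATUS64, &info) < 0)
		goto fail;
	info.lo_flags |= LO_FLAGS_AUTOCLEAR;
	if (ioctl(fd, LOOP_SET_STATUS64, &info) < 0)
		goto fail;
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}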
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-16  0:08 ` NeilBrown
  2017-09-18  4:54   ` Shaohua Li
@ 2017-09-18  7:04   ` Mikael Abrahamsson
  2017-09-18  8:56     ` NeilBrown
  2017-09-18 13:57     ` Wols Lists
  1 sibling, 2 replies; 22+ messages in thread
From: Mikael Abrahamsson @ 2017-09-18 7:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Sat, 16 Sep 2017, NeilBrown wrote:

> "Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
> Do we need to:
>  - remove from /proc/partitions - possible and possibly sane
>  - remove from /dev - easy, given clear justification
>  - remove from /sys/block - I don't think this is justifiable
>  - make open() impossible - it already is if you use O_EXCL
>  ??
>
> Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.

This is interesting.

On the IRC channel #linux-raid on freenode, we have frequent visitors who say "oh, I happened to overwrite an active component drive with dd" or "I zeroed the superblock on an active component by mistake", etc. So there is something to be said for removing the "/dev/sd*" nodes when the devices are part of an active md device.

However, this would cause problems when people are using, for instance, "smartctl" and equivalent ways to pull data from the devices. Same with mdadm -E.

Just thinking out loud, perhaps it would make sense to create some kind of hierarchy along the lines of "/proc/mapper/md0/" and put the components there for monitoring. However, I think it would be quite confusing for users if /dev/sd[b-f] disappeared as soon as it was put into an array. There is also the question of how to refer to these devices when manipulating them with mdadm.

I don't have good answers, but I can say that there is user pain out there when they shoot themselves in the foot. If we can come up with a clever way to help them (without too many downsides), it'd be good.

If we could disable writing to the drives/partitions from regular userspace when they're handled by md, that could be some kind of middle ground. I guess most tools don't use O_EXCL.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 22+ messages in thread
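Mikael's closing point is easy to demonstrate: the kernel already enforces exclusive claims at open() time, so a tool that opts in with O_EXCL is refused while md holds the member, whereas the plain open() that dd performs still succeeds. A small probe program, as a sketch:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Probe whether a block device is exclusively claimed (as md does for
 * its members).  While /dev/sdX belongs to a running array, the O_EXCL
 * open fails with EBUSY; a plain open() would still succeed, which is
 * exactly the foot-gun described above.
 */
int main(int argc, char **argv)
{
	int fd;

	if (argc < 2)
		return 2;
	fd = open(argv[1], O_RDONLY | O_EXCL);
	if (fd < 0 && errno == EBUSY) {
		printf("%s is exclusively claimed\n", argv[1]);
		return 1;
	}
	if (fd >= 0)
		close(fd);
	printf("%s is not exclusively claimed\n", argv[1]);
	return 0;
}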
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-18  7:04   ` Mikael Abrahamsson
@ 2017-09-18  8:56     ` NeilBrown
  2017-10-01  5:32       ` Mikael Abrahamsson
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2017-09-18 8:56 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3010 bytes --]

On Mon, Sep 18 2017, Mikael Abrahamsson wrote:

> On Sat, 16 Sep 2017, NeilBrown wrote:
>
>> "Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
>> Do we need to:
>>  - remove from /proc/partitions - possible and possibly sane
>>  - remove from /dev - easy, given clear justification
>>  - remove from /sys/block - I don't think this is justifiable
>>  - make open() impossible - it already is if you use O_EXCL
>>  ??
>>
>> Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.
>
> This is interesting.
>
> On the IRC channel #linux-raid on freenode, we have frequent visitors who say "oh, I happened to overwrite an active component drive with dd" or "I zeroed the superblock on an active component by mistake", etc. So there is something to be said for removing the "/dev/sd*" nodes when the devices are part of an active md device.
>
> However, this would cause problems when people are using, for instance, "smartctl" and equivalent ways to pull data from the devices. Same with mdadm -E.
>
> Just thinking out loud, perhaps it would make sense to create some kind of hierarchy along the lines of "/proc/mapper/md0/" and put the components there for monitoring. However, I think it would be quite confusing for users if /dev/sd[b-f] disappeared as soon as it was put into an array. There is also the question of how to refer to these devices when manipulating them with mdadm.
>
> I don't have good answers, but I can say that there is user pain out there when they shoot themselves in the foot. If we can come up with a clever way to help them (without too many downsides), it'd be good.
>
> If we could disable writing to the drives/partitions from regular userspace when they're handled by md, that could be some kind of middle ground. I guess most tools don't use O_EXCL.

This is awkward. There are times when userspace needs to write to a device which is in use by md. One example is when using DDF or Intel metadata, where userspace manages the metadata. Another is using raid6check to correct inconsistencies.

I don't object at all to making it hard for regular commands to write to the devices, but it is hard to come up with a good way to do it.

Maybe just removing the /dev/XXX entry would be best. That doesn't stop a determined program, but it does make it harder for an inexperienced user. As you say, that might cause confusion though.

It would be nice if we could simply remove write permission, but that is ignored when root opens things.

We could add an ioctl that needs to be called on an fd before writes are allowed. This would effect per-fd write permission that applies even to root. It feels a bit ugly, but might be possible.

Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-18  8:56     ` NeilBrown
@ 2017-10-01  5:32       ` Mikael Abrahamsson
  2017-10-04  0:49         ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: Mikael Abrahamsson @ 2017-10-01 5:32 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Mon, 18 Sep 2017, NeilBrown wrote:

> Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.

Btw, if someone does --zero-superblock or dd's /dev/zero to a component device that is active, what happens when mdadm --stop /dev/mdX is run? Does it write out the complete superblock again? If not, would that be hard to do?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-01  5:32       ` Mikael Abrahamsson
@ 2017-10-04  0:49         ` NeilBrown
  2017-10-04 11:02           ` Artur Paszkiewicz
  2017-10-04 17:28           ` Piergiorgio Sartor
  0 siblings, 2 replies; 22+ messages in thread
From: NeilBrown @ 2017-10-04 0:49 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3706 bytes --]

On Sun, Oct 01 2017, Mikael Abrahamsson wrote:

> On Mon, 18 Sep 2017, NeilBrown wrote:
>
>> Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.
>
> Btw, if someone does --zero-superblock or dd's /dev/zero to a component device that is active, what happens when mdadm --stop /dev/mdX is run? Does it write out the complete superblock again?

--zero-superblock won't work on a device that is currently part of an array; dd'ing /dev/zero over it will.
When the array is stopped, the metadata will be written if the array is not read-only and is not clean. So for 'linear' and 'raid0' it is never written. For others it probably is, but may not be.
I'm not sure that forcing a write makes sense. A dd could corrupt lots of stuff, and just saving the metadata is not a big win.

I've been playing with some code, and this patch makes it impossible to write to a device which is in use by md.
Well... not exactly. If a partition is in use by md, the whole device can still be written to, but the partition itself cannot.
Also, if the metadata is managed by user-space, writes are still allowed. To fix that, we would need to capture each write request and validate the sector range. Not impossible, but ugly.

Also, by itself, this patch breaks the use of raid6check on an active array. We could fix that by enabling writes whenever a region is suspended.

Still... maybe it is a starting point for thinking about the problem.
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0ff1bbf6c90e..7c469cd9febc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2264,6 +2264,7 @@ static int lock_rdev(struct md_rdev *rdev, dev_t dev, int shared)
 		pr_warn("md: could not open %s.\n", __bdevname(dev, b));
 		return PTR_ERR(bdev);
 	}
+	bdev->bd_holder_only_writes = !shared;
 	rdev->bdev = bdev;
 	return err;
 }
@@ -2272,6 +2273,7 @@ static void unlock_rdev(struct md_rdev *rdev)
 {
 	struct block_device *bdev = rdev->bdev;
 	rdev->bdev = NULL;
+	bdev->bd_holder_only_writes = 0;
 	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 }
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..673b71bac731 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1816,10 +1816,14 @@ void blkdev_put(struct block_device *bdev, fmode_t mode)
 		WARN_ON_ONCE(--bdev->bd_contains->bd_holders < 0);
 
 		/* bd_contains might point to self, check in a separate step */
-		if ((bdev_free = !bdev->bd_holders))
+		if ((bdev_free = !bdev->bd_holders)) {
+			bdev->bd_holder_only_writes = 0;
 			bdev->bd_holder = NULL;
-		if (!bdev->bd_contains->bd_holders)
+		}
+		if (!bdev->bd_contains->bd_holders) {
+			bdev->bd_contains->bd_holder_only_writes = 0;
 			bdev->bd_contains->bd_holder = NULL;
+		}
 
 		spin_unlock(&bdev_lock);
 
@@ -1884,8 +1888,13 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	loff_t size = i_size_read(bd_inode);
 	struct blk_plug plug;
 	ssize_t ret;
+	struct block_device *bdev = I_BDEV(bd_inode);
 
-	if (bdev_read_only(I_BDEV(bd_inode)))
+	if (bdev_read_only(bdev))
+		return -EPERM;
+	if (bdev->bd_holder != NULL &&
+	    bdev->bd_holder_only_writes &&
+	    bdev->bd_holder != file)
 		return -EPERM;
 
 	if (!iov_iter_count(from))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..79e3a2822867 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -424,6 +424,7 @@ struct block_device {
 	void *			bd_holder;
 	int			bd_holders;
 	bool			bd_write_holder;
+	bool			bd_holder_only_writes;
 #ifdef CONFIG_SYSFS
 	struct list_head	bd_holder_disks;
 #endif

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 22+ messages in thread
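To make the intent of the RFC concrete: with the patch above applied (and only then), a plain write to a claimed member is expected to fail even for root, while reads are unaffected. A hypothetical demonstration, where /dev/sda stands in for an active array member:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Expected behaviour on a kernel carrying the RFC patch above:
 * md holds /dev/sda with bd_holder set and bd_holder_only_writes = 1,
 * so a plain open() without O_EXCL still succeeds, but the write is
 * rejected in blkdev_write_iter() with EPERM, root included.
 */
int main(void)
{
	char buf[512] = { 0 };
	int fd = open("/dev/sda", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, buf, sizeof(buf)) < 0)
		perror("write");	/* expected: Operation not permitted */
	close(fd);
	return 0;
}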
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04  0:49         ` NeilBrown
@ 2017-10-04 11:02           ` Artur Paszkiewicz
  2017-10-04 11:23             ` Artur Paszkiewicz
  2017-10-04 21:41             ` NeilBrown
  1 sibling, 2 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2017-10-04 11:02 UTC (permalink / raw)
  To: NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 02:49 AM, NeilBrown wrote:
> On Sun, Oct 01 2017, Mikael Abrahamsson wrote:
>
>> On Mon, 18 Sep 2017, NeilBrown wrote:
>>
>>> Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.
>>
>> Btw, if someone does --zero-superblock or dd's /dev/zero to a component device that is active, what happens when mdadm --stop /dev/mdX is run? Does it write out the complete superblock again?
>
> --zero-superblock won't work on a device that is currently part of an array; dd'ing /dev/zero over it will.
> When the array is stopped, the metadata will be written if the array is not read-only and is not clean. So for 'linear' and 'raid0' it is never written. For others it probably is, but may not be.
> I'm not sure that forcing a write makes sense. A dd could corrupt lots of stuff, and just saving the metadata is not a big win.
>
> I've been playing with some code, and this patch makes it impossible to write to a device which is in use by md.
> Well... not exactly. If a partition is in use by md, the whole device can still be written to, but the partition itself cannot.
> Also, if the metadata is managed by user-space, writes are still allowed. To fix that, we would need to capture each write request and validate the sector range. Not impossible, but ugly.
>
> Also, by itself, this patch breaks the use of raid6check on an active array. We could fix that by enabling writes whenever a region is suspended.
>
> Still... maybe it is a starting point for thinking about the problem.

Hi Neil,

I tested your patch and it works. One minor issue: ioctls are still allowed, so we can do e.g. blkdiscard on a component device, or something like this:

# mdadm -C /dev/md0 -e1.0 -l1 -n2 /dev/sd[ab] --assume-clean -R
# parted /dev/md0 mklabel msdos
# parted /dev/md0 mkpart primary 1M 100%
# partprobe /dev/sda
# dd if=/dev/zero of=/dev/sda1 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.010408 s, 101 MB/s

Earlier I was thinking about something similar, to disallow writes to devices used by md, but the problem with external metadata updates looked like a showstopper to me. To keep it clean (and not ugly), I think md would have to expose some special interface for updating metadata from userspace.

Also, our customers are asking specifically for an option to "hide" the member drives. Making it impossible to write to the devices is OK, but they would like to see only the md device, "like hardware raid". One of the reasons is that some monitoring or inventory-scan applications treat md arrays and their components as separate storage devices. They should probably modify their software, but maybe that's not possible.

I've been experimenting with different solutions, and here is a patch that allows "hiding" disk devices and their partitions by writing to a sysfs attribute. It removes the device nodes (on devtmpfs) and the links in /sys/block/, and removes the devices from the block class list, so they won't be included in places like /proc/partitions, /sys/class/block/, lsblk and so on.
The device's "real" sysfs directory under /sys/devices/ is still available, and the links in /sys/dev/block/ are preserved. Applications like mdadm can use this to hide/unhide their component devices.

Thanks,
Artur

diff --git a/block/genhd.c b/block/genhd.c
index 7f520fa25d16..58b8e3eb14af 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -685,6 +685,55 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+static int hide_disk(struct gendisk *disk)
+{
+	struct device *ddev = disk_to_dev(disk);
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+	int ret;
+
+	ret = device_hide(ddev);
+	if (ret)
+		return ret;
+
+	disk_part_iter_init(&piter, disk,
+			    DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	while ((part = disk_part_iter_next(&piter)))
+		device_hide(part_to_dev(part));
+	disk_part_iter_exit(&piter);
+
+	if (!sysfs_deprecated)
+		sysfs_remove_link(block_depr, dev_name(ddev));
+
+	return ret;
+}
+
+static int unhide_disk(struct gendisk *disk)
+{
+	struct device *ddev = disk_to_dev(disk);
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+	int ret;
+
+	ret = device_unhide(ddev);
+	if (ret)
+		return ret;
+
+	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
+	while ((part = disk_part_iter_next(&piter)))
+		device_unhide(part_to_dev(part));
+	disk_part_iter_exit(&piter);
+
+	if (!sysfs_deprecated) {
+		if (sysfs_create_link(block_depr, &ddev->kobj,
+				      kobject_name(&ddev->kobj)))
+			pr_warn("%s: failed to restore /sys/block link\n",
+				disk->disk_name);
+	}
+
+	return ret;
+}
+
 /* sysfs access to bad-blocks list. */
 static ssize_t disk_badblocks_show(struct device *dev,
 					struct device_attribute *attr,
@@ -1017,6 +1066,31 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
 	return sprintf(buf, "%d\n", queue_discard_alignment(disk->queue));
 }
 
+static ssize_t disk_hidden_show(struct device *dev,
+				struct device_attribute *attr,
+				char *buf)
+{
+	return sprintf(buf, "%d\n", device_is_hidden(dev));
+}
+
+static ssize_t disk_hidden_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t len)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	bool hide;
+	int ret;
+
+	ret = kstrtobool(buf, &hide);
+	if (ret)
+		return ret;
+
+	if (hide != device_is_hidden(dev))
+		ret = hide ? hide_disk(disk) : unhide_disk(disk);
+
+	return ret ?: len;
+}
+
 static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
 static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
 static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
@@ -1030,6 +1104,8 @@ static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
 static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
 		disk_badblocks_store);
+static DEVICE_ATTR(hidden, S_IRUGO | S_IWUSR, disk_hidden_show,
+		disk_hidden_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1058,6 +1134,7 @@ static struct attribute *disk_attrs[] = {
 #ifdef CONFIG_FAIL_IO_TIMEOUT
 	&dev_attr_fail_timeout.attr,
 #endif
+	&dev_attr_hidden.attr,
 	NULL
 };
 
diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..0223a37c7a8c 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -350,6 +350,9 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	if (err)
 		goto out_put;
 
+	if (device_is_hidden(ddev))
+		device_hide(pdev);
+
 	err = -ENOMEM;
 	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
 	if (!p->holder_dir)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 755451f684bc..0804a45b8fbf 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1871,6 +1871,71 @@ void device_del(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(device_del);
 
+DEFINE_KLIST(klist_hidden_devices, NULL, NULL);
+
+bool device_is_hidden(struct device *dev)
+{
+	bool ret = false;
+
+	if (dev->class) {
+		mutex_lock(&dev->class->p->mutex);
+		ret = (dev->knode_class.n_klist == &klist_hidden_devices);
+		mutex_unlock(&dev->class->p->mutex);
+	}
+	return ret;
+}
+
+int device_hide(struct device *dev)
+{
+	char *envp[] = { "EVENT=hide", NULL };
+
+	if (!dev->class)
+		return -EINVAL;
+
+	if (MAJOR(dev->devt))
+		devtmpfs_delete_node(dev);
+
+	device_remove_class_symlinks(dev);
+
+	mutex_lock(&dev->class->p->mutex);
+	/* remove the device from the class list */
+	klist_remove(&dev->knode_class);
+	/* add to the hidden devices list */
+	klist_add_tail(&dev->knode_class, &klist_hidden_devices);
+	mutex_unlock(&dev->class->p->mutex);
+
+	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
+
+	return 0;
+}
+
+int device_unhide(struct device *dev)
+{
+	char *envp[] = { "EVENT=unhide", NULL };
+	int err;
+
+	if (!dev->class)
+		return -EINVAL;
+
+	err = device_add_class_symlinks(dev);
+	if (err)
+		return err;
+
+	if (MAJOR(dev->devt))
+		devtmpfs_create_node(dev);
+
+	mutex_lock(&dev->class->p->mutex);
+	/* remove the device from the hidden devices list */
+	klist_remove(&dev->knode_class);
+	/* tie the class to the device */
+	klist_add_tail(&dev->knode_class, &dev->class->p->klist_devices);
+	mutex_unlock(&dev->class->p->mutex);
+
+	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
+
+	return err;
+}
+
 /**
  * device_unregister - unregister device from system.
  * @dev: device going away.
diff --git a/include/linux/device.h b/include/linux/device.h
index beabdbc08420..90ab1e6b63c6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1118,6 +1118,9 @@ extern void device_unregister(struct device *dev);
 extern void device_initialize(struct device *dev);
 extern int __must_check device_add(struct device *dev);
 extern void device_del(struct device *dev);
+extern bool device_is_hidden(struct device *dev);
+extern int device_hide(struct device *dev);
+extern int device_unhide(struct device *dev);
 extern int device_for_each_child(struct device *dev, void *data,
 		     int (*fn)(struct device *dev, void *data));
 extern int device_for_each_child_reverse(struct device *dev, void *data,

^ permalink raw reply related	[flat|nested] 22+ messages in thread
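Driving the attribute this RFC adds amounts to writing '1' or '0' to the disk's "hidden" file, i.e. the equivalent of 'echo 1 > /sys/block/sda/hidden'. A small helper, valid only against a kernel carrying the patch; note that hiding removes the /sys/block/<disk> link, so unhiding has to go through the preserved /sys/dev/block/<major>:<minor> path:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_sysfs(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	int ok;

	if (fd < 0)
		return -1;
	ok = write(fd, val, 1) == 1;
	close(fd);
	return ok ? 0 : -1;
}

/* Hide by name while /sys/block/<name> still exists, e.g. "sda". */
int hide_disk(const char *name)
{
	char path[128];

	snprintf(path, sizeof(path), "/sys/block/%s/hidden", name);
	return write_sysfs(path, "1");
}

/* Unhide via the preserved /sys/dev/block/<maj>:<min> link. */
int unhide_disk(int major, int minor)
{
	char path[128];

	snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/hidden",
		 major, minor);
	return write_sysfs(path, "0");
}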
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04 11:02           ` Artur Paszkiewicz
@ 2017-10-04 11:23             ` Artur Paszkiewicz
  2017-10-04 17:30               ` Piergiorgio Sartor
  2017-10-04 21:18               ` Phil Turmel
  2017-10-04 21:41             ` NeilBrown
  1 sibling, 2 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2017-10-04 11:23 UTC (permalink / raw)
  To: NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
> Applications like mdadm can use this to hide/unhide their component devices.

And here is an example patch for mdadm. It adds options to manually hide or unhide the component devices:

mdadm --hide-components /dev/md0
mdadm --unhide-components /dev/md0

And an option for mdadm.conf that automatically hides the array's member disks when assembling and when new disks are added:

ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd name=linux-ns31:0 hide-components=yes

Hidden disks (hidden by mdadm --hide-components or by config) should unhide when the array is stopped or disks are removed. It only works for whole devices, not partitions.

Thanks,
Artur

diff --git a/Assemble.c b/Assemble.c
index 3c10b6cd..98dfe20d 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1029,6 +1029,12 @@ static int start_array(int mdfd,
 				i/2, mddev);
 	}
 
+	if (ident->hide_components) {
+		if (Manage_hidden(mdfd, 1, NULL))
+			pr_err("Failed to hide component devices for %s\n",
+			       mddev);
+	}
+
 	if (content->array.level == LEVEL_CONTAINER) {
 		if (c->verbose >= 0) {
 			pr_err("Container %s has been assembled with %d drive%s",
@@ -1500,6 +1506,13 @@ try_again:
 	if (content != &info) {
 		/* This is a member of a container.  Try starting the array. */
 		int err;
+
+		if (ident->hide_components) {
+			pr_err("Ignoring 'hide_components' from config for %s.\n",
+			       mddev);
+			pr_err("This should be set for container, not subarray.\n");
+		}
+
 		err = assemble_container_content(st, mdfd, content, c,
 						 chosen_name, NULL);
 		close(mdfd);
diff --git a/Grow.c b/Grow.c
index 1149753d..ae3ea512 100644
--- a/Grow.c
+++ b/Grow.c
@@ -637,7 +637,7 @@ int Grow_consistency_policy(char *devname, int fd, struct context *c, struct sha
 			int dfd;
 			char *devpath;
 
-			devpath = map_dev(sd->disk.major, sd->disk.minor, 0);
+			devpath = map_dev(sd->disk.major, sd->disk.minor, 1);
 			dfd = dev_open(devpath, O_RDWR);
 			if (dfd < 0) {
 				pr_err("Failed to open %s\n", devpath);
@@ -2461,7 +2461,7 @@ static int set_new_data_offset(struct mdinfo *sra, struct supertype *st,
 
 		if (sd->disk.state & (1<<MD_DISK_FAULTY))
 			continue;
-		dn = map_dev(sd->disk.major, sd->disk.minor, 0);
+		dn = map_dev(sd->disk.major, sd->disk.minor, 1);
 		dfd = dev_open(dn, O_RDONLY);
 		if (dfd < 0) {
 			pr_err("%s: cannot open component %s\n",
@@ -2520,6 +2520,9 @@ static int set_new_data_offset(struct mdinfo *sra, struct supertype *st,
 		char *dn = map_dev(sd->disk.major, sd->disk.minor, 0);
 		unsigned long long new_data_offset;
 
+		if (!dn)
+			dn = sd->sys_name;
+
 		if (sd->disk.state & (1<<MD_DISK_FAULTY))
 			continue;
 		if (delta_disks < 0) {
@@ -2759,7 +2762,7 @@ static void get_space_after(int fd, struct supertype *st, struct mdinfo *info)
 
 		if (sd->disk.state & (1<<MD_DISK_FAULTY))
 			continue;
-		dn = map_dev(sd->disk.major, sd->disk.minor, 0);
+		dn = map_dev(sd->disk.major, sd->disk.minor, 1);
 		dfd = dev_open(dn, O_RDONLY);
 		if (dfd < 0)
 			break;
diff --git a/Incremental.c b/Incremental.c
index 91301eb5..5ce1cf9f 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -534,6 +534,11 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 		} else if (c->verbose >= 0)
 			pr_err("%s attached to %s which is already active.\n",
 			       devname, chosen_name);
+		if (match && match->hide_components) {
+			if (Manage_hidden(mdfd, 1, &rdev))
+				pr_err("Failed to hide %s for %s\n",
+				       devname, chosen_name);
+		}
 		rv = 0;
 		goto out_unlock;
 	}
@@ -604,6 +609,12 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 				pr_err("%s re-added to %s\n",
 				       dsk->sys_name, chosen_name);
 		}
+
+		if (match && match->hide_components) {
+			if (Manage_hidden(mdfd, 1, NULL))
+				pr_err("Failed to hide component devices for %s\n",
+				       chosen_name);
+		}
 	} else {
 		pr_err("%s attached to %s, but failed to start: %s.\n",
 		       devname, chosen_name, strerror(errno));
@@ -1401,6 +1412,11 @@ restart:
 			if (c->verbose >= 0)
 				pr_err("started array %s\n",
 				       me->path ?: me->devnm);
+			if (mddev && mddev->hide_components) {
+				if (Manage_hidden(mdfd, 1, NULL))
+					pr_err("Failed to hide component devices for %s\n",
+					       me->path ?: me->devnm);
+			}
 		} else {
 			pr_err("failed to start array %s: %s\n",
 			       me->path ?: me->devnm,
@@ -1450,7 +1466,7 @@ static int Incremental_container(struct supertype *st, char *devname,
 	struct map_ent *map = NULL;
 	struct mdinfo info;
 	int trustworthy;
-	struct mddev_ident *match;
+	struct mddev_ident *match, *ident;
 	int rv = 0;
 	struct domainlist *domains;
 	struct map_ent *smp;
@@ -1477,6 +1493,8 @@ static int Incremental_container(struct supertype *st, char *devname,
 	if (match == NULL && rv == 2)
 		return rv;
 
+	ident = match;
+
 	/* Need to compute 'trustworthy' */
 	if (match)
 		trustworthy = LOCAL;
@@ -1608,6 +1626,17 @@ static int Incremental_container(struct supertype *st, char *devname,
 		printf("\n");
 	}
 
+	if (ident && ident->hide_components) {
+		int mdfd = open(devname, O_RDONLY);
+
+		if (mdfd >= 0) {
+			if (Manage_hidden(mdfd, 1, NULL))
+				pr_err("Failed to hide component devices for %s\n",
+				       devname);
+			close(mdfd);
+		}
+	}
+
 	/* don't move spares to container with volume being activated
 	   when all volumes are blocked */
 	if (ra_all == ra_blocked)
diff --git a/Manage.c b/Manage.c
index 21536f5e..7aa32fa9 100644
--- a/Manage.c
+++ b/Manage.c
@@ -224,6 +224,8 @@ int Manage_stop(char *devname, int fd, int verbose, int will_retry)
 			devname);
 		return 1;
 	}
+	if (container[0] == 0)
+		Manage_hidden(fd, 0, NULL);
 	/* If this is an mdmon managed array, just write 'inactive'
 	 * to the array state and let mdmon clear up.
 	 */
@@ -711,6 +713,34 @@ skip_re_add:
 	return 0;
 }
 
+int Manage_hidden(int fd, int hide, dev_t *rdev)
+{
+	struct mdinfo *sra, *dv;
+	int ret = 0;
+
+	sra = sysfs_read(fd, NULL, GET_DEVS);
+	if (!sra || !sra->devs)
+		return 1;
+
+	for (dv = sra->devs; dv; dv = dv->next) {
+		if ((dv->hidden == hide) ||
+		    (rdev && *rdev != makedev(dv->disk.major, dv->disk.minor)))
+			continue;
+
+		if (!sysfs_attribute_available(sra, dv, "block/hidden")) {
+			ret = 1;
+			break;
+		}
+
+		ret = sysfs_set_num(sra, dv, "block/hidden", hide);
+		if (ret)
+			break;
+	}
+
+	sysfs_free(sra);
+	return ret;
+}
+
 int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	       struct supertype *tst, mdu_array_info_t *array,
 	       int force, int verbose, char *devname,
@@ -721,6 +751,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	struct supertype *dev_st;
 	int j;
 	mdu_disk_info_t disc;
+	struct mddev_ident *match;
+	struct mdinfo mdi;
 
 	if (!get_dev_size(tfd, dv->devname, &ldsize)) {
 		if (dv->disposition == 'M')
@@ -909,6 +941,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	disc.number = raid_slot;
 	disc.state = 0;
 
+	tst->ss->getinfo_super(tst, &mdi, NULL);
+
 	/* only add journal to array that supports journaling */
 	if (dv->disposition == 'j') {
 		struct mdinfo *mdp;
@@ -1065,6 +1099,13 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	}
 	if (verbose >= 0)
 		pr_err("added %s\n", dv->devname);
+
+	match = conf_match(tst, &mdi, devname, 0, NULL);
+	if (match && match->hide_components) {
+		if (Manage_hidden(fd, 1, &rdev))
+			pr_err("Failed to hide %s for %s", dv->devname, devname);
+	}
+
 	return 1;
 }
 
@@ -1138,6 +1179,8 @@ int Manage_remove(struct supertype *tst, int fd, struct mddev_dev *dv,
 		 */
 		err = sys_hot_remove_disk(sysfd, force);
 	} else {
+		Manage_hidden(fd, 0, &rdev);
+
 		err = hot_remove_disk(fd, rdev, force);
 		if (err && errno == ENODEV) {
 			/* Old kernels rejected this if no personality
diff --git a/ReadMe.c b/ReadMe.c
index 4d871e9d..5ce211dc 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -109,6 +109,9 @@ struct option long_options[] = {
     {"dump", 1, 0, Dump},
     {"restore", 1, 0, Restore},
 
+    {"hide-components", 0, 0, HideComponents},
+    {"unhide-components", 0, 0, UnhideComponents},
+
     /* synonyms */
     {"monitor",   0, 0, 'F'},
 
diff --git a/config.c b/config.c
index 48e02788..f37aeb8a 100644
--- a/config.c
+++ b/config.c
@@ -380,6 +380,7 @@ void arrayline(char *line)
 	mis.name[0] = 0;
 	mis.container = NULL;
 	mis.member = NULL;
+	mis.hide_components = 0;
 
 	for (w = dl_next(line); w != line; w = dl_next(w)) {
 		if (w[0] == '/' || strchr(w, '=') == NULL) {
@@ -493,6 +494,10 @@ void arrayline(char *line)
 			/* The container holding this subarray.
 			 * Either a device name or a uuid */
 			mis.container = xstrdup(w + 10);
+		} else if (strncasecmp(w, "hide-components=yes", 19) == 0) {
+			mis.hide_components = 1;
+		} else if (strncasecmp(w, "hide-components=no", 18) == 0) {
+			mis.hide_components = 0;
 		} else {
 			pr_err("unrecognised word on ARRAY line: %s\n",
 				w);
diff --git a/mdadm.c b/mdadm.c
index 7cdcdba7..2d27019c 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -119,6 +119,7 @@ int main(int argc, char *argv[])
 	ident.name[0] = 0;
 	ident.container = NULL;
 	ident.member = NULL;
+	ident.hide_components = 0;
 
 	if (get_linux_version() < 2006015) {
 		pr_err("This version of mdadm does not support kernels older than 2.6.15\n");
@@ -241,6 +242,8 @@ int main(int argc, char *argv[])
 		case Dump:
 		case Restore:
 		case Action:
+		case HideComponents:
+		case UnhideComponents:
 			newmode = MISC;
 			break;
 
@@ -1026,6 +1029,8 @@ int main(int argc, char *argv[])
 		case O(MISC, Dump):
 		case O(MISC, Restore):
 		case O(MISC ,Action):
+		case O(MISC ,HideComponents):
+		case O(MISC ,UnhideComponents):
 			if (opt == KillSubarray || opt == UpdateSubarray) {
 				if (c.subarray) {
 					pr_err("subarray can only be specified once\n");
@@ -1995,6 +2000,11 @@ static int misc_list(struct mddev_dev *devlist,
 			case 'w':
 				rv |= Manage_ro(dv->devname, mdfd, -1);
 				break;
+			case HideComponents:
+			case UnhideComponents:
+				rv |= Manage_hidden(mdfd,
+						    dv->disposition == HideComponents,
+						    NULL);
 			}
 			close(mdfd);
 		} else
diff --git a/mdadm.h b/mdadm.h
index 85947bf6..7d956a64 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -348,6 +348,7 @@ struct mdinfo {
 		ARRAY_UNKNOWN_STATE,
 	} array_state;
 	struct md_bb bb;
+	int hidden;
 };
 
 struct createinfo {
@@ -454,6 +455,8 @@ enum special_options {
 	ClusterConfirm,
 	WriteJournal,
 	ConsistencyPolicy,
+	HideComponents,
+	UnhideComponents,
 };
 
 enum prefix_standard {
@@ -510,6 +513,8 @@ struct mddev_ident {
 				 */
 	char	*member;	/* subarray within a container */
 
+	int	hide_components;
+
 	struct mddev_ident *next;
 	union {
 		/* fields needed by different users of this structure */
@@ -1338,6 +1343,7 @@ extern int Manage_stop(char *devname, int fd, int quiet,
 extern int Manage_subdevs(char *devname, int fd, struct mddev_dev *devlist,
 			  int verbose, int test, char *update,
 			  int force);
+extern int Manage_hidden(int fd, int hide, dev_t *rdev);
 extern int autodetect(void);
 extern int Grow_Add_device(char *devname, int fd, char *newdev);
 extern int Grow_addbitmap(char *devname, int fd,
diff --git a/super-intel.c b/super-intel.c
index 536cb613..d57d22c4 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -3847,6 +3847,7 @@ static void fd2devname(int fd, char *name)
 static int nvme_get_serial(int fd, void *buf, size_t buf_len)
 {
 	char path[60];
+	struct stat st;
 	char *name = fd2kname(fd);
 
 	if (!name)
@@ -3855,7 +3856,15 @@ static int nvme_get_serial(int fd, void *buf, size_t buf_len)
 	if (strncmp(name, "nvme", 4) != 0)
 		return 1;
 
-	snprintf(path, sizeof(path) - 1, "/sys/block/%s/device/serial", name);
+	snprintf(path, sizeof(path), "/sys/block/%s/device/serial", name);
+
+	if (stat(path, &st) != 0) {
+		if (fstat(fd, &st) != 0)
+			return 1;
+
+		snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/../serial",
+			 major(st.st_rdev), minor(st.st_rdev));
+	}
 
 	return load_sys(path, buf, buf_len);
 }
diff --git a/sysfs.c b/sysfs.c
index bf5c8c5d..f079d51b 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -326,6 +326,11 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
 			continue;
 		}
 
+		strcpy(dbase, "block/hidden");
+		if (load_sys(fname, buf, sizeof(buf)) == 0 &&
+		    strtoul(buf, NULL, 0) == 1)
+			dev->hidden = 1;
+
 		/* finally add this disk to the array */
 		*devp = dev;
 		devp = & dev->next;

^ permalink raw reply related	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04 11:23             ` Artur Paszkiewicz
@ 2017-10-04 17:30               ` Piergiorgio Sartor
  2017-10-04 18:03                 ` John Stoffel
  0 siblings, 1 reply; 22+ messages in thread
From: Piergiorgio Sartor @ 2017-10-04 17:30 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: NeilBrown, Mikael Abrahamsson, linux-raid

On Wed, Oct 04, 2017 at 01:23:02PM +0200, Artur Paszkiewicz wrote:
> On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
> > Applications like mdadm can use this to hide/unhide their component devices.
>
> And here is an example patch for mdadm. It adds options to manually hide or unhide the component devices:
>
> mdadm --hide-components /dev/md0
> mdadm --unhide-components /dev/md0

This seems to me already quite sensible.

> And an option for mdadm.conf that automatically hides the array's member disks when assembling and when new disks are added:
>
> ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd name=linux-ns31:0 hide-components=yes
>
> Hidden disks (hidden by mdadm --hide-components or by config) should unhide when the array is stopped or disks are removed. It only works for whole devices, not partitions.

Well, this, on the other hand, makes it not really useful.

Furthermore, what about "cat /proc/mdstat"? What will it show in the case of hidden components?

Thanks,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04 17:30               ` Piergiorgio Sartor
@ 2017-10-04 18:03                 ` John Stoffel
  0 siblings, 0 replies; 22+ messages in thread
From: John Stoffel @ 2017-10-04 18:03 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Artur Paszkiewicz, NeilBrown, Mikael Abrahamsson, linux-raid

>>>>> "Piergiorgio" == Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> writes:

Piergiorgio> On Wed, Oct 04, 2017 at 01:23:02PM +0200, Artur Paszkiewicz wrote:
>> On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
>> > Applications like mdadm can use this to hide/unhide their component devices.
>>
>> And here is an example patch for mdadm. It adds options to manually hide or unhide the component devices:
>>
>> mdadm --hide-components /dev/md0
>> mdadm --unhide-components /dev/md0

Piergiorgio> This seems to me already quite sensible.

>> And an option for mdadm.conf that automatically hides the array's member disks when assembling and when new disks are added:
>>
>> ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd name=linux-ns31:0 hide-components=yes
>>
>> Hidden disks (hidden by mdadm --hide-components or by config) should unhide when the array is stopped or disks are removed. It only works for whole devices, not partitions.

Piergiorgio> Well, this, on the other hand, makes it not really useful.

Yeah, this totally doesn't work or scale. If you're using partitions (why would you not?) as raid devices, even on a whole disk, you now limit yourself.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Phil Turmel @ 2017-10-04 21:18 UTC (permalink / raw)
To: Artur Paszkiewicz, NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 07:23 AM, Artur Paszkiewicz wrote:
> On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
>> Applications like mdadm can use this to hide/unhide their component
>> devices.
>
> And here is an example patch for mdadm. It adds options to manually hide
> or unhide the component devices:
>
> mdadm --hide-components /dev/md0
> mdadm --unhide-components /dev/md0
>
> And an option for mdadm.conf that automatically hides the array's member
> disks when assembling and when new disks are added:
>
> ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd
> name=linux-ns31:0 hide-components=yes
>
> Hidden disks (by mdadm --hide-components or by config) should unhide
> when the array is stopped or disks are removed. It only works for whole
> devices, not partitions.

I am not at all in favor of these patches, as they break all kinds of
diagnostic tooling, such as "lsblk --tree" and my own "lsdrv" [1], and
certainly violate the principle of least surprise.

Please keep these out of the kernel, or better yet, explain to your
customers that visibility into the components is a key *advantage* of
software raid.

Regards,

Phil Turmel

[1] https://github.com/pturmel/lsdrv

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: NeilBrown @ 2017-10-04 21:41 UTC (permalink / raw)
To: Artur Paszkiewicz, Mikael Abrahamsson; +Cc: linux-raid

On Wed, Oct 04 2017, Artur Paszkiewicz wrote:
>
> Also, our customers are asking specifically for an option to "hide" the
> member drives. Making it impossible to write to the devices is ok, but
> they would like to see only the md device, "like hardware raid". One of
> the reasons is that some monitoring or inventory scan applications treat
> md arrays and their components as separate storage devices. They should
> probably modify their software but maybe that's not possible.

What exactly is meant by "hide"? How, exactly, does this software scan
all devices? /proc/partitions? /sys/block? /sys/class/block?
/dev/disks? All of the above?

Modifying their application *must* be on the table, else modifying the
kernel is *not* on the table. I'm certainly open to enhancing the
kernel so that it is easy to skip a particular class of devices, but if
their application chooses to ignore the information the kernel provides,
then the fact that the application doesn't work is their problem, not
ours.

> I've been experimenting with different solutions and here is a patch
> that allows to "hide" disk devices and their partitions by writing to a
> sysfs attribute. It removes the device nodes (on devtmpfs) and the links
> in /sys/block/, and removes the devices from the block class list, so
> they won't be included in places like /proc/partitions, /sys/class/block/,
> lsblk and so on. The device's "real" sysfs directory under /sys/devices/
> is still available, and the links in /sys/dev/block/ are preserved.
> Applications like mdadm can use this to hide/unhide their component
> devices.

Can you confirm that this addresses the customer problem? Do you know
which of these lists their code looks at?
Thanks,
NeilBrown

>
> Thanks,
> Artur
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 7f520fa25d16..58b8e3eb14af 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -685,6 +685,55 @@ void del_gendisk(struct gendisk *disk)
>  }
>  EXPORT_SYMBOL(del_gendisk);
>  
> +static int hide_disk(struct gendisk *disk)
> +{
> +	struct device *ddev = disk_to_dev(disk);
> +	struct disk_part_iter piter;
> +	struct hd_struct *part;
> +	int ret;
> +
> +	ret = device_hide(ddev);
> +	if (ret)
> +		return ret;
> +
> +	disk_part_iter_init(&piter, disk,
> +			    DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
> +	while ((part = disk_part_iter_next(&piter)))
> +		device_hide(part_to_dev(part));
> +	disk_part_iter_exit(&piter);
> +
> +	if (!sysfs_deprecated)
> +		sysfs_remove_link(block_depr, dev_name(ddev));
> +
> +	return ret;
> +}
> +
> +static int unhide_disk(struct gendisk *disk)
> +{
> +	struct device *ddev = disk_to_dev(disk);
> +	struct disk_part_iter piter;
> +	struct hd_struct *part;
> +	int ret;
> +
> +	ret = device_unhide(ddev);
> +	if (ret)
> +		return ret;
> +
> +	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
> +	while ((part = disk_part_iter_next(&piter)))
> +		device_unhide(part_to_dev(part));
> +	disk_part_iter_exit(&piter);
> +
> +	if (!sysfs_deprecated) {
> +		if (sysfs_create_link(block_depr, &ddev->kobj,
> +				      kobject_name(&ddev->kobj)))
> +			pr_warn("%s: failed to restore /sys/block link\n",
> +				disk->disk_name);
> +	}
> +
> +	return ret;
> +}
> +
>  /* sysfs access to bad-blocks list. */
>  static ssize_t disk_badblocks_show(struct device *dev,
>  				   struct device_attribute *attr,
> @@ -1017,6 +1066,31 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
>  	return sprintf(buf, "%d\n", queue_discard_alignment(disk->queue));
>  }
>  
> +static ssize_t disk_hidden_show(struct device *dev,
> +				struct device_attribute *attr,
> +				char *buf)
> +{
> +	return sprintf(buf, "%d\n", device_is_hidden(dev));
> +}
> +
> +static ssize_t disk_hidden_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t len)
> +{
> +	struct gendisk *disk = dev_to_disk(dev);
> +	bool hide;
> +	int ret;
> +
> +	ret = kstrtobool(buf, &hide);
> +	if (ret)
> +		return ret;
> +
> +	if (hide != device_is_hidden(dev))
> +		ret = hide ? hide_disk(disk) : unhide_disk(disk);
> +
> +	return ret ?: len;
> +}
> +
>  static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
>  static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
>  static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
> @@ -1030,6 +1104,8 @@ static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
>  static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
>  static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
>  		   disk_badblocks_store);
> +static DEVICE_ATTR(hidden, S_IRUGO | S_IWUSR, disk_hidden_show,
> +		   disk_hidden_store);
>  #ifdef CONFIG_FAIL_MAKE_REQUEST
>  static struct device_attribute dev_attr_fail =
>  	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
> @@ -1058,6 +1134,7 @@ static struct attribute *disk_attrs[] = {
>  #ifdef CONFIG_FAIL_IO_TIMEOUT
>  	&dev_attr_fail_timeout.attr,
>  #endif
> +	&dev_attr_hidden.attr,
>  	NULL
>  };
>  
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index c5ec8246e25e..0223a37c7a8c 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -350,6 +350,9 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	if (err)
>  		goto out_put;
>  
> +	if (device_is_hidden(ddev))
> +		device_hide(pdev);
> +
>  	err = -ENOMEM;
>  	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
>  	if (!p->holder_dir)
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 755451f684bc..0804a45b8fbf 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1871,6 +1871,71 @@ void device_del(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(device_del);
>  
> +DEFINE_KLIST(klist_hidden_devices, NULL, NULL);
> +
> +bool device_is_hidden(struct device *dev)
> +{
> +	bool ret = false;
> +
> +	if (dev->class) {
> +		mutex_lock(&dev->class->p->mutex);
> +		ret = (dev->knode_class.n_klist == &klist_hidden_devices);
> +		mutex_unlock(&dev->class->p->mutex);
> +	}
> +	return ret;
> +}
> +
> +int device_hide(struct device *dev)
> +{
> +	char *envp[] = { "EVENT=hide", NULL };
> +
> +	if (!dev->class)
> +		return -EINVAL;
> +
> +	if (MAJOR(dev->devt))
> +		devtmpfs_delete_node(dev);
> +
> +	device_remove_class_symlinks(dev);
> +
> +	mutex_lock(&dev->class->p->mutex);
> +	/* remove the device from the class list */
> +	klist_remove(&dev->knode_class);
> +	/* add to the hidden devices list */
> +	klist_add_tail(&dev->knode_class, &klist_hidden_devices);
> +	mutex_unlock(&dev->class->p->mutex);
> +
> +	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
> +
> +	return 0;
> +}
> +
> +int device_unhide(struct device *dev)
> +{
> +	char *envp[] = { "EVENT=unhide", NULL };
> +	int err;
> +
> +	if (!dev->class)
> +		return -EINVAL;
> +
> +	err = device_add_class_symlinks(dev);
> +	if (err)
> +		return err;
> +
> +	if (MAJOR(dev->devt))
> +		devtmpfs_create_node(dev);
> +
> +	mutex_lock(&dev->class->p->mutex);
> +	/* remove the device from the hidden devices list */
> +	klist_remove(&dev->knode_class);
> +	/* tie the class to the device */
> +	klist_add_tail(&dev->knode_class, &dev->class->p->klist_devices);
> +	mutex_unlock(&dev->class->p->mutex);
> +
> +	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
> +
> +	return err;
> +}
> +
>  /**
>   * device_unregister - unregister device from system.
>   * @dev: device going away.
> diff --git a/include/linux/device.h b/include/linux/device.h
> index beabdbc08420..90ab1e6b63c6 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1118,6 +1118,9 @@ extern void device_unregister(struct device *dev);
>  extern void device_initialize(struct device *dev);
>  extern int __must_check device_add(struct device *dev);
>  extern void device_del(struct device *dev);
> +extern bool device_is_hidden(struct device *dev);
> +extern int device_hide(struct device *dev);
> +extern int device_unhide(struct device *dev);
>  extern int device_for_each_child(struct device *dev, void *data,
>  				 int (*fn)(struct device *dev, void *data));
>  extern int device_for_each_child_reverse(struct device *dev, void *data,

^ permalink raw reply	[flat|nested] 22+ messages in thread
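To make the proposed interface concrete: assuming the patch above lands
as-is, driving it by hand would look roughly like this. This is a sketch,
not from the thread; "sdb" and its 8:16 major:minor are just examples,
and the attribute path follows from the DEVICE_ATTR placement above.

    # Hide sdb and all its partitions: the /dev nodes and the
    # /sys/block/sdb and /sys/class/block/sdb links go away, but
    # /sys/devices/.../sdb and /sys/dev/block/8:16 remain, per the
    # patch description.
    echo 1 > /sys/block/sdb/hidden

    # While hidden, /sys/block/sdb is gone, so unhide through the
    # preserved /sys/dev/block link to the same device directory.
    echo 0 > /sys/dev/block/8:16/hidden
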
* Re: Linux Plumbers MD BOF discussion notes
From: Artur Paszkiewicz @ 2017-10-05 18:52 UTC (permalink / raw)
To: NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 11:41 PM, NeilBrown wrote:
> On Wed, Oct 04 2017, Artur Paszkiewicz wrote:
>
>> Also, our customers are asking specifically for an option to "hide" the
>> member drives. Making it impossible to write to the devices is ok, but
>> they would like to see only the md device, "like hardware raid". One of
>> the reasons is that some monitoring or inventory scan applications treat
>> md arrays and their components as separate storage devices. They should
>> probably modify their software but maybe that's not possible.
>
> What exactly is meant by "hide"? How, exactly, does this software scan
> all devices? /proc/partitions? /sys/block? /sys/class/block?
> /dev/disks? All of the above?
>
> Modifying their application *must* be on the table, else modifying the
> kernel is *not* on the table. I'm certainly open to enhancing the
> kernel so that it is easy to skip a particular class of devices, but if
> their application chooses to ignore the information the kernel provides,
> then the fact that the application doesn't work is their problem, not
> ours.
>
>> I've been experimenting with different solutions and here is a patch
>> that allows to "hide" disk devices and their partitions by writing to a
>> sysfs attribute. It removes the device nodes (on devtmpfs) and the links
>> in /sys/block/, and removes the devices from the block class list, so
>> they won't be included in places like /proc/partitions, /sys/class/block/,
>> lsblk and so on. The device's "real" sysfs directory under /sys/devices/
>> is still available, and the links in /sys/dev/block/ are preserved.
>> Applications like mdadm can use this to hide/unhide their component
>> devices.
>
> Can you confirm that this addresses the customer problem? Do you know
> which of these lists their code looks at?

Yes, this is what they asked for. I know that at least /proc/partitions
and /sys/class/block are used. But this is not a single customer (or
application) case; it keeps coming up again and again... Of course,
educating users about the specifics of md, and that not hiding the
drives is actually an advantage, always comes first. Some get it and
would be ok with just blocking write access to the drives, but others
really want this hiding approach.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: NeilBrown @ 2017-10-05 23:39 UTC (permalink / raw)
To: Artur Paszkiewicz, Mikael Abrahamsson; +Cc: linux-raid

On Thu, Oct 05 2017, Artur Paszkiewicz wrote:

> On 10/04/2017 11:41 PM, NeilBrown wrote:
>> On Wed, Oct 04 2017, Artur Paszkiewicz wrote:
>>
>>> Also, our customers are asking specifically for an option to "hide" the
>>> member drives. Making it impossible to write to the devices is ok, but
>>> they would like to see only the md device, "like hardware raid". One of
>>> the reasons is that some monitoring or inventory scan applications treat
>>> md arrays and their components as separate storage devices. They should
>>> probably modify their software but maybe that's not possible.
>>
>> What exactly is meant by "hide"? How, exactly, does this software scan
>> all devices? /proc/partitions? /sys/block? /sys/class/block?
>> /dev/disks? All of the above?
>>
>> Modifying their application *must* be on the table, else modifying the
>> kernel is *not* on the table. I'm certainly open to enhancing the
>> kernel so that it is easy to skip a particular class of devices, but if
>> their application chooses to ignore the information the kernel provides,
>> then the fact that the application doesn't work is their problem, not
>> ours.
>>
>>> I've been experimenting with different solutions and here is a patch
>>> that allows to "hide" disk devices and their partitions by writing to a
>>> sysfs attribute. It removes the device nodes (on devtmpfs) and the links
>>> in /sys/block/, and removes the devices from the block class list, so
>>> they won't be included in places like /proc/partitions, /sys/class/block/,
>>> lsblk and so on. The device's "real" sysfs directory under /sys/devices/
>>> is still available, and the links in /sys/dev/block/ are preserved.
>>> Applications like mdadm can use this to hide/unhide their component
>>> devices.
>>
>> Can you confirm that this addresses the customer problem? Do you know
>> which of these lists their code looks at?
>
> Yes, this is what they asked for. I know that at least /proc/partitions
> and /sys/class/block are used. But this is not a single customer (or
> application) case; it keeps coming up again and again... Of course,
> educating users about the specifics of md, and that not hiding the
> drives is actually an advantage, always comes first. Some get it and
> would be ok with just blocking write access to the drives, but others
> really want this hiding approach.

Hmmm.. interesting that it is multiple applications.
Given that both md and dm have been around for years and have layered on
top of existing devices without hiding them, it is hard to see how these
applications have not run into problems before.

There is some precedent for hiding things from /proc/partitions.
Removable devices like CDROMs are hidden, and you can easily hide
individual devices by setting GENHD_FL_SUPPRESS_PARTITION_INFO.
We might be able to get that through. It is certainly worth writing a
patch and letting people experiment with it.

Removing things from /sys/class/block is much less likely to be
acceptable. Code really shouldn't be digging around in there without
knowing what it is doing. It should be trivial to ignore any entry in
/sys/class/block where the 'holders' directory is not empty, and that
should achieve the goal. For users of /sys/class/block I think you
really need to push back on the customers and tell them their
application is broken.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 22+ messages in thread
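A minimal sketch of the filtering NeilBrown suggests above, as an
inventory-style tool might implement it (not from the thread; shell is
used for brevity):

    # Enumerate block devices, skipping anything another device holds:
    # a non-empty holders/ directory means md, dm or similar sits on top.
    for dev in /sys/class/block/*; do
        [ -n "$(ls -A "$dev/holders" 2>/dev/null)" ] && continue
        echo "${dev##*/}"
    done
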
* Re: Linux Plumbers MD BOF discussion notes
From: Christoph Hellwig @ 2017-10-06 7:13 UTC (permalink / raw)
To: NeilBrown; +Cc: Artur Paszkiewicz, Mikael Abrahamsson, linux-raid

On Fri, Oct 06, 2017 at 10:39:50AM +1100, NeilBrown wrote:
> Hmmm.. interesting that it is multiple applications.
> Given that both md and dm have been around for years and have layered on
> top of existing devices without hiding them, it is hard to see how these
> applications have not run into problems before.
>
> There is some precedent for hiding things from /proc/partitions.
> Removable devices like CDROMs are hidden, and you can easily hide
> individual devices by setting GENHD_FL_SUPPRESS_PARTITION_INFO.
> We might be able to get that through. It is certainly worth writing a
> patch and letting people experiment with it.

No way in hell this would get through. Preemptive NAK right here.

That whole idea is just amazingly stupid, and no one has even explained
a reason for it.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Mikael Abrahamsson @ 2017-10-06 7:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: NeilBrown, Artur Paszkiewicz, linux-raid

On Fri, 6 Oct 2017, Christoph Hellwig wrote:

> That whole idea is just amazingly stupid, and no one has even explained
> a reason for it.

It's all about making it easier for the people who use the functionality
we have, and about reducing the risk of mistakes. See my previous email
about the people I have seen who would have benefited from the active
component drives being not writable (or hidden).

Yes, I know this goes against the old Unix "don't try to stop the user
from shooting themselves in the foot", but I do think this should be
user-configurable.

--
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Piergiorgio Sartor @ 2017-10-04 17:28 UTC (permalink / raw)
To: NeilBrown; +Cc: Mikael Abrahamsson, linux-raid

On Wed, Oct 04, 2017 at 11:49:00AM +1100, NeilBrown wrote:
> On Sun, Oct 01 2017, Mikael Abrahamsson wrote:
>
> > On Mon, 18 Sep 2017, NeilBrown wrote:
> >
> >> Anyway, thanks for the example of a real problem related to this. It
> >> does make it easier to think about.
> >
> > Btw, if someone does --zero-superblock or dd /dev/zero to a component
> > device that is active, what happens when mdadm --stop /dev/mdX is run?
> > Does it write out the complete superblock again?
>
> --zero-superblock won't work on a device that is currently part of an
> array. dd /dev/zero will.
> When the array is stopped the metadata will be written if the array is
> not read-only and is not clean.
> So for 'linear' and 'raid0' it is never written. For others it probably
> is, but may not be.
> I'm not sure that forcing a write makes sense. A dd could corrupt lots
> of stuff, and just saving the metadata is not a big win.
>
> I've been playing with some code, and this patch makes it impossible to
> write to a device which is in use by md.
> Well... not exactly. If a partition is in use by md, the whole device
> can still be written to. But the partition itself cannot.
> Also, if metadata is managed by user space, writes are still allowed.
> To fix that, we would need to capture each write request and validate
> the sector range. Not impossible, but ugly.
>
> Also, by itself, this patch breaks the use of raid6check on an active
> array. We could fix that by enabling writes whenever a region is
> suspended.

Maybe you all have to make up your minds on how to handle md devices
and their components.

We had long discussions about "not having code in kernel space", to
avoid useless burden, and using user space instead.
Now, someone discovers that user space is very dangerous and should be
blocked.

So, what should we do? Add an interface to the md devices in order to
access the components? Will this really be safe against clueless people
trying "dd" here and there?

I think if someone destroys a RAID using "dd" on the individual
components, he/she deserves it. I made similar mistakes, and I would
not blame md for them.
And having "mdadm" protect against things like "--zero-superblock" is
fine, correct, and exactly the safety net that is needed.

In order to conclude, please decide on kernel vs. user space approaches
*before* making changes.

Thanks!

> Still... maybe it is a starting point for thinking about the problem.

Yes, you're right,

bye,

--
piergiorgio

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Anthony Youngman @ 2017-10-04 18:13 UTC (permalink / raw)
To: Piergiorgio Sartor, NeilBrown; +Cc: Mikael Abrahamsson, linux-raid

On 04/10/17 18:28, Piergiorgio Sartor wrote:
> In order to conclude, please decide on kernel vs. user space approaches
> *before* making changes.
>
> Thanks!
>
>> Still... maybe it is a starting point for thinking about the problem.
>
> Yes, you're right,

Throwing a few ideas into the mix - shoot them down if they seem too
outrageous :-)

1) Can we apply xattrs to devices in /dev?
2) Do xattrs lock out root?

Okay, this won't necessarily hide component devices, but if assembling
an array (optionally) applies xattrs to them, and makes them writeable
*only* by user "mdadm", then that will help prevent a load of damage.

Of course, that doesn't protect a partition against being overwritten
by a write to the underlying drive, but that's a whole 'nother can of
worms...

And of course, the question is whether this will lock out the
operations we don't want, while allowing the stuff that we do...

Provided xattrs do lock out root from user space, this looks to me to
be the obvious route to go down, but what do I know?

Cheers,
Wol

^ permalink raw reply	[flat|nested] 22+ messages in thread
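On question 1 above: POSIX ACLs are one xattr-backed mechanism that can
express the "writable only by user mdadm" idea today, since they are
stored in the system.posix_acl_access xattr and devtmpfs (being
tmpfs-backed) supports them. A sketch, with "mdadm" as a hypothetical
dedicated account; on question 2, the answer is no - root bypasses
these permission checks via CAP_DAC_OVERRIDE:

    # Strip group/other access and grant the dedicated user rw;
    # the ACL lives in an xattr on the device node.
    chown root:root /dev/sdb
    chmod 600 /dev/sdb
    setfacl -m u:mdadm:rw /dev/sdb
    getfacl /dev/sdb    # verify the resulting ACL
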
* Re: Linux Plumbers MD BOF discussion notes
From: Wols Lists @ 2017-09-18 13:57 UTC (permalink / raw)
To: Mikael Abrahamsson, NeilBrown; +Cc: linux-raid

On 18/09/17 08:04, Mikael Abrahamsson wrote:
> If we could disable writing to the drives/partitions from regular
> userspace when they're handled by md, that could be some kind of middle
> ground. I guess most tools don't use O_EXCL.

It would make sense to set permissions to read-only, but of course on
Unix root just ignores permissions :-(

Oh for Pr1mos, where the only right that couldn't be taken away from
the super-user was the right to set rights :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 22+ messages in thread
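One existing knob that gets close to the "disable writing from regular
userspace" middle ground without new kernel code is the block-layer
read-only flag (BLKROSET). A sketch; note that on kernels of this era
the flag is enforced when a device is opened for writing, so md's
already-open file descriptor keeps working, and root can simply flip
the flag back - a guard rail rather than a lock:

    # Mark the member read-only; later attempts to open it for
    # writing fail, even for root, until the flag is cleared.
    blockdev --setro /dev/sdb1
    blockdev --getro /dev/sdb1   # prints 1

    # Clear it again, e.g. after stopping the array.
    blockdev --setrw /dev/sdb1
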
end of thread, other threads:[~2017-10-06  7:59 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2017-09-15 14:27 Linux Plumbers MD BOF discussion notes Shaohua Li
2017-09-15 20:42 ` Coly Li
2017-09-15 21:20   ` Shaohua Li
2017-09-16  0:08 ` NeilBrown
2017-09-18  4:54   ` Shaohua Li
2017-09-18  7:04     ` Mikael Abrahamsson
2017-09-18  8:56       ` NeilBrown
2017-10-01  5:32         ` Mikael Abrahamsson
2017-10-04  0:49           ` NeilBrown
2017-10-04 11:02             ` Artur Paszkiewicz
2017-10-04 11:23               ` Artur Paszkiewicz
2017-10-04 17:30                 ` Piergiorgio Sartor
2017-10-04 18:03                   ` John Stoffel
2017-10-04 21:18                 ` Phil Turmel
2017-10-04 21:41               ` NeilBrown
2017-10-05 18:52                 ` Artur Paszkiewicz
2017-10-05 23:39                   ` NeilBrown
2017-10-06  7:13                     ` Christoph Hellwig
2017-10-06  7:59                       ` Mikael Abrahamsson
2017-10-04 17:28             ` Piergiorgio Sartor
2017-10-04 18:13               ` Anthony Youngman
2017-09-18 13:57     ` Wols Lists