* Linux Plumbers MD BOF discussion notes
@ 2017-09-15 14:27 Shaohua Li
  2017-09-15 20:42 ` Coly Li
  2017-09-16  0:08 ` NeilBrown
  0 siblings, 2 replies; 22+ messages in thread
From: Shaohua Li @ 2017-09-15 14:27 UTC (permalink / raw)
  To: linux-raid

This is a short note based on Song's record. Please reply to the list if anything is missing.

*IMSM - PPL
Fix the write hole without an extra device; updated status and upcoming mdadm changes to support it. The Intel guys are improving it, e.g. fixing the current 'disable disk cache' problem.

*Hiding member drives
Hiding RAID array member drives from the user, so an MD RAID array looks more like a hardware RAID array. This turns out to be a real customer requirement. We do need to access the member drives for different reasons (create/assembly, mdmon, iostat). Working around this issue might be possible, e.g. deleting the /dev/xxx nodes after array assembly. But we must justify the value and also discuss this with the block guys, since it is a general issue.

*Block-mq
Converting MD to use blk-mq? md is a bio-based (not request-based) driver, so there is no value in going mq. md dispatches bios directly to the low-level disks. blk-mq still benefits md if a low-level disk supports it, but this is transparent to md.

*NVDIMM caching
NVDIMM supports a block interface, so using it as a raid5 cache disk should be straightforward. Directly storing the raid5 stripe cache in NVDIMM, without the current raid5-cache log device? This has problems, for example how to detect/fix data mismatches after a power failure, and would need major changes in the raid5 code.

*stream ID
Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.

*split/merge problem
The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.

*Testing
md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.

The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!

*RAID-1 RAID-10 barrier inconsistency
Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.

*DAX
Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?

*sysfs / ioctl
Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.

Using configfs instead of sysfs?

*Stop nested RAID device
For example, a raid0 on top of a raid5. Userspace must understand the topology to stop the nested raid arrays. mdadm stop is async; a sync option for stopping an array is needed (clarify?).

*More stable in-kernel API
There are race conditions when accessing md_dev data: the data could change because of a resync. dm-raid needs to get a reliable resync status report. Further discussion is needed here; an email/draft patch could be helpful.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-15 14:27 Linux Plumbers MD BOF discussion notes Shaohua Li
@ 2017-09-15 20:42 ` Coly Li
  2017-09-15 21:20   ` Shaohua Li
  2017-09-16  0:08 ` NeilBrown
  1 sibling, 1 reply; 22+ messages in thread
From: Coly Li @ 2017-09-15 20:42 UTC (permalink / raw)
  To: Shaohua Li, linux-raid

On 2017/9/15 4:27 PM, Shaohua Li wrote:
> This is a short note based on Song's record. Please reply to the list if anything is missing.

[snip]

> *stream ID
> Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.

Can anyone give me a hint about what stream ID is in the context of md raid?

> *split/merge problem
> The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.
>
> *Testing
> md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
>
> The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
>
> *RAID-1 RAID-10 barrier inconsistency
> Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.

Copied, I will handle it.

> *DAX
> Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
>
> *sysfs / ioctl
> Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.

Yes, I will handle the kernel part. The change will be done one by one, step by step. The first step is to unify the code path for both the ioctl and sysfs interfaces. Once I finish the urgent tasks on hand, I will start to handle this. Hopefully this work can start by the end of this year.

> Using configfs instead of sysfs?

Currently it is sysfs, and I feel it is OK as a sysfs interface. Do we have a specific reason or benefit for using configfs?

Thanks for the informative notes, and thank you all for the discussion!

Coly Li

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-15 20:42 ` Coly Li
@ 2017-09-15 21:20   ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2017-09-15 21:20 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-raid

On Fri, Sep 15, 2017 at 10:42:23PM +0200, Coly Li wrote:
> On 2017/9/15 4:27 PM, Shaohua Li wrote:
> > This is a short note based on Song's record. Please reply to the list if anything is missing.
>
> [snip]
>
> > *stream ID
> > Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.
>
> Can anyone give me a hint about what stream ID is in the context of md raid?

I think it's called "write hint" right now; see commit c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints").

> > *split/merge problem
> > The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.
> >
> > *Testing
> > md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
> >
> > The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
> >
> > *RAID-1 RAID-10 barrier inconsistency
> > Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.
>
> Copied, I will handle it.
>
> > *DAX
> > Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
> >
> > *sysfs / ioctl
> > Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.
>
> Yes, I will handle the kernel part. The change will be done one by one, step by step. The first step is to unify the code path for both the ioctl and sysfs interfaces. Once I finish the urgent tasks on hand, I will start to handle this. Hopefully this work can start by the end of this year.
>
> > Using configfs instead of sysfs?
>
> Currently it is sysfs, and I feel it is OK as a sysfs interface. Do we have a specific reason or benefit for using configfs?

No strong reason, just because configfs is generally for configuration. If sysfs works here, I think it's OK.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread
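The interface Shaohua cites here is the per-file write-life-time hint API. A minimal consumer sketch, assuming a kernel that includes commit c75b1d9421f8; the #define fallbacks mirror the uapi values and are only needed with libc headers that predate the fcntls:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallbacks for libc headers that predate the write-hint fcntls. */
#ifndef F_SET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

int main(int argc, char **argv)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Tag all subsequent writes through this inode as short-lived. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");
	close(fd);
	return 0;
}

The raid5 problem described in the notes is that a hint attached at this level can be rewritten by md multiple times (read-modify-write) before reaching the member devices, defeating the hint.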
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-15 14:27 Linux Plumbers MD BOF discussion notes Shaohua Li
  2017-09-15 20:42 ` Coly Li
@ 2017-09-16  0:08 ` NeilBrown
  2017-09-18  4:54   ` Shaohua Li
  2017-09-18  7:04   ` Mikael Abrahamsson
  1 sibling, 2 replies; 22+ messages in thread
From: NeilBrown @ 2017-09-16 0:08 UTC (permalink / raw)
  To: Shaohua Li, linux-raid

[-- Attachment #1: Type: text/plain, Size: 6130 bytes --]

Sounds like an interesting, wide-ranging discussion...

On Fri, Sep 15 2017, Shaohua Li wrote:

> This is a short note based on Song's record. Please reply to the list if anything is missing.
>
> *IMSM - PPL
> Fix the write hole without an extra device; updated status and upcoming mdadm changes to support it. The Intel guys are improving it, e.g. fixing the current 'disable disk cache' problem.
>
> *Hiding member drives
> Hiding RAID array member drives from the user, so an MD RAID array looks more like a hardware RAID array. This turns out to be a real customer requirement. We do need to access the member drives for different reasons (create/assembly, mdmon, iostat). Working around this issue might be possible, e.g. deleting the /dev/xxx nodes after array assembly. But we must justify the value and also discuss this with the block guys, since it is a general issue.

"Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
Do we need to:
 - remove from /proc/partitions - possible and possibly sane
 - remove from /dev - easy, given clear justification
 - remove from /sys/block - I don't think this is justifiable
 - make open() impossible - it already is if you use O_EXCL
 ??

Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.

> *Block-mq
> Converting MD to use blk-mq? md is a bio-based (not request-based) driver, so there is no value in going mq. md dispatches bios directly to the low-level disks. blk-mq still benefits md if a low-level disk supports it, but this is transparent to md.
>
> *NVDIMM caching
> NVDIMM supports a block interface, so using it as a raid5 cache disk should be straightforward. Directly storing the raid5 stripe cache in NVDIMM, without the current raid5-cache log device? This has problems, for example how to detect/fix data mismatches after a power failure, and would need major changes in the raid5 code.
>
> *stream ID
> Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.
>
> *split/merge problem
> The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.

For raid5 I can understand this being an issue, as raid5 only submits PAGE_SIZE bios to lower-level devices.
The batching that Shaohua added might be a good starting point. If you have a batch of stripes, you should be able to submit one bio per device for the whole batch.

For RAID0, I don't understand the problem. RAID0 never splits smaller than the chunk size, and that is a firm requirement.
Maybe RAID0 could merge the bios itself rather than passing them down. Maybe that would help. So if a request is properly aligned and covers several stripes, it could be mapped to one-bio-per-device. Is that the goal?

> *Testing
> md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
>
> The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
>
> *RAID-1 RAID-10 barrier inconsistency
> Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.
>
> *DAX
> Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
>
> *sysfs / ioctl
> Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.
>
> Using configfs instead of sysfs?

It seems that no one actually wants this, but I'll just throw in my opinion that this is a nonsensical suggestion. configfs is only for people who don't understand sysfs. It has no real value.

> *Stop nested RAID device
> For example, a raid0 on top of a raid5. Userspace must understand the topology to stop the nested raid arrays. mdadm stop is async; a sync option for stopping an array is needed (clarify?).

I've been thinking that it might be useful for md (and dm and loop and..) to have a setting whereby it automatically shuts down on last close. This would make it easier to orchestrate shutdown. Certainly it would make sense to use such a mode for stacked arrays.

The "mdadm stop is async" comment refers to the fact that mddev_delayed_delete is run in a work queue, possibly after "mdadm -S /dev/mdX" completes. It might also refer to the fact that udev subsequently deletes things from /dev, possibly after a further delay.

It might be possible to remove the need for mddev_delayed_delete if we enhance disk_release (in genhd.c) in some way so that it can drop the reference on the mddev (instead of mddev having to be careful when it drops the reference on the gendisk).

Getting mdadm to wait for udev might be easy if udev provides some sort of API for that.

NeilBrown

> *More stable in-kernel API
> There are race conditions when accessing md_dev data: the data could change because of a resync. dm-raid needs to get a reliable resync status report. Further discussion is needed here; an email/draft patch could be helpful.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
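The arithmetic behind the one-bio-per-device idea: in raid0, consecutive chunks round-robin across the members, so a request that is chunk-aligned and covers a whole number of stripes touches exactly one contiguous range on each member, and could in principle be remapped as one bio per device instead of one bio per chunk that the block layer must re-merge. A simplified sketch of the single-zone case (the kernel's raid0 also handles multiple zones when members differ in size; this helper is illustrative, not md's actual function):

#include <stdint.h>

struct r0_target {
	unsigned int dev;	/* index of the member device */
	uint64_t dev_sector;	/* sector offset on that member */
};

static struct r0_target raid0_map(uint64_t sector,
				  unsigned int chunk_sects,
				  unsigned int nr_devs)
{
	uint64_t chunk = sector / chunk_sects;	/* global chunk number */
	struct r0_target t = {
		.dev = chunk % nr_devs,			/* round-robin */
		.dev_sector = (chunk / nr_devs) * chunk_sects +
			      sector % chunk_sects,
	};
	return t;
}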
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-16  0:08 ` NeilBrown
@ 2017-09-18  4:54   ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2017-09-18 4:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Sat, Sep 16, 2017 at 10:08:06AM +1000, Neil Brown wrote:
>
> Sounds like an interesting, wide-ranging discussion...
>
> On Fri, Sep 15 2017, Shaohua Li wrote:
>
> > This is a short note based on Song's record. Please reply to the list if anything is missing.
> >
> > *IMSM - PPL
> > Fix the write hole without an extra device; updated status and upcoming mdadm changes to support it. The Intel guys are improving it, e.g. fixing the current 'disable disk cache' problem.
> >
> > *Hiding member drives
> > Hiding RAID array member drives from the user, so an MD RAID array looks more like a hardware RAID array. This turns out to be a real customer requirement. We do need to access the member drives for different reasons (create/assembly, mdmon, iostat). Working around this issue might be possible, e.g. deleting the /dev/xxx nodes after array assembly. But we must justify the value and also discuss this with the block guys, since it is a general issue.
>
> "Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
> Do we need to:
>  - remove from /proc/partitions - possible and possibly sane
>  - remove from /dev - easy, given clear justification
>  - remove from /sys/block - I don't think this is justifiable
>  - make open() impossible - it already is if you use O_EXCL
>  ??
>
> Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.

Agree. The requirement isn't very clear right now.

> > *Block-mq
> > Converting MD to use blk-mq? md is a bio-based (not request-based) driver, so there is no value in going mq. md dispatches bios directly to the low-level disks. blk-mq still benefits md if a low-level disk supports it, but this is transparent to md.
> >
> > *NVDIMM caching
> > NVDIMM supports a block interface, so using it as a raid5 cache disk should be straightforward. Directly storing the raid5 stripe cache in NVDIMM, without the current raid5-cache log device? This has problems, for example how to detect/fix data mismatches after a power failure, and would need major changes in the raid5 code.
> >
> > *stream ID
> > Support stream ID in MD. It should be fairly easy to support stream ID in raid0/1/10. The Intel guys described a scenario in raid5 which breaks stream ID, e.g. writing stripe data multiple times because of read-modify-write (clarify?). Probably detecting the IO pattern, as DM does, can help.
> >
> > *split/merge problem
> > The md layer splits bios and the block layer then merges bios for the low-level disks. The merge/split overhead is noticeable for raid0 with fast SSDs and a small chunk size. Fixing the issue for raid0 is doable; fixing it for raid5 is less certain. We discussed increasing the stripe size of raid5 to reduce the split/merge overhead. There is a tradeoff here, for example more unnecessary IO for read-modify-write with a bigger stripe size.
>
> For raid5 I can understand this being an issue, as raid5 only submits PAGE_SIZE bios to lower-level devices.
> The batching that Shaohua added might be a good starting point. If you have a batch of stripes, you should be able to submit one bio per device for the whole batch.
>
> For RAID0, I don't understand the problem. RAID0 never splits smaller than the chunk size, and that is a firm requirement.
> Maybe RAID0 could merge the bios itself rather than passing them down. Maybe that would help. So if a request is properly aligned and covers several stripes, it could be mapped to one-bio-per-device. Is that the goal?

Yes, I think one-bio-per-device is the goal. Splitting bios according to the chunk size and then merging them again for the low-level disks does take CPU time.

> > *Testing
> > md needs to recover data after disk failures. mdadm has a test suite, but it does not cover all cases, and it is fragile and may kill the machine. We need to build more complete tests.
> >
> > The recent null_blk block device driver can emulate several types of disk failures. The plan is to make null_blk support all disk failures which md can handle and create a test suite using null_blk. Help is welcome!
> >
> > *RAID-1 RAID-10 barrier inconsistency
> > Coly improved the barrier scalability for raid1; hopefully he can do the same for raid10.
> >
> > *DAX
> > Supporting DAX in raid0/linear should not be hard. Does it make sense to support other raid types?
> >
> > *sysfs / ioctl
> > Jes started working on this. The goal is to replace ioctls with sysfs-based interfaces. There are gaps currently, e.g. some operations can only be done with an ioctl. The SUSE guys promised to close the gap on the kernel side.
> >
> > Using configfs instead of sysfs?
>
> It seems that no one actually wants this, but I'll just throw in my opinion that this is a nonsensical suggestion. configfs is only for people who don't understand sysfs. It has no real value.
>
> > *Stop nested RAID device
> > For example, a raid0 on top of a raid5. Userspace must understand the topology to stop the nested raid arrays. mdadm stop is async; a sync option for stopping an array is needed (clarify?).
>
> I've been thinking that it might be useful for md (and dm and loop and..) to have a setting whereby it automatically shuts down on last close. This would make it easier to orchestrate shutdown. Certainly it would make sense to use such a mode for stacked arrays.

loop does have the setting to automatically shut down on last close.

Thanks,
Shaohua

> The "mdadm stop is async" comment refers to the fact that mddev_delayed_delete is run in a work queue, possibly after "mdadm -S /dev/mdX" completes. It might also refer to the fact that udev subsequently deletes things from /dev, possibly after a further delay.
>
> It might be possible to remove the need for mddev_delayed_delete if we enhance disk_release (in genhd.c) in some way so that it can drop the reference on the mddev (instead of mddev having to be careful when it drops the reference on the gendisk).
>
> Getting mdadm to wait for udev might be easy if udev provides some sort of API for that.

^ permalink raw reply	[flat|nested] 22+ messages in thread
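The loop setting Shaohua refers to is LO_FLAGS_AUTOCLEAR: once set, the kernel detaches the loop device when its last reference is dropped. A sketch of enabling it through the existing loop ioctls, shown only as the model an equivalent md/dm option could follow:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

/* Ask an already-bound loop device to detach itself on last close. */
int loop_set_autoclear(const char *loopdev)
{
	struct loop_info64 info;
	int fd = open(loopdev, O_RDWR);

	if (fd < 0)
		return -1;
	memset(&info, 0, sizeof(info));
	if (ioctl(fd, LOOP_GET_STATUS64, &info) < 0)
		goto fail;
	info.lo_flags |= LO_FLAGS_AUTOCLEAR;
	if (ioctl(fd, LOOP_SET_STATUS64, &info) < 0)
		goto fail;
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}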
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-16  0:08 ` NeilBrown
  2017-09-18  4:54   ` Shaohua Li
@ 2017-09-18  7:04   ` Mikael Abrahamsson
  2017-09-18  8:56     ` NeilBrown
  2017-09-18 13:57     ` Wols Lists
  1 sibling, 2 replies; 22+ messages in thread
From: Mikael Abrahamsson @ 2017-09-18 7:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Sat, 16 Sep 2017, NeilBrown wrote:

> "Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
> Do we need to:
>  - remove from /proc/partitions - possible and possibly sane
>  - remove from /dev - easy, given clear justification
>  - remove from /sys/block - I don't think this is justifiable
>  - make open() impossible - it already is if you use O_EXCL
>  ??
>
> Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.

This is interesting.

On the IRC channel #linux-raid on freenode, we have frequent visitors who say "oh, I happened to overwrite an active component drive with dd" or "I zeroed the superblock on an active component by mistake", etc. So there is something to be said for removing the "/dev/sd*" nodes when the devices are part of an active md device.

However, this would cause problems when people are using, for instance, "smartctl" and equivalent ways to pull data from the devices. Same with mdadm -E.

Just thinking out loud, perhaps it would make sense to create some kind of hierarchy along the lines of "/proc/mapper/md0/" and put the components there for monitoring. However, I think it would be quite confusing for users if /dev/sd[b-f] disappeared as soon as it was put into an array. There is also the question of how to refer to these devices when manipulating them with mdadm.

I don't have good answers, but I can say that there is user pain out there when they shoot themselves in the foot. If we can come up with a clever way to help them (without too many downsides), it'd be good.

If we could disable writing to the drives/partitions from regular userspace when they're handled by md, that could be some kind of middle ground. I guess most tools don't use O_EXCL.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 22+ messages in thread
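Mikael's closing point is easy to demonstrate: the kernel already enforces exclusive claims at open() time, so a tool that opts in with O_EXCL is refused while md holds the member, whereas the plain open() that dd performs still succeeds. A small probe program, as a sketch:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Probe whether a block device is exclusively claimed (as md does for
 * its members).  While /dev/sdX belongs to a running array, the O_EXCL
 * open fails with EBUSY; a plain open() would still succeed, which is
 * exactly the foot-gun described above.
 */
int main(int argc, char **argv)
{
	int fd;

	if (argc < 2)
		return 2;
	fd = open(argv[1], O_RDONLY | O_EXCL);
	if (fd < 0 && errno == EBUSY) {
		printf("%s is exclusively claimed\n", argv[1]);
		return 1;
	}
	if (fd >= 0)
		close(fd);
	printf("%s is not exclusively claimed\n", argv[1]);
	return 0;
}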
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-18  7:04   ` Mikael Abrahamsson
@ 2017-09-18  8:56     ` NeilBrown
  2017-10-01  5:32       ` Mikael Abrahamsson
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2017-09-18 8:56 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3010 bytes --]

On Mon, Sep 18 2017, Mikael Abrahamsson wrote:

> On Sat, 16 Sep 2017, NeilBrown wrote:
>
>> "Hiding" is a very vague term. Should we get Harry Potter's invisibility cloak and wrap it around the hardware?
>> Do we need to:
>>  - remove from /proc/partitions - possible and possibly sane
>>  - remove from /dev - easy, given clear justification
>>  - remove from /sys/block - I don't think this is justifiable
>>  - make open() impossible - it already is if you use O_EXCL
>>  ??
>>
>> Possibly something sensible could be done, but we do need a clear statement of, and justification for, the customer requirement.
>
> This is interesting.
>
> On the IRC channel #linux-raid on freenode, we have frequent visitors who say "oh, I happened to overwrite an active component drive with dd" or "I zeroed the superblock on an active component by mistake", etc. So there is something to be said for removing the "/dev/sd*" nodes when the devices are part of an active md device.
>
> However, this would cause problems when people are using, for instance, "smartctl" and equivalent ways to pull data from the devices. Same with mdadm -E.
>
> Just thinking out loud, perhaps it would make sense to create some kind of hierarchy along the lines of "/proc/mapper/md0/" and put the components there for monitoring. However, I think it would be quite confusing for users if /dev/sd[b-f] disappeared as soon as it was put into an array. There is also the question of how to refer to these devices when manipulating them with mdadm.
>
> I don't have good answers, but I can say that there is user pain out there when they shoot themselves in the foot. If we can come up with a clever way to help them (without too many downsides), it'd be good.
>
> If we could disable writing to the drives/partitions from regular userspace when they're handled by md, that could be some kind of middle ground. I guess most tools don't use O_EXCL.

This is awkward. There are times when userspace needs to write to a device which is in use by md. One example is when using DDF or Intel metadata, where userspace manages the metadata. Another is using raid6check to correct inconsistencies.

I don't object at all to making it hard for regular commands to write to the devices, but it is hard to come up with a good way to do it.

Maybe just removing the /dev/XXX entry would be best. That doesn't stop a determined program, but it does make it harder for an inexperienced user. As you say, that might cause confusion though.

It would be nice if we could simply remove write permission, but that is ignored when root opens things.

We could add an ioctl that needs to be called on an fd before writes are allowed. This would effect per-fd write permission that applies even to root. It feels a bit ugly, but might be possible.

Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-09-18  8:56     ` NeilBrown
@ 2017-10-01  5:32       ` Mikael Abrahamsson
  2017-10-04  0:49         ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: Mikael Abrahamsson @ 2017-10-01 5:32 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Mon, 18 Sep 2017, NeilBrown wrote:

> Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.

Btw, if someone does --zero-superblock or dd's /dev/zero to a component device that is active, what happens when mdadm --stop /dev/mdX is run? Does it write out the complete superblock again? If not, would that be hard to do?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-01  5:32       ` Mikael Abrahamsson
@ 2017-10-04  0:49         ` NeilBrown
  2017-10-04 11:02           ` Artur Paszkiewicz
  2017-10-04 17:28           ` Piergiorgio Sartor
  0 siblings, 2 replies; 22+ messages in thread
From: NeilBrown @ 2017-10-04 0:49 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3706 bytes --]

On Sun, Oct 01 2017, Mikael Abrahamsson wrote:

> On Mon, 18 Sep 2017, NeilBrown wrote:
>
>> Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.
>
> Btw, if someone does --zero-superblock or dd's /dev/zero to a component device that is active, what happens when mdadm --stop /dev/mdX is run? Does it write out the complete superblock again?

--zero-superblock won't work on a device that is currently part of an array; dd'ing /dev/zero over it will.
When the array is stopped, the metadata will be written if the array is not read-only and is not clean. So for 'linear' and 'raid0' it is never written. For others it probably is, but may not be.
I'm not sure that forcing a write makes sense. A dd could corrupt lots of stuff, and just saving the metadata is not a big win.

I've been playing with some code, and this patch makes it impossible to write to a device which is in use by md.
Well... not exactly. If a partition is in use by md, the whole device can still be written to, but the partition itself cannot.
Also, if the metadata is managed by user-space, writes are still allowed. To fix that, we would need to capture each write request and validate the sector range. Not impossible, but ugly.

Also, by itself, this patch breaks the use of raid6check on an active array. We could fix that by enabling writes whenever a region is suspended.

Still... maybe it is a starting point for thinking about the problem.
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0ff1bbf6c90e..7c469cd9febc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2264,6 +2264,7 @@ static int lock_rdev(struct md_rdev *rdev, dev_t dev, int shared)
 		pr_warn("md: could not open %s.\n", __bdevname(dev, b));
 		return PTR_ERR(bdev);
 	}
+	bdev->bd_holder_only_writes = !shared;
 	rdev->bdev = bdev;
 	return err;
 }
@@ -2272,6 +2273,7 @@ static void unlock_rdev(struct md_rdev *rdev)
 {
 	struct block_device *bdev = rdev->bdev;
 	rdev->bdev = NULL;
+	bdev->bd_holder_only_writes = 0;
 	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 }
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..673b71bac731 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1816,10 +1816,14 @@ void blkdev_put(struct block_device *bdev, fmode_t mode)
 		WARN_ON_ONCE(--bdev->bd_contains->bd_holders < 0);
 
 		/* bd_contains might point to self, check in a separate step */
-		if ((bdev_free = !bdev->bd_holders))
+		if ((bdev_free = !bdev->bd_holders)) {
+			bdev->bd_holder_only_writes = 0;
 			bdev->bd_holder = NULL;
-		if (!bdev->bd_contains->bd_holders)
+		}
+		if (!bdev->bd_contains->bd_holders) {
+			bdev->bd_contains->bd_holder_only_writes = 0;
 			bdev->bd_contains->bd_holder = NULL;
+		}
 
 		spin_unlock(&bdev_lock);
 
@@ -1884,8 +1888,13 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	loff_t size = i_size_read(bd_inode);
 	struct blk_plug plug;
 	ssize_t ret;
+	struct block_device *bdev = I_BDEV(bd_inode);
 
-	if (bdev_read_only(I_BDEV(bd_inode)))
+	if (bdev_read_only(bdev))
+		return -EPERM;
+	if (bdev->bd_holder != NULL &&
+	    bdev->bd_holder_only_writes &&
+	    bdev->bd_holder != file)
 		return -EPERM;
 
 	if (!iov_iter_count(from))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..79e3a2822867 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -424,6 +424,7 @@ struct block_device {
 	void *			bd_holder;
 	int			bd_holders;
 	bool			bd_write_holder;
+	bool			bd_holder_only_writes;
 #ifdef CONFIG_SYSFS
 	struct list_head	bd_holder_disks;
 #endif

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 22+ messages in thread
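To make the intent of the RFC concrete: with the patch above applied (and only then), a plain write to a claimed member is expected to fail even for root, while reads are unaffected. A hypothetical demonstration, where /dev/sda stands in for an active array member:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Expected behaviour on a kernel carrying the RFC patch above:
 * md holds /dev/sda with bd_holder set and bd_holder_only_writes = 1,
 * so a plain open() without O_EXCL still succeeds, but the write is
 * rejected in blkdev_write_iter() with EPERM, root included.
 */
int main(void)
{
	char buf[512] = { 0 };
	int fd = open("/dev/sda", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, buf, sizeof(buf)) < 0)
		perror("write");	/* expected: Operation not permitted */
	close(fd);
	return 0;
}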
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04  0:49         ` NeilBrown
@ 2017-10-04 11:02           ` Artur Paszkiewicz
  2017-10-04 11:23             ` Artur Paszkiewicz
  2017-10-04 21:41             ` NeilBrown
  1 sibling, 2 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2017-10-04 11:02 UTC (permalink / raw)
  To: NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 02:49 AM, NeilBrown wrote:
> On Sun, Oct 01 2017, Mikael Abrahamsson wrote:
>
>> On Mon, 18 Sep 2017, NeilBrown wrote:
>>
>>> Anyway, thanks for the example of a real problem related to this. It does make it easier to think about.
>>
>> Btw, if someone does --zero-superblock or dd's /dev/zero to a component device that is active, what happens when mdadm --stop /dev/mdX is run? Does it write out the complete superblock again?
>
> --zero-superblock won't work on a device that is currently part of an array; dd'ing /dev/zero over it will.
> When the array is stopped, the metadata will be written if the array is not read-only and is not clean. So for 'linear' and 'raid0' it is never written. For others it probably is, but may not be.
> I'm not sure that forcing a write makes sense. A dd could corrupt lots of stuff, and just saving the metadata is not a big win.
>
> I've been playing with some code, and this patch makes it impossible to write to a device which is in use by md.
> Well... not exactly. If a partition is in use by md, the whole device can still be written to, but the partition itself cannot.
> Also, if the metadata is managed by user-space, writes are still allowed. To fix that, we would need to capture each write request and validate the sector range. Not impossible, but ugly.
>
> Also, by itself, this patch breaks the use of raid6check on an active array. We could fix that by enabling writes whenever a region is suspended.
>
> Still... maybe it is a starting point for thinking about the problem.

Hi Neil,

I tested your patch and it works. One minor issue: ioctls are still allowed, so we can do e.g. blkdiscard on a component device, or something like this:

# mdadm -C /dev/md0 -e1.0 -l1 -n2 /dev/sd[ab] --assume-clean -R
# parted /dev/md0 mklabel msdos
# parted /dev/md0 mkpart primary 1M 100%
# partprobe /dev/sda
# dd if=/dev/zero of=/dev/sda1 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.010408 s, 101 MB/s

Earlier I was thinking about something similar, to disallow writes to devices used by md, but the problem with external metadata updates looked like a showstopper to me. To keep it clean (and not ugly), I think md would have to expose some special interface for updating metadata from userspace.

Also, our customers are asking specifically for an option to "hide" the member drives. Making it impossible to write to the devices is OK, but they would like to see only the md device, "like hardware raid". One of the reasons is that some monitoring or inventory-scan applications treat md arrays and their components as separate storage devices. They should probably modify their software, but maybe that's not possible.

I've been experimenting with different solutions, and here is a patch that allows "hiding" disk devices and their partitions by writing to a sysfs attribute. It removes the device nodes (on devtmpfs) and the links in /sys/block/, and removes the devices from the block class list, so they won't be included in places like /proc/partitions, /sys/class/block/, lsblk and so on.
The device's "real" sysfs directory under /sys/devices/ is still available, and the links in /sys/dev/block/ are preserved. Applications like mdadm can use this to hide/unhide their component devices.

Thanks,
Artur

diff --git a/block/genhd.c b/block/genhd.c
index 7f520fa25d16..58b8e3eb14af 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -685,6 +685,55 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+static int hide_disk(struct gendisk *disk)
+{
+	struct device *ddev = disk_to_dev(disk);
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+	int ret;
+
+	ret = device_hide(ddev);
+	if (ret)
+		return ret;
+
+	disk_part_iter_init(&piter, disk,
+			    DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	while ((part = disk_part_iter_next(&piter)))
+		device_hide(part_to_dev(part));
+	disk_part_iter_exit(&piter);
+
+	if (!sysfs_deprecated)
+		sysfs_remove_link(block_depr, dev_name(ddev));
+
+	return ret;
+}
+
+static int unhide_disk(struct gendisk *disk)
+{
+	struct device *ddev = disk_to_dev(disk);
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+	int ret;
+
+	ret = device_unhide(ddev);
+	if (ret)
+		return ret;
+
+	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
+	while ((part = disk_part_iter_next(&piter)))
+		device_unhide(part_to_dev(part));
+	disk_part_iter_exit(&piter);
+
+	if (!sysfs_deprecated) {
+		if (sysfs_create_link(block_depr, &ddev->kobj,
+				      kobject_name(&ddev->kobj)))
+			pr_warn("%s: failed to restore /sys/block link\n",
+				disk->disk_name);
+	}
+
+	return ret;
+}
+
 /* sysfs access to bad-blocks list. */
 static ssize_t disk_badblocks_show(struct device *dev,
 					struct device_attribute *attr,
@@ -1017,6 +1066,31 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
 	return sprintf(buf, "%d\n", queue_discard_alignment(disk->queue));
 }
 
+static ssize_t disk_hidden_show(struct device *dev,
+				struct device_attribute *attr,
+				char *buf)
+{
+	return sprintf(buf, "%d\n", device_is_hidden(dev));
+}
+
+static ssize_t disk_hidden_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t len)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	bool hide;
+	int ret;
+
+	ret = kstrtobool(buf, &hide);
+	if (ret)
+		return ret;
+
+	if (hide != device_is_hidden(dev))
+		ret = hide ? hide_disk(disk) : unhide_disk(disk);
+
+	return ret ?: len;
+}
+
 static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
 static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
 static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
@@ -1030,6 +1104,8 @@ static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
 static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
 		disk_badblocks_store);
+static DEVICE_ATTR(hidden, S_IRUGO | S_IWUSR, disk_hidden_show,
+		disk_hidden_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1058,6 +1134,7 @@ static struct attribute *disk_attrs[] = {
 #ifdef CONFIG_FAIL_IO_TIMEOUT
 	&dev_attr_fail_timeout.attr,
 #endif
+	&dev_attr_hidden.attr,
 	NULL
 };
 
diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..0223a37c7a8c 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -350,6 +350,9 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	if (err)
 		goto out_put;
 
+	if (device_is_hidden(ddev))
+		device_hide(pdev);
+
 	err = -ENOMEM;
 	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
 	if (!p->holder_dir)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 755451f684bc..0804a45b8fbf 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1871,6 +1871,71 @@ void device_del(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(device_del);
 
+DEFINE_KLIST(klist_hidden_devices, NULL, NULL);
+
+bool device_is_hidden(struct device *dev)
+{
+	bool ret = false;
+
+	if (dev->class) {
+		mutex_lock(&dev->class->p->mutex);
+		ret = (dev->knode_class.n_klist == &klist_hidden_devices);
+		mutex_unlock(&dev->class->p->mutex);
+	}
+	return ret;
+}
+
+int device_hide(struct device *dev)
+{
+	char *envp[] = { "EVENT=hide", NULL };
+
+	if (!dev->class)
+		return -EINVAL;
+
+	if (MAJOR(dev->devt))
+		devtmpfs_delete_node(dev);
+
+	device_remove_class_symlinks(dev);
+
+	mutex_lock(&dev->class->p->mutex);
+	/* remove the device from the class list */
+	klist_remove(&dev->knode_class);
+	/* add to the hidden devices list */
+	klist_add_tail(&dev->knode_class, &klist_hidden_devices);
+	mutex_unlock(&dev->class->p->mutex);
+
+	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
+
+	return 0;
+}
+
+int device_unhide(struct device *dev)
+{
+	char *envp[] = { "EVENT=unhide", NULL };
+	int err;
+
+	if (!dev->class)
+		return -EINVAL;
+
+	err = device_add_class_symlinks(dev);
+	if (err)
+		return err;
+
+	if (MAJOR(dev->devt))
+		devtmpfs_create_node(dev);
+
+	mutex_lock(&dev->class->p->mutex);
+	/* remove the device from the hidden devices list */
+	klist_remove(&dev->knode_class);
+	/* tie the class to the device */
+	klist_add_tail(&dev->knode_class, &dev->class->p->klist_devices);
+	mutex_unlock(&dev->class->p->mutex);
+
+	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
+
+	return err;
+}
+
 /**
  * device_unregister - unregister device from system.
  * @dev: device going away.
diff --git a/include/linux/device.h b/include/linux/device.h
index beabdbc08420..90ab1e6b63c6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1118,6 +1118,9 @@ extern void device_unregister(struct device *dev);
 extern void device_initialize(struct device *dev);
 extern int __must_check device_add(struct device *dev);
 extern void device_del(struct device *dev);
+extern bool device_is_hidden(struct device *dev);
+extern int device_hide(struct device *dev);
+extern int device_unhide(struct device *dev);
 extern int device_for_each_child(struct device *dev, void *data,
 		     int (*fn)(struct device *dev, void *data));
 extern int device_for_each_child_reverse(struct device *dev, void *data,

^ permalink raw reply related	[flat|nested] 22+ messages in thread
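Driving the attribute this RFC adds amounts to writing '1' or '0' to the disk's "hidden" file, i.e. the equivalent of 'echo 1 > /sys/block/sda/hidden'. A small helper, valid only against a kernel carrying the patch; note that hiding removes the /sys/block/<disk> link, so unhiding has to go through the preserved /sys/dev/block/<major>:<minor> path:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_sysfs(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	int ok;

	if (fd < 0)
		return -1;
	ok = write(fd, val, 1) == 1;
	close(fd);
	return ok ? 0 : -1;
}

/* Hide by name while /sys/block/<name> still exists, e.g. "sda". */
int hide_disk(const char *name)
{
	char path[128];

	snprintf(path, sizeof(path), "/sys/block/%s/hidden", name);
	return write_sysfs(path, "1");
}

/* Unhide via the preserved /sys/dev/block/<maj>:<min> link. */
int unhide_disk(int major, int minor)
{
	char path[128];

	snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/hidden",
		 major, minor);
	return write_sysfs(path, "0");
}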
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04 11:02           ` Artur Paszkiewicz
@ 2017-10-04 11:23             ` Artur Paszkiewicz
  2017-10-04 17:30               ` Piergiorgio Sartor
  2017-10-04 21:18               ` Phil Turmel
  2017-10-04 21:41             ` NeilBrown
  1 sibling, 2 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2017-10-04 11:23 UTC (permalink / raw)
  To: NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
> Applications like mdadm can use this to hide/unhide their component devices.

And here is an example patch for mdadm. It adds options to manually hide or unhide the component devices:

mdadm --hide-components /dev/md0
mdadm --unhide-components /dev/md0

And an option for mdadm.conf that automatically hides the array's member disks when assembling and when new disks are added:

ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd name=linux-ns31:0 hide-components=yes

Hidden disks (hidden by mdadm --hide-components or by config) should unhide when the array is stopped or disks are removed. It only works for whole devices, not partitions.

Thanks,
Artur

diff --git a/Assemble.c b/Assemble.c
index 3c10b6cd..98dfe20d 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1029,6 +1029,12 @@ static int start_array(int mdfd,
 				i/2, mddev);
 	}
 
+	if (ident->hide_components) {
+		if (Manage_hidden(mdfd, 1, NULL))
+			pr_err("Failed to hide component devices for %s\n",
+			       mddev);
+	}
+
 	if (content->array.level == LEVEL_CONTAINER) {
 		if (c->verbose >= 0) {
 			pr_err("Container %s has been assembled with %d drive%s",
@@ -1500,6 +1506,13 @@ try_again:
 	if (content != &info) {
 		/* This is a member of a container.  Try starting the array. */
 		int err;
+
+		if (ident->hide_components) {
+			pr_err("Ignoring 'hide_components' from config for %s.\n",
+			       mddev);
+			pr_err("This should be set for container, not subarray.\n");
+		}
+
 		err = assemble_container_content(st, mdfd, content, c,
 						 chosen_name, NULL);
 		close(mdfd);
diff --git a/Grow.c b/Grow.c
index 1149753d..ae3ea512 100644
--- a/Grow.c
+++ b/Grow.c
@@ -637,7 +637,7 @@ int Grow_consistency_policy(char *devname, int fd, struct context *c, struct sha
 			int dfd;
 			char *devpath;
 
-			devpath = map_dev(sd->disk.major, sd->disk.minor, 0);
+			devpath = map_dev(sd->disk.major, sd->disk.minor, 1);
 			dfd = dev_open(devpath, O_RDWR);
 			if (dfd < 0) {
 				pr_err("Failed to open %s\n", devpath);
@@ -2461,7 +2461,7 @@ static int set_new_data_offset(struct mdinfo *sra, struct supertype *st,
 
 		if (sd->disk.state & (1<<MD_DISK_FAULTY))
 			continue;
-		dn = map_dev(sd->disk.major, sd->disk.minor, 0);
+		dn = map_dev(sd->disk.major, sd->disk.minor, 1);
 		dfd = dev_open(dn, O_RDONLY);
 		if (dfd < 0) {
 			pr_err("%s: cannot open component %s\n",
@@ -2520,6 +2520,9 @@ static int set_new_data_offset(struct mdinfo *sra, struct supertype *st,
 		char *dn = map_dev(sd->disk.major, sd->disk.minor, 0);
 		unsigned long long new_data_offset;
 
+		if (!dn)
+			dn = sd->sys_name;
+
 		if (sd->disk.state & (1<<MD_DISK_FAULTY))
 			continue;
 		if (delta_disks < 0) {
@@ -2759,7 +2762,7 @@ static void get_space_after(int fd, struct supertype *st, struct mdinfo *info)
 
 		if (sd->disk.state & (1<<MD_DISK_FAULTY))
 			continue;
-		dn = map_dev(sd->disk.major, sd->disk.minor, 0);
+		dn = map_dev(sd->disk.major, sd->disk.minor, 1);
 		dfd = dev_open(dn, O_RDONLY);
 		if (dfd < 0)
 			break;
diff --git a/Incremental.c b/Incremental.c
index 91301eb5..5ce1cf9f 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -534,6 +534,11 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 		} else if (c->verbose >= 0)
 			pr_err("%s attached to %s which is already active.\n",
 			       devname, chosen_name);
+		if (match && match->hide_components) {
+			if (Manage_hidden(mdfd, 1, &rdev))
+				pr_err("Failed to hide %s for %s\n",
+				       devname, chosen_name);
+		}
 		rv = 0;
 		goto out_unlock;
 	}
@@ -604,6 +609,12 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 				pr_err("%s re-added to %s\n",
 				       dsk->sys_name, chosen_name);
 		}
+
+		if (match && match->hide_components) {
+			if (Manage_hidden(mdfd, 1, NULL))
+				pr_err("Failed to hide component devices for %s\n",
+				       chosen_name);
+		}
 	} else {
 		pr_err("%s attached to %s, but failed to start: %s.\n",
 		       devname, chosen_name, strerror(errno));
@@ -1401,6 +1412,11 @@ restart:
 			if (c->verbose >= 0)
 				pr_err("started array %s\n",
 				       me->path ?: me->devnm);
+			if (mddev && mddev->hide_components) {
+				if (Manage_hidden(mdfd, 1, NULL))
+					pr_err("Failed to hide component devices for %s\n",
+					       me->path ?: me->devnm);
+			}
 		} else {
 			pr_err("failed to start array %s: %s\n",
 			       me->path ?: me->devnm,
@@ -1450,7 +1466,7 @@ static int Incremental_container(struct supertype *st, char *devname,
 	struct map_ent *map = NULL;
 	struct mdinfo info;
 	int trustworthy;
-	struct mddev_ident *match;
+	struct mddev_ident *match, *ident;
 	int rv = 0;
 	struct domainlist *domains;
 	struct map_ent *smp;
@@ -1477,6 +1493,8 @@ static int Incremental_container(struct supertype *st, char *devname,
 	if (match == NULL && rv == 2)
 		return rv;
 
+	ident = match;
+
 	/* Need to compute 'trustworthy' */
 	if (match)
 		trustworthy = LOCAL;
@@ -1608,6 +1626,17 @@ static int Incremental_container(struct supertype *st, char *devname,
 		printf("\n");
 	}
 
+	if (ident && ident->hide_components) {
+		int mdfd = open(devname, O_RDONLY);
+
+		if (mdfd >= 0) {
+			if (Manage_hidden(mdfd, 1, NULL))
+				pr_err("Failed to hide component devices for %s\n",
+				       devname);
+			close(mdfd);
+		}
+	}
+
 	/* don't move spares to container with volume being activated
 	   when all volumes are blocked */
 	if (ra_all == ra_blocked)
diff --git a/Manage.c b/Manage.c
index 21536f5e..7aa32fa9 100644
--- a/Manage.c
+++ b/Manage.c
@@ -224,6 +224,8 @@ int Manage_stop(char *devname, int fd, int verbose, int will_retry)
 			devname);
 		return 1;
 	}
+	if (container[0] == 0)
+		Manage_hidden(fd, 0, NULL);
 	/* If this is an mdmon managed array, just write 'inactive'
 	 * to the array state and let mdmon clear up.
 	 */
@@ -711,6 +713,34 @@ skip_re_add:
 	return 0;
 }
 
+int Manage_hidden(int fd, int hide, dev_t *rdev)
+{
+	struct mdinfo *sra, *dv;
+	int ret = 0;
+
+	sra = sysfs_read(fd, NULL, GET_DEVS);
+	if (!sra || !sra->devs)
+		return 1;
+
+	for (dv = sra->devs; dv; dv = dv->next) {
+		if ((dv->hidden == hide) ||
+		    (rdev && *rdev != makedev(dv->disk.major, dv->disk.minor)))
+			continue;
+
+		if (!sysfs_attribute_available(sra, dv, "block/hidden")) {
+			ret = 1;
+			break;
+		}
+
+		ret = sysfs_set_num(sra, dv, "block/hidden", hide);
+		if (ret)
+			break;
+	}
+
+	sysfs_free(sra);
+	return ret;
+}
+
 int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	       struct supertype *tst, mdu_array_info_t *array,
 	       int force, int verbose, char *devname,
@@ -721,6 +751,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	struct supertype *dev_st;
 	int j;
 	mdu_disk_info_t disc;
+	struct mddev_ident *match;
+	struct mdinfo mdi;
 
 	if (!get_dev_size(tfd, dv->devname, &ldsize)) {
 		if (dv->disposition == 'M')
@@ -909,6 +941,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	disc.number = raid_slot;
 	disc.state = 0;
 
+	tst->ss->getinfo_super(tst, &mdi, NULL);
+
 	/* only add journal to array that supports journaling */
 	if (dv->disposition == 'j') {
 		struct mdinfo *mdp;
@@ -1065,6 +1099,13 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	}
 	if (verbose >= 0)
 		pr_err("added %s\n", dv->devname);
+
+	match = conf_match(tst, &mdi, devname, 0, NULL);
+	if (match && match->hide_components) {
+		if (Manage_hidden(fd, 1, &rdev))
+			pr_err("Failed to hide %s for %s", dv->devname, devname);
+	}
+
 	return 1;
 }
 
@@ -1138,6 +1179,8 @@ int Manage_remove(struct supertype *tst, int fd, struct mddev_dev *dv,
 		 */
 		err = sys_hot_remove_disk(sysfd, force);
 	} else {
+		Manage_hidden(fd, 0, &rdev);
+
 		err = hot_remove_disk(fd, rdev, force);
 		if (err && errno == ENODEV) {
 			/* Old kernels rejected this if no personality
diff --git a/ReadMe.c b/ReadMe.c
index 4d871e9d..5ce211dc 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -109,6 +109,9 @@ struct option long_options[] = {
     {"dump", 1, 0, Dump},
     {"restore", 1, 0, Restore},
 
+    {"hide-components", 0, 0, HideComponents},
+    {"unhide-components", 0, 0, UnhideComponents},
+
     /* synonyms */
     {"monitor",   0, 0, 'F'},
 
diff --git a/config.c b/config.c
index 48e02788..f37aeb8a 100644
--- a/config.c
+++ b/config.c
@@ -380,6 +380,7 @@ void arrayline(char *line)
 	mis.name[0] = 0;
 	mis.container = NULL;
 	mis.member = NULL;
+	mis.hide_components = 0;
 
 	for (w = dl_next(line); w != line; w = dl_next(w)) {
 		if (w[0] == '/' || strchr(w, '=') == NULL) {
@@ -493,6 +494,10 @@ void arrayline(char *line)
 			/* The container holding this subarray.
 			 * Either a device name or a uuid */
 			mis.container = xstrdup(w + 10);
+		} else if (strncasecmp(w, "hide-components=yes", 19) == 0) {
+			mis.hide_components = 1;
+		} else if (strncasecmp(w, "hide-components=no", 18) == 0) {
+			mis.hide_components = 0;
 		} else {
 			pr_err("unrecognised word on ARRAY line: %s\n",
 				w);
diff --git a/mdadm.c b/mdadm.c
index 7cdcdba7..2d27019c 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -119,6 +119,7 @@ int main(int argc, char *argv[])
 	ident.name[0] = 0;
 	ident.container = NULL;
 	ident.member = NULL;
+	ident.hide_components = 0;
 
 	if (get_linux_version() < 2006015) {
 		pr_err("This version of mdadm does not support kernels older than 2.6.15\n");
@@ -241,6 +242,8 @@ int main(int argc, char *argv[])
 		case Dump:
 		case Restore:
 		case Action:
+		case HideComponents:
+		case UnhideComponents:
 			newmode = MISC;
 			break;
 
@@ -1026,6 +1029,8 @@ int main(int argc, char *argv[])
 		case O(MISC, Dump):
 		case O(MISC, Restore):
 		case O(MISC ,Action):
+		case O(MISC ,HideComponents):
+		case O(MISC ,UnhideComponents):
 			if (opt == KillSubarray || opt == UpdateSubarray) {
 				if (c.subarray) {
 					pr_err("subarray can only be specified once\n");
@@ -1995,6 +2000,11 @@ static int misc_list(struct mddev_dev *devlist,
 			case 'w':
 				rv |= Manage_ro(dv->devname, mdfd, -1);
 				break;
+			case HideComponents:
+			case UnhideComponents:
+				rv |= Manage_hidden(mdfd,
+						    dv->disposition == HideComponents,
+						    NULL);
 			}
 			close(mdfd);
 		} else
diff --git a/mdadm.h b/mdadm.h
index 85947bf6..7d956a64 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -348,6 +348,7 @@ struct mdinfo {
 		ARRAY_UNKNOWN_STATE,
 	} array_state;
 	struct md_bb bb;
+	int hidden;
 };
 
 struct createinfo {
@@ -454,6 +455,8 @@ enum special_options {
 	ClusterConfirm,
 	WriteJournal,
 	ConsistencyPolicy,
+	HideComponents,
+	UnhideComponents,
 };
 
 enum prefix_standard {
@@ -510,6 +513,8 @@ struct mddev_ident {
 				 */
 	char	*member;	/* subarray within a container */
 
+	int	hide_components;
+
 	struct mddev_ident *next;
 	union {
 		/* fields needed by different users of this structure */
@@ -1338,6 +1343,7 @@ extern int Manage_stop(char *devname, int fd, int quiet,
 extern int Manage_subdevs(char *devname, int fd, struct mddev_dev *devlist,
 			  int verbose, int test, char *update,
 			  int force);
+extern int Manage_hidden(int fd, int hide, dev_t *rdev);
 extern int autodetect(void);
 extern int Grow_Add_device(char *devname, int fd, char *newdev);
 extern int Grow_addbitmap(char *devname, int fd,
diff --git a/super-intel.c b/super-intel.c
index 536cb613..d57d22c4 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -3847,6 +3847,7 @@ static void fd2devname(int fd, char *name)
 static int nvme_get_serial(int fd, void *buf, size_t buf_len)
 {
 	char path[60];
+	struct stat st;
 	char *name = fd2kname(fd);
 
 	if (!name)
@@ -3855,7 +3856,15 @@ static int nvme_get_serial(int fd, void *buf, size_t buf_len)
 	if (strncmp(name, "nvme", 4) != 0)
 		return 1;
 
-	snprintf(path, sizeof(path) - 1, "/sys/block/%s/device/serial", name);
+	snprintf(path, sizeof(path), "/sys/block/%s/device/serial", name);
+
+	if (stat(path, &st) != 0) {
+		if (fstat(fd, &st) != 0)
+			return 1;
+
+		snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/../serial",
+			 major(st.st_rdev), minor(st.st_rdev));
+	}
 
 	return load_sys(path, buf, buf_len);
 }
diff --git a/sysfs.c b/sysfs.c
index bf5c8c5d..f079d51b 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -326,6 +326,11 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
 			continue;
 		}
 
+		strcpy(dbase, "block/hidden");
+		if (load_sys(fname, buf, sizeof(buf)) == 0 &&
+		    strtoul(buf, NULL, 0) == 1)
+			dev->hidden = 1;
+
 		/* finally add this disk to the array */
 		*devp = dev;
 		devp = & dev->next;

^ permalink raw reply related	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04 11:23             ` Artur Paszkiewicz
@ 2017-10-04 17:30               ` Piergiorgio Sartor
  2017-10-04 18:03                 ` John Stoffel
  0 siblings, 1 reply; 22+ messages in thread
From: Piergiorgio Sartor @ 2017-10-04 17:30 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: NeilBrown, Mikael Abrahamsson, linux-raid

On Wed, Oct 04, 2017 at 01:23:02PM +0200, Artur Paszkiewicz wrote:
> On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
> > Applications like mdadm can use this to hide/unhide their component devices.
>
> And here is an example patch for mdadm. It adds options to manually hide or unhide the component devices:
>
> mdadm --hide-components /dev/md0
> mdadm --unhide-components /dev/md0

This seems to me already quite sensible.

> And an option for mdadm.conf that automatically hides the array's member disks when assembling and when new disks are added:
>
> ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd name=linux-ns31:0 hide-components=yes
>
> Hidden disks (hidden by mdadm --hide-components or by config) should unhide when the array is stopped or disks are removed. It only works for whole devices, not partitions.

Well, this, on the other hand, makes it not really useful.

Furthermore, what about "cat /proc/mdstat"? What will it show in the case of hidden components?

Thanks,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
  2017-10-04 17:30               ` Piergiorgio Sartor
@ 2017-10-04 18:03                 ` John Stoffel
  0 siblings, 0 replies; 22+ messages in thread
From: John Stoffel @ 2017-10-04 18:03 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Artur Paszkiewicz, NeilBrown, Mikael Abrahamsson, linux-raid

>>>>> "Piergiorgio" == Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> writes:

Piergiorgio> On Wed, Oct 04, 2017 at 01:23:02PM +0200, Artur Paszkiewicz wrote:
>> On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
>> > Applications like mdadm can use this to hide/unhide their component devices.
>>
>> And here is an example patch for mdadm. It adds options to manually hide or unhide the component devices:
>>
>> mdadm --hide-components /dev/md0
>> mdadm --unhide-components /dev/md0

Piergiorgio> This seems to me already quite sensible.

>> And an option for mdadm.conf that automatically hides the array's member disks when assembling and when new disks are added:
>>
>> ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd name=linux-ns31:0 hide-components=yes
>>
>> Hidden disks (hidden by mdadm --hide-components or by config) should unhide when the array is stopped or disks are removed. It only works for whole devices, not partitions.

Piergiorgio> Well, this, on the other hand, makes it not really useful.

Yeah, this totally doesn't work or scale. If you're using partitions (why would you not?) as raid devices, even on a whole disk, you now limit yourself.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Phil Turmel @ 2017-10-04 21:18 UTC (permalink / raw)
To: Artur Paszkiewicz, NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 07:23 AM, Artur Paszkiewicz wrote:
> On 10/04/2017 01:02 PM, Artur Paszkiewicz wrote:
>> Applications like mdadm can use this to hide/unhide their component
>> devices.
>
> And here is an example patch for mdadm. It adds options to manually hide
> or unhide the component devices:
>
> mdadm --hide-components /dev/md0
> mdadm --unhide-components /dev/md0
>
> And an option for mdadm.conf that automatically hides the array's member
> disks when assembling and when new disks are added:
>
> ARRAY /dev/md/0 metadata=1.2 UUID=c2c4f8c6:cd775924:9cb2cc62:88fa45bd
> name=linux-ns31:0 hide-components=yes
>
> Hidden disks (by mdadm --hide-components or by config) should unhide
> when the array is stopped or disks are removed. It only works for whole
> devices, not partitions.

I am not at all in favor of these patches, as they break all kinds of
diagnostic tooling, such as "lsblk --tree" and my own "lsdrv" [1], and
certainly violate the principle of least surprise.

Please keep these out of the kernel, or better yet, explain to your
customers that visibility into the components is a key *advantage* of
software raid.

Regards,

Phil Turmel

[1] https://github.com/pturmel/lsdrv

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: NeilBrown @ 2017-10-04 21:41 UTC (permalink / raw)
To: Artur Paszkiewicz, Mikael Abrahamsson; +Cc: linux-raid

On Wed, Oct 04 2017, Artur Paszkiewicz wrote:
>
> Also, our customers are asking specifically for an option to "hide" the
> member drives. Making it impossible to write to the devices is ok, but
> they would like to see only the md device, "like hardware raid". One of
> the reasons is that some monitoring or inventory scan applications treat
> md arrays and their components as separate storage devices. They should
> probably modify their software but maybe that's not possible.

What exactly is meant by "hide"? How, exactly, does this software scan
all devices? /proc/partitions? /sys/block? /sys/class/block?
/dev/disks? All of the above?

Modifying their application *must* be on the table, else modifying the
kernel is *not* on the table. I'm certainly open to enhancing the
kernel so that it is easy to skip a particular class of devices, but if
their application chooses to ignore the information the kernel provides,
then the fact that the application doesn't work is their problem, not
ours.

> I've been experimenting with different solutions and here is a patch
> that allows to "hide" disk devices and their partitions by writing to a
> sysfs attribute. It removes the device nodes (on devtmpfs) and the links
> in /sys/block/, and removes the devices from the block class list, so
> they won't be included in places like /proc/partitions, /sys/class/block/,
> lsblk and so on. The device's "real" sysfs directory under /sys/devices/
> is still available, and the links in /sys/dev/block/ are preserved.
> Applications like mdadm can use this to hide/unhide their component
> devices.

Can you confirm that this addresses the customer problem? Do you know
which of these lists their code looks at?
Thanks,
NeilBrown

>
> Thanks,
> Artur
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 7f520fa25d16..58b8e3eb14af 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -685,6 +685,55 @@ void del_gendisk(struct gendisk *disk)
>  }
>  EXPORT_SYMBOL(del_gendisk);
>  
> +static int hide_disk(struct gendisk *disk)
> +{
> +	struct device *ddev = disk_to_dev(disk);
> +	struct disk_part_iter piter;
> +	struct hd_struct *part;
> +	int ret;
> +
> +	ret = device_hide(ddev);
> +	if (ret)
> +		return ret;
> +
> +	disk_part_iter_init(&piter, disk,
> +			    DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
> +	while ((part = disk_part_iter_next(&piter)))
> +		device_hide(part_to_dev(part));
> +	disk_part_iter_exit(&piter);
> +
> +	if (!sysfs_deprecated)
> +		sysfs_remove_link(block_depr, dev_name(ddev));
> +
> +	return ret;
> +}
> +
> +static int unhide_disk(struct gendisk *disk)
> +{
> +	struct device *ddev = disk_to_dev(disk);
> +	struct disk_part_iter piter;
> +	struct hd_struct *part;
> +	int ret;
> +
> +	ret = device_unhide(ddev);
> +	if (ret)
> +		return ret;
> +
> +	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
> +	while ((part = disk_part_iter_next(&piter)))
> +		device_unhide(part_to_dev(part));
> +	disk_part_iter_exit(&piter);
> +
> +	if (!sysfs_deprecated) {
> +		if (sysfs_create_link(block_depr, &ddev->kobj,
> +				      kobject_name(&ddev->kobj)))
> +			pr_warn("%s: failed to restore /sys/block link\n",
> +				disk->disk_name);
> +	}
> +
> +	return ret;
> +}
> +
>  /* sysfs access to bad-blocks list. */
>  static ssize_t disk_badblocks_show(struct device *dev,
>  				   struct device_attribute *attr,
> @@ -1017,6 +1066,31 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
>  	return sprintf(buf, "%d\n", queue_discard_alignment(disk->queue));
>  }
>  
> +static ssize_t disk_hidden_show(struct device *dev,
> +				struct device_attribute *attr,
> +				char *buf)
> +{
> +	return sprintf(buf, "%d\n", device_is_hidden(dev));
> +}
> +
> +static ssize_t disk_hidden_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t len)
> +{
> +	struct gendisk *disk = dev_to_disk(dev);
> +	bool hide;
> +	int ret;
> +
> +	ret = kstrtobool(buf, &hide);
> +	if (ret)
> +		return ret;
> +
> +	if (hide != device_is_hidden(dev))
> +		ret = hide ? hide_disk(disk) : unhide_disk(disk);
> +
> +	return ret ?: len;
> +}
> +
>  static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
>  static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
>  static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
> @@ -1030,6 +1104,8 @@ static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
>  static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
>  static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
>  		   disk_badblocks_store);
> +static DEVICE_ATTR(hidden, S_IRUGO | S_IWUSR, disk_hidden_show,
> +		   disk_hidden_store);
>  #ifdef CONFIG_FAIL_MAKE_REQUEST
>  static struct device_attribute dev_attr_fail =
>  	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
> @@ -1058,6 +1134,7 @@ static struct attribute *disk_attrs[] = {
>  #ifdef CONFIG_FAIL_IO_TIMEOUT
>  	&dev_attr_fail_timeout.attr,
>  #endif
> +	&dev_attr_hidden.attr,
>  	NULL
>  };
>  
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index c5ec8246e25e..0223a37c7a8c 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -350,6 +350,9 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	if (err)
>  		goto out_put;
>  
> +	if (device_is_hidden(ddev))
> +		device_hide(pdev);
> +
>  	err = -ENOMEM;
>  	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
>  	if (!p->holder_dir)
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 755451f684bc..0804a45b8fbf 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1871,6 +1871,71 @@ void device_del(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(device_del);
>  
> +DEFINE_KLIST(klist_hidden_devices, NULL, NULL);
> +
> +bool device_is_hidden(struct device *dev)
> +{
> +	bool ret = false;
> +
> +	if (dev->class) {
> +		mutex_lock(&dev->class->p->mutex);
> +		ret = (dev->knode_class.n_klist == &klist_hidden_devices);
> +		mutex_unlock(&dev->class->p->mutex);
> +	}
> +	return ret;
> +}
> +
> +int device_hide(struct device *dev)
> +{
> +	char *envp[] = { "EVENT=hide", NULL };
> +
> +	if (!dev->class)
> +		return -EINVAL;
> +
> +	if (MAJOR(dev->devt))
> +		devtmpfs_delete_node(dev);
> +
> +	device_remove_class_symlinks(dev);
> +
> +	mutex_lock(&dev->class->p->mutex);
> +	/* remove the device from the class list */
> +	klist_remove(&dev->knode_class);
> +	/* add to the hidden devices list */
> +	klist_add_tail(&dev->knode_class, &klist_hidden_devices);
> +	mutex_unlock(&dev->class->p->mutex);
> +
> +	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
> +
> +	return 0;
> +}
> +
> +int device_unhide(struct device *dev)
> +{
> +	char *envp[] = { "EVENT=unhide", NULL };
> +	int err;
> +
> +	if (!dev->class)
> +		return -EINVAL;
> +
> +	err = device_add_class_symlinks(dev);
> +	if (err)
> +		return err;
> +
> +	if (MAJOR(dev->devt))
> +		devtmpfs_create_node(dev);
> +
> +	mutex_lock(&dev->class->p->mutex);
> +	/* remove the device from the hidden devices list */
> +	klist_remove(&dev->knode_class);
> +	/* tie the class to the device */
> +	klist_add_tail(&dev->knode_class, &dev->class->p->klist_devices);
> +	mutex_unlock(&dev->class->p->mutex);
> +
> +	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
> +
> +	return err;
> +}
> +
>  /**
>   * device_unregister - unregister device from system.
>   * @dev: device going away.
> diff --git a/include/linux/device.h b/include/linux/device.h
> index beabdbc08420..90ab1e6b63c6 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1118,6 +1118,9 @@ extern void device_unregister(struct device *dev);
>  extern void device_initialize(struct device *dev);
>  extern int __must_check device_add(struct device *dev);
>  extern void device_del(struct device *dev);
> +extern bool device_is_hidden(struct device *dev);
> +extern int device_hide(struct device *dev);
> +extern int device_unhide(struct device *dev);
>  extern int device_for_each_child(struct device *dev, void *data,
>  				 int (*fn)(struct device *dev, void *data));
>  extern int device_for_each_child_reverse(struct device *dev, void *data,

^ permalink raw reply	[flat|nested] 22+ messages in thread
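To make the proposed interface concrete: assuming the patch above lands
as-is, driving it by hand would look roughly like this. This is a sketch,
not from the thread; "sdb" and its 8:16 major:minor are just examples,
and the attribute path follows from the DEVICE_ATTR placement above.

    # Hide sdb and all its partitions: the /dev nodes and the
    # /sys/block/sdb and /sys/class/block/sdb links go away, but
    # /sys/devices/.../sdb and /sys/dev/block/8:16 remain, per the
    # patch description.
    echo 1 > /sys/block/sdb/hidden

    # While hidden, /sys/block/sdb is gone, so unhide through the
    # preserved /sys/dev/block link to the same device directory.
    echo 0 > /sys/dev/block/8:16/hidden
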
* Re: Linux Plumbers MD BOF discussion notes
From: Artur Paszkiewicz @ 2017-10-05 18:52 UTC (permalink / raw)
To: NeilBrown, Mikael Abrahamsson; +Cc: linux-raid

On 10/04/2017 11:41 PM, NeilBrown wrote:
> On Wed, Oct 04 2017, Artur Paszkiewicz wrote:
>
>> Also, our customers are asking specifically for an option to "hide" the
>> member drives. Making it impossible to write to the devices is ok, but
>> they would like to see only the md device, "like hardware raid". One of
>> the reasons is that some monitoring or inventory scan applications treat
>> md arrays and their components as separate storage devices. They should
>> probably modify their software but maybe that's not possible.
>
> What exactly is meant by "hide"? How, exactly, does this software scan
> all devices? /proc/partitions? /sys/block? /sys/class/block?
> /dev/disks? All of the above?
>
> Modifying their application *must* be on the table, else modifying the
> kernel is *not* on the table. I'm certainly open to enhancing the
> kernel so that it is easy to skip a particular class of devices, but if
> their application chooses to ignore the information the kernel provides,
> then the fact that the application doesn't work is their problem, not
> ours.
>
>> I've been experimenting with different solutions and here is a patch
>> that allows to "hide" disk devices and their partitions by writing to a
>> sysfs attribute. It removes the device nodes (on devtmpfs) and the links
>> in /sys/block/, and removes the devices from the block class list, so
>> they won't be included in places like /proc/partitions, /sys/class/block/,
>> lsblk and so on. The device's "real" sysfs directory under /sys/devices/
>> is still available, and the links in /sys/dev/block/ are preserved.
>> Applications like mdadm can use this to hide/unhide their component
>> devices.
>
> Can you confirm that this addresses the customer problem? Do you know
> which of these lists their code looks at?

Yes, this is what they asked for. I know that at least /proc/partitions
and /sys/class/block are used. But this is not a single customer (or
application) case; it keeps coming up again and again... Of course,
educating users about the specifics of md, and that not hiding the
drives is actually an advantage, always comes first. Some get it and
would be ok with just blocking write access to the drives, but others
really want this hiding approach.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: NeilBrown @ 2017-10-05 23:39 UTC (permalink / raw)
To: Artur Paszkiewicz, Mikael Abrahamsson; +Cc: linux-raid

On Thu, Oct 05 2017, Artur Paszkiewicz wrote:

> On 10/04/2017 11:41 PM, NeilBrown wrote:
>> On Wed, Oct 04 2017, Artur Paszkiewicz wrote:
>>
>>> Also, our customers are asking specifically for an option to "hide" the
>>> member drives. Making it impossible to write to the devices is ok, but
>>> they would like to see only the md device, "like hardware raid". One of
>>> the reasons is that some monitoring or inventory scan applications treat
>>> md arrays and their components as separate storage devices. They should
>>> probably modify their software but maybe that's not possible.
>>
>> What exactly is meant by "hide"? How, exactly, does this software scan
>> all devices? /proc/partitions? /sys/block? /sys/class/block?
>> /dev/disks? All of the above?
>>
>> Modifying their application *must* be on the table, else modifying the
>> kernel is *not* on the table. I'm certainly open to enhancing the
>> kernel so that it is easy to skip a particular class of devices, but if
>> their application chooses to ignore the information the kernel provides,
>> then the fact that the application doesn't work is their problem, not
>> ours.
>>
>>> I've been experimenting with different solutions and here is a patch
>>> that allows to "hide" disk devices and their partitions by writing to a
>>> sysfs attribute. It removes the device nodes (on devtmpfs) and the links
>>> in /sys/block/, and removes the devices from the block class list, so
>>> they won't be included in places like /proc/partitions, /sys/class/block/,
>>> lsblk and so on. The device's "real" sysfs directory under /sys/devices/
>>> is still available, and the links in /sys/dev/block/ are preserved.
>>> Applications like mdadm can use this to hide/unhide their component
>>> devices.
>>
>> Can you confirm that this addresses the customer problem? Do you know
>> which of these lists their code looks at?
>
> Yes, this is what they asked for. I know that at least /proc/partitions
> and /sys/class/block are used. But this is not a single customer (or
> application) case; it keeps coming up again and again... Of course,
> educating users about the specifics of md, and that not hiding the
> drives is actually an advantage, always comes first. Some get it and
> would be ok with just blocking write access to the drives, but others
> really want this hiding approach.

Hmmm.. interesting that it is multiple applications.
Given that both md and dm have been around for years and have layered on
top of existing devices without hiding them, it is hard to see how these
applications have not run into problems before.

There is some precedent for hiding things from /proc/partitions.
Removable devices like CDROMs are hidden, and you can easily hide
individual devices by setting GENHD_FL_SUPPRESS_PARTITION_INFO.
We might be able to get that through. It is certainly worth writing a
patch and letting people experiment with it.

Removing things from /sys/class/block is much less likely to be
acceptable. Code really shouldn't be digging around in there without
knowing what it is doing. It should be trivial to ignore any entry in
/sys/class/block where the 'holders' directory is not empty, and that
should achieve the goal. For users of /sys/class/block I think you
really need to push back on the customers and tell them their
application is broken.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 22+ messages in thread
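A minimal sketch of the filtering NeilBrown suggests above, as an
inventory-style tool might implement it (not from the thread; shell is
used for brevity):

    # Enumerate block devices, skipping anything another device holds:
    # a non-empty holders/ directory means md, dm or similar sits on top.
    for dev in /sys/class/block/*; do
        [ -n "$(ls -A "$dev/holders" 2>/dev/null)" ] && continue
        echo "${dev##*/}"
    done
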
* Re: Linux Plumbers MD BOF discussion notes
From: Christoph Hellwig @ 2017-10-06 7:13 UTC (permalink / raw)
To: NeilBrown; +Cc: Artur Paszkiewicz, Mikael Abrahamsson, linux-raid

On Fri, Oct 06, 2017 at 10:39:50AM +1100, NeilBrown wrote:
> Hmmm.. interesting that it is multiple applications.
> Given that both md and dm have been around for years and have layered on
> top of existing devices without hiding them, it is hard to see how these
> applications have not run into problems before.
>
> There is some precedent for hiding things from /proc/partitions.
> Removable devices like CDROMs are hidden, and you can easily hide
> individual devices by setting GENHD_FL_SUPPRESS_PARTITION_INFO.
> We might be able to get that through. It is certainly worth writing a
> patch and letting people experiment with it.

No way in hell this would get through. Preemptive NAK right here.

That whole idea is just amazingly stupid, and no one has even explained
a reason for it.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Mikael Abrahamsson @ 2017-10-06 7:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: NeilBrown, Artur Paszkiewicz, linux-raid

On Fri, 6 Oct 2017, Christoph Hellwig wrote:

> That whole idea is just amazingly stupid, and no one has even explained
> a reason for it.

It's all about making it easier for the people who use the functionality
we have, and about reducing the risk of mistakes. See my previous email
about the people I have seen who would have benefited from the active
component drives being not writable (or hidden).

Yes, I know this goes against the old Unix "don't try to stop the user
from shooting themselves in the foot", but I do think this should be
user-configurable.

--
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Piergiorgio Sartor @ 2017-10-04 17:28 UTC (permalink / raw)
To: NeilBrown; +Cc: Mikael Abrahamsson, linux-raid

On Wed, Oct 04, 2017 at 11:49:00AM +1100, NeilBrown wrote:
> On Sun, Oct 01 2017, Mikael Abrahamsson wrote:
>
> > On Mon, 18 Sep 2017, NeilBrown wrote:
> >
> >> Anyway, thanks for the example of a real problem related to this. It
> >> does make it easier to think about.
> >
> > Btw, if someone does --zero-superblock or dd /dev/zero to a component
> > device that is active, what happens when mdadm --stop /dev/mdX is run?
> > Does it write out the complete superblock again?
>
> --zero-superblock won't work on a device that is currently part of an
> array. dd /dev/zero will.
> When the array is stopped the metadata will be written if the array is
> not read-only and is not clean.
> So for 'linear' and 'raid0' it is never written. For others it probably
> is, but may not be.
> I'm not sure that forcing a write makes sense. A dd could corrupt lots
> of stuff, and just saving the metadata is not a big win.
>
> I've been playing with some code, and this patch makes it impossible to
> write to a device which is in use by md.
> Well... not exactly. If a partition is in use by md, the whole device
> can still be written to. But the partition itself cannot.
> Also, if metadata is managed by user space, writes are still allowed.
> To fix that, we would need to capture each write request and validate
> the sector range. Not impossible, but ugly.
>
> Also, by itself, this patch breaks the use of raid6check on an active
> array. We could fix that by enabling writes whenever a region is
> suspended.

Maybe you all have to make up your minds on how to handle md devices
and their components.

We had long discussions about "not having code in kernel space", to
avoid useless burden, and using user space instead.
Now, someone discovers that user space is very dangerous and should be
blocked.

So, what should we do? Add an interface to the md devices in order to
access the components? Will this really be safe against clueless people
trying "dd" here and there?

I think if someone destroys a RAID using "dd" on the individual
components, he/she deserves it. I made similar mistakes, and I would
not blame md for them.
And having "mdadm" protect against things like "--zero-superblock" is
fine, correct, and exactly the safety net that is needed.

In order to conclude, please decide on kernel vs. user space approaches
*before* making changes.

Thanks!

> Still... maybe it is a starting point for thinking about the problem.

Yes, you're right,

bye,

--
piergiorgio

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Linux Plumbers MD BOF discussion notes
From: Anthony Youngman @ 2017-10-04 18:13 UTC (permalink / raw)
To: Piergiorgio Sartor, NeilBrown; +Cc: Mikael Abrahamsson, linux-raid

On 04/10/17 18:28, Piergiorgio Sartor wrote:
> In order to conclude, please decide on kernel vs. user space approaches
> *before* making changes.
>
> Thanks!
>
>> Still... maybe it is a starting point for thinking about the problem.
>
> Yes, you're right,

Throwing a few ideas into the mix - shoot them down if they seem too
outrageous :-)

1) Can we apply xattrs to devices in /dev?
2) Do xattrs lock out root?

Okay, this won't necessarily hide component devices, but if assembling
an array (optionally) applies xattrs to them, and makes them writeable
*only* by user "mdadm", then that will help prevent a load of damage.

Of course, that doesn't protect a partition against being overwritten
by a write to the underlying drive, but that's a whole 'nother can of
worms...

And of course, the question is whether this will lock out the
operations we don't want, while allowing the stuff that we do...

Provided xattrs do lock out root from user space, this looks to me to
be the obvious route to go down, but what do I know?

Cheers,
Wol

^ permalink raw reply	[flat|nested] 22+ messages in thread
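On question 1 above: POSIX ACLs are one xattr-backed mechanism that can
express the "writable only by user mdadm" idea today, since they are
stored in the system.posix_acl_access xattr and devtmpfs (being
tmpfs-backed) supports them. A sketch, with "mdadm" as a hypothetical
dedicated account; on question 2, the answer is no - root bypasses
these permission checks via CAP_DAC_OVERRIDE:

    # Strip group/other access and grant the dedicated user rw;
    # the ACL lives in an xattr on the device node.
    chown root:root /dev/sdb
    chmod 600 /dev/sdb
    setfacl -m u:mdadm:rw /dev/sdb
    getfacl /dev/sdb    # verify the resulting ACL
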
* Re: Linux Plumbers MD BOF discussion notes
From: Wols Lists @ 2017-09-18 13:57 UTC (permalink / raw)
To: Mikael Abrahamsson, NeilBrown; +Cc: linux-raid

On 18/09/17 08:04, Mikael Abrahamsson wrote:
> If we could disable writing to the drives/partitions from regular
> userspace when they're handled by md, that could be some kind of middle
> ground. I guess most tools don't use O_EXCL.

It would make sense to set permissions to read-only, but of course on
Unix root just ignores permissions :-(

Oh for Pr1mos, where the only right that couldn't be taken away from
the super-user was the right to set rights :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 22+ messages in thread
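One existing knob that gets close to the "disable writing from regular
userspace" middle ground without new kernel code is the block-layer
read-only flag (BLKROSET). A sketch; note that on kernels of this era
the flag is enforced when a device is opened for writing, so md's
already-open file descriptor keeps working, and root can simply flip
the flag back - a guard rail rather than a lock:

    # Mark the member read-only; later attempts to open it for
    # writing fail, even for root, until the flag is cleared.
    blockdev --setro /dev/sdb1
    blockdev --getro /dev/sdb1   # prints 1

    # Clear it again, e.g. after stopping the array.
    blockdev --setrw /dev/sdb1
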
end of thread, other threads:[~2017-10-06  7:59 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2017-09-15 14:27 Linux Plumbers MD BOF discussion notes Shaohua Li
2017-09-15 20:42 ` Coly Li
2017-09-15 21:20   ` Shaohua Li
2017-09-16  0:08 ` NeilBrown
2017-09-18  4:54   ` Shaohua Li
2017-09-18  7:04     ` Mikael Abrahamsson
2017-09-18  8:56       ` NeilBrown
2017-10-01  5:32         ` Mikael Abrahamsson
2017-10-04  0:49           ` NeilBrown
2017-10-04 11:02             ` Artur Paszkiewicz
2017-10-04 11:23               ` Artur Paszkiewicz
2017-10-04 17:30                 ` Piergiorgio Sartor
2017-10-04 18:03                   ` John Stoffel
2017-10-04 21:18                 ` Phil Turmel
2017-10-04 21:41               ` NeilBrown
2017-10-05 18:52                 ` Artur Paszkiewicz
2017-10-05 23:39                   ` NeilBrown
2017-10-06  7:13                     ` Christoph Hellwig
2017-10-06  7:59                       ` Mikael Abrahamsson
2017-10-04 17:28             ` Piergiorgio Sartor
2017-10-04 18:13               ` Anthony Youngman
2017-09-18 13:57     ` Wols Lists