* Adding/removing multi-pathed disk partitions
@ 2005-01-25 21:07 goggin, edward
2005-01-25 22:34 ` christophe varoqui
2005-02-17 22:38 ` Christophe Varoqui
0 siblings, 2 replies; 5+ messages in thread
From: goggin, edward @ 2005-01-25 21:07 UTC (permalink / raw)
To: 'dm-devel@redhat.com'
Should one be using the multi-path device-mapper mapped device
name or one of the SCSI target device names for the whole disk
device when modifying a SCSI disk's partition table with fdisk(8)?
If the former, drivers/block/ioctl.c:blkdev_reread_part()
currently returns EINVAL since the mapped device is not
partitionable. I can't imagine the recommendation is the
latter, since there is no mechanism in place to synchronize
the kernel's partition table handling amongst multiple target
devices.
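A minimal sketch of the call in question, for anyone wanting to reproduce the behaviour: it issues the same re-read ioctl that fdisk(8) issues after writing a table and reports the errno the kernel returns. The constant is BLKRRPART from <linux/fs.h>; the function names and the map name in the usage note are hypothetical.

```python
import errno
import fcntl
import os

# BLKRRPART is _IO(0x12, 95) in <linux/fs.h>: "re-read partition table".
BLKRRPART = 0x125F


def reread_partition_table(path):
    """Ask the kernel to re-read the partition table of `path`.

    Returns 0 on success, or the errno value the kernel reported.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)


def describe(rc):
    """Turn the return code into a short human-readable string."""
    if rc == 0:
        return "partition table re-read"
    return "failed: %s" % errno.errorcode.get(rc, str(rc))
```

Calling describe(reread_partition_table("/dev/mapper/mpath0")) on a multipath map (the map name is made up) would be expected to report EINVAL on the kernels discussed here, per the description above.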
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
* Re: Adding/removing multi-pathed disk partitions
2005-01-25 21:07 Adding/removing multi-pathed disk partitions goggin, edward
@ 2005-01-25 22:34 ` christophe varoqui
2005-01-26 8:52 ` Lars Marowsky-Bree
2005-02-17 22:38 ` Christophe Varoqui
1 sibling, 1 reply; 5+ messages in thread
From: christophe varoqui @ 2005-01-25 22:34 UTC (permalink / raw)
To: device-mapper development
On Tue, 2005-01-25 at 16:07 -0500, goggin, edward wrote:
> Should one be using the multi-path device-mapper mapped device
> name or one of the SCSI target device names for the whole disk
> device when modifying a SCSI disk's partition table with fdisk(8)?
>
> If the former, drivers/block/ioctl.c:blkdev_reread_part()
> currently returns EINVAL since the mapped device is not
> partitionable. I can't imagine the recommendation is the
> latter, since there is no mechanism in place to synchronize
> the kernel's partition table handling amongst multiple target
> devices.
>
Maybe that's not a problem, as we don't need the kernel partitioning
code. Though it's not nice to have the kernel not in sync with reality.
If kpartx reads the partition table directly from disk, i.e. bypassing the
cache, the correct layout will be mapped. Note it requires a manual
execution, as no hotplug event is generated upon fdisk's sync. This also
applies to "blockdev --rereadpt" anyway.
This trick should work while leaving all the involved code mostly
untouched.
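To make the "correct layout will be mapped" step concrete, here is a rough sketch of decoding the four primary entries of a DOS/MBR table from a raw first sector and turning one into a device-mapper linear table line. It is an illustration only: kpartx handles more formats than this, and the function names are hypothetical.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446        # the four primary entries start here
ENTRY_SIZE = 16
SIGNATURE = b"\x55\xaa"   # boot signature at bytes 510-511


def parse_mbr(sector):
    """Return (index, type, start_lba, sector_count) for each used primary entry."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != SIGNATURE:
        raise ValueError("not a DOS/MBR partition table")
    partitions = []
    for i in range(4):
        begin = TABLE_OFFSET + i * ENTRY_SIZE
        entry = sector[begin:begin + ENTRY_SIZE]
        # Layout: status, CHS start (3 bytes), type, CHS end (3 bytes),
        # then little-endian 32-bit start LBA and sector count.
        _status, ptype, start_lba, count = struct.unpack("<B3xB3xII", entry)
        if ptype != 0 and count != 0:
            partitions.append((i + 1, ptype, start_lba, count))
    return partitions


def linear_table(parent, partition):
    """Build a dm "linear" table line mapping one partition onto `parent`."""
    _index, _ptype, start_lba, count = partition
    return "0 %d linear %s %d" % (count, parent, start_lba)
```

The parent would be the multipath map itself (given as a path or major:minor), so the partition maps sit on top of the multipath device rather than on any single path.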
Now that's not necessarily satisfying. The weak spots of the situation
are:
1) it imposes a manual kpartx run on admins
2) the kernel displays bogus partitioning info
3) the admin has to know he must run fdisk on a path and not on the
multipath
What can we do to improve that?
3) would be solved by using "sfdisk --no-reread" on the multipath map
instead of fdisk. It would need to be a FAQ item (near the top).
1) could be addressed the same way BLKRRPART has made its way into
fdisk and the like: we could push kpartx awareness into them.
2) flames to me for suggesting we could get away altogether with the
kernel partitioning code?
All insights are more than welcome.
regards,
--
christophe varoqui <christophe.varoqui@free.fr>
* Re: Adding/removing multi-pathed disk partitions
2005-01-25 22:34 ` christophe varoqui
@ 2005-01-26 8:52 ` Lars Marowsky-Bree
0 siblings, 0 replies; 5+ messages in thread
From: Lars Marowsky-Bree @ 2005-01-26 8:52 UTC (permalink / raw)
To: device-mapper development
On 2005-01-25T23:34:25, christophe varoqui <christophe.varoqui@free.fr> wrote:
Yeah, the issue of partitioning device mapper targets has come up here
too. It's definitely an important discussion to have right now.
> Maybe that's not a problem, as we don't need the kernel partitioning
> code. Though it's not nice to have the kernel not in sync with reality.
This is the first issue: The lower level devices not showing the "right"
partition table when the partition table is changed via the higher level
device (in this case, dm multipath).
I think this probably needs to be solved in-kernel: When DM bd_claim()s
the whole disk (ie, /dev/sda), the kernel should unmap /dev/sda[0-9]+ -
bd_claim() has declared this one owner (the multipath table) to be the
one way of accessing the device, and that's that.
When the multipath table is released, the kernel can rereadpt the device
and the partition table entries would reappear.
(One issue: If any partition on the device was already bd_claim()ed at
the time where multipath tried to bd_claim the whole disk, the full disk
claim should error out.)
> If kpartx reads the partition table directly from disk, i.e. bypassing the
> cache, the correct layout will be mapped. Note it requires a manual
> execution, as no hotplug event is generated upon fdisk's sync. This also
> applies to "blockdev --rereadpt" anyway.
Now, second issue: Having the kpartx generated mappings be updated when
someone uses the fdisk/sfdisk etc tools. This one is reasonably simple:
Generate a hotplug event for the rereadpt ioctl() and map it to kpartx
rescanning the table. Et voila.
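As a rough illustration of that mapping, a hotplug agent could look something like the sketch below. The event itself does not exist yet, so the "change" action and the SUBSYSTEM/DEVNAME environment variables are assumptions about what such an event would carry; "kpartx -a" is the existing option for adding partition mappings.

```python
import subprocess


def kpartx_command(env):
    """Return the kpartx command to run for a hotplug event, or None.

    `env` is the environment handed to the hotplug agent. The "change"
    action for a partition-table re-read is an assumption: the thread
    only proposes that such an event be generated.
    """
    if env.get("SUBSYSTEM") != "block" or env.get("ACTION") != "change":
        return None
    devname = env.get("DEVNAME")
    if not devname:
        return None
    # "kpartx -a" adds partition mappings for the given whole-disk device.
    return ["kpartx", "-a", devname]


def handle_event(env, run=subprocess.call):
    """Run kpartx for a matching event; `run` is injectable for testing."""
    cmd = kpartx_command(env)
    return None if cmd is None else run(cmd)
```

A real agent would also have to remove mappings for partitions that no longer exist, which this sketch leaves out.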
There's a third issue, namely being able to figure out from user-space
that a /dev/498984348484304p1 is actually a partition of the ...304
device. (This might be needed for grouping the partition entries
correctly in various admin tools.) But, short of storing a list of
"parent" dev entries in sysfs, I think this would need to be solved by
them parsing the DM table if they really cared...
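A sketch of what "parsing the DM table" could look like for an admin tool, assuming the partitions are single-segment linear maps and the input is the text printed by "dmsetup table" (name, start, length, target, then target arguments). The function name and the map names in the example are made up.

```python
def linear_parents(table_text):
    """Map each single-segment linear DM device to its (parent, offset).

    Expects lines in the form printed by "dmsetup table":
        <name>: <start> <length> <target> <args...>
    Devices with several segments print one line per segment; this
    sketch would keep only the last one, so it is not a general parser.
    """
    parents = {}
    for line in table_text.splitlines():
        name, sep, rest = line.partition(": ")
        fields = rest.split()
        if not sep or len(fields) != 5 or fields[2] != "linear":
            continue
        _start, _length, _target, parent, offset = fields
        parents[name] = (parent, int(offset))
    return parents
```

For example, a line such as "mpath0p1: 0 1000 linear 253:0 63" yields the parent 253:0, which a tool can then match against the major:minor of the multipath map to group the entries.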
> 2) flames to me for suggesting we could get away altogether with the
> kernel partitioning code ?
This one has been suggested a number of times to me. ie, use the
in-kernel partition code to partition the DM (multipath) device, much
like the md entries can be partitioned.
For a while, I actually thought this was a good idea ;-) But then I've
had to reconsider. It's much more powerful to have the partitions be
feature-complete DMs, because then they can be remapped, snapshotted and
so on; so eventually one of the volume managers would want to implement
them this way, and then all the issues would pop up again. I'd rather
solve them like this right now.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* RE: Adding/removing multi-pathed disk partitions
@ 2005-01-31 14:05 goggin, edward
0 siblings, 0 replies; 5+ messages in thread
From: goggin, edward @ 2005-01-31 14:05 UTC (permalink / raw)
To: 'dm-devel@redhat.com'
> Some assumptions are not that clear to me, so let me confront them to
> your knowledge :
>
I changed the meaningless subject line I used two days ago into
something meaningful.
> 1) A Logical Unit can change its H/B/T/L mapping.
> 1.1) H/B/T through LUN masking reconfiguration
> 1.2) L through a logical unit reconfiguration
I think there are lots of ways to change the association
between a kernel's SCSI mid-layer's scsi_device data
structure and the SCSI logical unit to which it corresponds
in an FC SAN, amongst them (1) switch zoning, (2) storage
system LUN masking, (3) storage system LUN reconfiguration,
and (4) inadvertent re-cabling errors at initiators, switches,
or targets.
> 2) These topology changes can happen between two checks of
> the affected path
Absolutely.
> 3) No path failure is seen by the device mapper during the topology
> reconfiguration
This is my understanding, assuming no I/Os are directed to the unit
during this period of time.
> 4) Not failing or isolating the changed path from its previous multipath
> map will lead to unrecoverable data corruption at the first submitted
> write IO routed through this path
I believe this to be true.
> 5) HBA drivers see the LU remapping events
>
The SCSI mid-layer should be seeing a UNIT_ATTENTION sense key
for all I/O directed to a SCSI logical unit with a "new" identity
before the UNIT_ATTENTION check condition is cleared.
> If all these assertions are legitimate, it might not be enough to check
> uid changes at pathcheck interval.
>
I agree with all the assertions above.
Because the consequences of the problem are so severe, I was
advocating that the multipathing software detect this condition
if it can do so with reasonable means, even though the potential
for the problem cannot be fully eliminated using these techniques.
I now agree with your suggestion below, that the problem is better
addressed at the SCSI mid-layer.
> It would seem safer to let the HBA driver error the first IO submitted to a
> changed LU *and* send an event (maybe through a transport class kobj) for
> userspace to reconfigure the maps.
>
I like your idea better. You are attempting to solve the problem
at a lower level where it can more likely be fully addressed.
Looks like the SCSI mid-layer may be already doing __most__ of the
"right thing" - but (1) the right thing isn't happening for EMC CLARiion
or Symmetrix logical units (maybe other storage also) because they are
not being treated as "removable" units by the mid-layer and (2) there
does not seem to be any refresh of the cached inquiry data
(vendor/model/rev) in the mid-layer's scsi_device data structure.
Not clear to me if these storage systems should be setting the RMB
bit of the standard inquiry reply (which CLARiion and Symmetrix are
not doing) or if the Linux SCSI mid-layer should just be treating all
SAN storage units as removable units, independent of the state of this bit.
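For reference, a small sketch of where these fields live in a standard INQUIRY response: the RMB bit is bit 7 of byte 1, and vendor, model (product) and revision occupy bytes 8-15, 16-31 and 32-35. The function names are hypothetical, and the comparison helper only illustrates what a refresh of the cached identity would have to check; it is not what the mid-layer does today.

```python
def parse_standard_inquiry(data):
    """Decode the identity fields of a standard INQUIRY response."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")
    return {
        "device_type": data[0] & 0x1F,       # peripheral device type
        "removable": bool(data[1] & 0x80),   # RMB bit
        "vendor": data[8:16].decode("ascii", "replace").strip(),
        "model": data[16:32].decode("ascii", "replace").strip(),
        "rev": data[32:36].decode("ascii", "replace").strip(),
    }


def identity_changed(cached, fresh):
    """Compare the vendor/model/rev of two decoded INQUIRY responses."""
    keys = ("vendor", "model", "rev")
    return any(cached[k] != fresh[k] for k in keys)
```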
The SCSI mid-layer (scsi_io_completion()) detects a SCSI sense key of
UNIT_ATTENTION after most any attempt to access the SCSI logical unit
with the "new" identity for the first time. As long as the logical unit
is viewed as "removable" media, most of the right thing happens, namely,
the I/O in question is failed and the device is marked to prevent any
further I/O. Apparently calling check_disk_change() from the next
sd_open() will at least clear the changed field of the scsi_device,
thereby allowing I/O to the device. But, the cached inquiry fields
are not updated (possibly via scsi_probe_lun()) to reflect the
possibly new device identity.
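To illustrate the check described above, here is a sketch of pulling the sense key and additional sense code out of a raw sense buffer, in either fixed or descriptor format. It is a user-space illustration with hypothetical function names, not the scsi_io_completion() logic; the two additional sense codes listed are the standard "INQUIRY DATA HAS CHANGED" and "REPORTED LUNS DATA HAS CHANGED" codes.

```python
UNIT_ATTENTION = 0x06

# Additional sense code / qualifier pairs that hint at a changed identity.
IDENTITY_HINTS = {
    (0x3F, 0x03): "INQUIRY DATA HAS CHANGED",
    (0x3F, 0x0E): "REPORTED LUNS DATA HAS CHANGED",
}


def decode_sense(sense):
    """Return (sense_key, asc, ascq) from fixed or descriptor format sense data."""
    if not sense:
        raise ValueError("empty sense buffer")
    response_code = sense[0] & 0x7F
    if response_code in (0x70, 0x71):        # fixed format
        if len(sense) < 14:
            raise ValueError("fixed format sense data too short")
        return sense[2] & 0x0F, sense[12], sense[13]
    if response_code in (0x72, 0x73):        # descriptor format
        if len(sense) < 4:
            raise ValueError("descriptor format sense data too short")
        return sense[1] & 0x0F, sense[2], sense[3]
    raise ValueError("unknown sense response code 0x%02x" % response_code)


def identity_hint(sense):
    """Return a description if the sense data suggests a new LU identity."""
    key, asc, ascq = decode_sense(sense)
    if key != UNIT_ATTENTION:
        return None
    return IDENTITY_HINTS.get((asc, ascq), "UNIT ATTENTION (other)")
```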
> Please comment abundantly.
>
> regards,
> cvaroqui
* Re: Adding/removing multi-pathed disk partitions
2005-01-25 21:07 Adding/removing multi-pathed disk partitions goggin, edward
2005-01-25 22:34 ` christophe varoqui
@ 2005-02-17 22:38 ` Christophe Varoqui
1 sibling, 0 replies; 5+ messages in thread
From: Christophe Varoqui @ 2005-02-17 22:38 UTC (permalink / raw)
To: device-mapper development
goggin, edward wrote:
>Should one be using the multi-path device-mapper mapped device
>name or one of the SCSI target device names for the whole disk
>device when modifying a SCSI disk's partition table with fdisk(8)?
>
>If the former, drivers/block/ioctl.c:blkdev_reread_part()
>currently returns EINVAL since the mapped device is not
>partitionable. I can't imagine the recommendation is the
>latter, since there is no mechanism in place to synchronize
>the kernel's partition table handling amongst multiple target
>devices.
>
>--
>
As discussed earlier, one facet of this problem is that kpartx does
buffered reads to grab the on-disk partition table, leading to
situations where a kpartx -> fdisk -> kpartx sequence does not guarantee
that the partitions-to-maps layout will reflect reality.
To address this particular problem, do you think adding O_DIRECT to the
open() call in kpartx.c would be enough?
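For illustration, a minimal sketch of an O_DIRECT read of the first sector, written in Python rather than as a patch to kpartx.c. O_DIRECT needs an aligned buffer and a transfer size that is a multiple of the logical block size, which is why the sketch reads into an anonymous, page-aligned mmap; the 4096-byte size and the function name are assumptions.

```python
import mmap
import os

SECTOR_SIZE = 512


def read_first_sector_direct(path, block_size=4096):
    """Read the first sector of `path` with O_DIRECT, bypassing the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # O_DIRECT needs an aligned buffer; anonymous mmap memory is page-aligned.
        buf = mmap.mmap(-1, block_size)
        try:
            nread = os.readv(fd, [buf])
            return bytes(buf[:min(nread, SECTOR_SIZE)])
        finally:
            buf.close()
    finally:
        os.close(fd)
```

The returned bytes can then be handed to whatever code decodes the partition table. Whether this alone is enough is exactly the open question: it avoids stale cached data on the read side, but says nothing about when fdisk's writes reach the disk.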
Regards,
cvaroqui