* [PATCH v3] mulitpath-tools Add purge capability to multipath-tools @ 2025-12-03 16:19 Brian Bunker 2025-12-04 19:33 ` Benjamin Marzinski 0 siblings, 1 reply; 7+ messages in thread From: Brian Bunker @ 2025-12-03 16:19 UTC (permalink / raw) To: dm-devel, mwilck, bmarzins; +Cc: Brian Bunker, Krishna Kant In the Linux kernel when volumes are disconnected, they are left behind to be re-enabled when the connection comes back. There are many cases where this behavior is not desirable. The goal is to have paths appear when volumes are connected. This is handled by target initiated rescans and posting of unit attentions. Secondarily when volumes are disconnected at the target, paths should be removed. This was discussed here in a kernel patch: https://marc.info/?l=linux-scsi&m=164426722617834&w=2 It was suggested there that userspace would be more advantageous in handling the device removal. An option was added to multipath-tools to purge devices that are disconnected at the target. A purge thread was implemented to handle the device removal when the path returns sense data (0x5/0x25/0x0) from the path checker. The option, purge_disconnected, is by default off but can be turned on by setting it to true. When enabled multipathd will log: Dec 3 09:10:05 init110-14 multipathd[54768]: 8:128: path removed from map 3624a93708ebb224d4a5e45c400011a43 ... Dec 3 09:10:14 init110-14 multipathd[54768]: purge cycle 2 complete: 3 purged, 0 failed, 0 pending Signed-off-by: Brian Bunker <brian@purestorage.com> Signed-off-by: Krishna Kant <krishna.kant@purestorage.com> --- libmultipath/checkers.c | 2 + libmultipath/checkers.h | 2 + libmultipath/checkers/tur.c | 10 + libmultipath/config.c | 2 + libmultipath/config.h | 3 + libmultipath/configure.c | 1 + libmultipath/defaults.h | 1 + libmultipath/dict.c | 14 ++ libmultipath/discovery.c | 3 +- libmultipath/io_err_stat.c | 1 + libmultipath/libmultipath.version | 1 + libmultipath/print.c | 2 + libmultipath/propsel.c | 16 ++ libmultipath/propsel.h | 1 + libmultipath/structs.h | 10 + libmultipath/sysfs.c | 58 ++++- libmultipath/sysfs.h | 2 + multipath/multipath.conf.5.in | 22 ++ multipathd/main.c | 402 +++++++++++++++++++++++++++++- 19 files changed, 545 insertions(+), 8 deletions(-) diff --git a/libmultipath/checkers.c b/libmultipath/checkers.c index e2eda58d..bb6ad1ee 100644 --- a/libmultipath/checkers.c +++ b/libmultipath/checkers.c @@ -43,6 +43,7 @@ static const char *checker_state_names[PATH_MAX_STATE] = { [PATH_TIMEOUT] = "timeout", [PATH_REMOVED] = "removed", [PATH_DELAYED] = "delayed", + [PATH_DISCONNECTED] = "disconnected", }; static LIST_HEAD(checkers); @@ -363,6 +364,7 @@ static const char *generic_msg[CHECKER_GENERIC_MSGTABLE_SIZE] = { [CHECKER_MSGID_DOWN] = " reports path is down", [CHECKER_MSGID_GHOST] = " reports path is ghost", [CHECKER_MSGID_UNSUPPORTED] = " doesn't support this device", + [CHECKER_MSGID_DISCONNECTED] = " no access to this device", }; const char *checker_message(const struct checker *c) diff --git a/libmultipath/checkers.h b/libmultipath/checkers.h index da91f499..903c3ebe 100644 --- a/libmultipath/checkers.h +++ b/libmultipath/checkers.h @@ -78,6 +78,7 @@ enum path_check_state { PATH_TIMEOUT, PATH_REMOVED, PATH_DELAYED, + PATH_DISCONNECTED, PATH_MAX_STATE }; @@ -113,6 +114,7 @@ enum { CHECKER_MSGID_DOWN, CHECKER_MSGID_GHOST, CHECKER_MSGID_UNSUPPORTED, + CHECKER_MSGID_DISCONNECTED, CHECKER_GENERIC_MSGTABLE_SIZE, CHECKER_FIRST_MSGID = 100, /* lowest msgid for checkers */ CHECKER_MSGTABLE_SIZE = 100, /* max msg table size for checkers */ diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c index 0010acf8..1bb9a992 100644 --- a/libmultipath/checkers/tur.c +++ b/libmultipath/checkers/tur.c @@ -199,6 +199,16 @@ retry: return PATH_PENDING; } } + else if (key == 0x5) { + /* Illegal request */ + if (asc == 0x25 && ascq == 0x00) { + /* + * LOGICAL UNIT NOT SUPPORTED + */ + *msgid = CHECKER_MSGID_DISCONNECTED; + return PATH_DISCONNECTED; + } + } *msgid = CHECKER_MSGID_DOWN; return PATH_DOWN; } diff --git a/libmultipath/config.c b/libmultipath/config.c index 8b424d18..2c66a788 100644 --- a/libmultipath/config.c +++ b/libmultipath/config.c @@ -470,6 +470,7 @@ merge_hwe (struct hwentry * dst, struct hwentry * src) merge_num(marginal_path_err_rate_threshold); merge_num(marginal_path_err_recheck_gap_time); merge_num(marginal_path_double_failed_time); + merge_num(purge_disconnected); snprintf(id, sizeof(id), "%s/%s", dst->vendor, dst->product); reconcile_features_with_options(id, &dst->features, @@ -517,6 +518,7 @@ merge_mpe(struct mpentry *dst, struct mpentry *src) merge_num(skip_kpartx); merge_num(max_sectors_kb); merge_num(ghost_delay); + merge_num(purge_disconnected); merge_num(uid); merge_num(gid); merge_num(mode); diff --git a/libmultipath/config.h b/libmultipath/config.h index 5b4ebf8c..618fa572 100644 --- a/libmultipath/config.h +++ b/libmultipath/config.h @@ -88,6 +88,7 @@ struct hwentry { int marginal_path_err_rate_threshold; int marginal_path_err_recheck_gap_time; int marginal_path_double_failed_time; + int purge_disconnected; int skip_kpartx; int max_sectors_kb; int ghost_delay; @@ -130,6 +131,7 @@ struct mpentry { int marginal_path_err_rate_threshold; int marginal_path_err_recheck_gap_time; int marginal_path_double_failed_time; + int purge_disconnected; int skip_kpartx; int max_sectors_kb; int ghost_delay; @@ -188,6 +190,7 @@ struct config { int marginal_path_err_rate_threshold; int marginal_path_err_recheck_gap_time; int marginal_path_double_failed_time; + int purge_disconnected; int uxsock_timeout; int strict_timing; int retrigger_tries; diff --git a/libmultipath/configure.c b/libmultipath/configure.c index baa13573..e4431d2f 100644 --- a/libmultipath/configure.c +++ b/libmultipath/configure.c @@ -355,6 +355,7 @@ int setup_map(struct multipath *mpp, char **params, struct vectors *vecs) select_max_sectors_kb(conf, mpp); select_ghost_delay(conf, mpp); select_flush_on_last_del(conf, mpp); + select_purge_disconnected(conf, mpp); sysfs_set_scsi_tmo(conf, mpp); marginal_pathgroups = conf->marginal_pathgroups; diff --git a/libmultipath/defaults.h b/libmultipath/defaults.h index 134b690a..d80dd7d4 100644 --- a/libmultipath/defaults.h +++ b/libmultipath/defaults.h @@ -58,6 +58,7 @@ #define DEFAULT_ALL_TG_PT ALL_TG_PT_OFF #define DEFAULT_RECHECK_WWID RECHECK_WWID_OFF #define DEFAULT_AUTO_RESIZE AUTO_RESIZE_NEVER +#define DEFAULT_PURGE_DISCONNECTED PURGE_DISCONNECTED_OFF /* Enable no foreign libraries by default */ #define DEFAULT_ENABLE_FOREIGN "NONE" diff --git a/libmultipath/dict.c b/libmultipath/dict.c index a06a6138..504b28b5 100644 --- a/libmultipath/dict.c +++ b/libmultipath/dict.c @@ -941,6 +941,16 @@ declare_hw_snprint(skip_kpartx, print_yes_no_undef) declare_mp_handler(skip_kpartx, set_yes_no_undef) declare_mp_snprint(skip_kpartx, print_yes_no_undef) +declare_def_handler(purge_disconnected, set_yes_no_undef) +declare_def_snprint_defint(purge_disconnected, print_yes_no_undef, + DEFAULT_PURGE_DISCONNECTED) +declare_ovr_handler(purge_disconnected, set_yes_no_undef) +declare_ovr_snprint(purge_disconnected, print_yes_no_undef) +declare_hw_handler(purge_disconnected, set_yes_no_undef) +declare_hw_snprint(purge_disconnected, print_yes_no_undef) +declare_mp_handler(purge_disconnected, set_yes_no_undef) +declare_mp_snprint(purge_disconnected, print_yes_no_undef) + declare_def_range_handler(remove_retries, 0, INT_MAX) declare_def_snprint(remove_retries, print_int) @@ -2227,6 +2237,7 @@ init_keywords(vector keywords) install_keyword("retrigger_delay", &def_retrigger_delay_handler, &snprint_def_retrigger_delay); install_keyword("missing_uev_wait_timeout", &def_uev_wait_timeout_handler, &snprint_def_uev_wait_timeout); install_keyword("skip_kpartx", &def_skip_kpartx_handler, &snprint_def_skip_kpartx); + install_keyword("purge_disconnected", &def_purge_disconnected_handler, &snprint_def_purge_disconnected); install_keyword("disable_changed_wwids", &deprecated_disable_changed_wwids_handler, &snprint_deprecated); install_keyword("remove_retries", &def_remove_retries_handler, &snprint_def_remove_retries); install_keyword("max_sectors_kb", &def_max_sectors_kb_handler, &snprint_def_max_sectors_kb); @@ -2310,6 +2321,7 @@ init_keywords(vector keywords) install_keyword("marginal_path_err_recheck_gap_time", &hw_marginal_path_err_recheck_gap_time_handler, &snprint_hw_marginal_path_err_recheck_gap_time); install_keyword("marginal_path_double_failed_time", &hw_marginal_path_double_failed_time_handler, &snprint_hw_marginal_path_double_failed_time); install_keyword("skip_kpartx", &hw_skip_kpartx_handler, &snprint_hw_skip_kpartx); + install_keyword("purge_disconnected", &hw_purge_disconnected_handler, &snprint_hw_purge_disconnected); install_keyword("max_sectors_kb", &hw_max_sectors_kb_handler, &snprint_hw_max_sectors_kb); install_keyword("ghost_delay", &hw_ghost_delay_handler, &snprint_hw_ghost_delay); install_keyword("all_tg_pt", &hw_all_tg_pt_handler, &snprint_hw_all_tg_pt); @@ -2355,6 +2367,7 @@ init_keywords(vector keywords) install_keyword("marginal_path_double_failed_time", &ovr_marginal_path_double_failed_time_handler, &snprint_ovr_marginal_path_double_failed_time); install_keyword("skip_kpartx", &ovr_skip_kpartx_handler, &snprint_ovr_skip_kpartx); + install_keyword("purge_disconnected", &ovr_purge_disconnected_handler, &snprint_ovr_purge_disconnected); install_keyword("max_sectors_kb", &ovr_max_sectors_kb_handler, &snprint_ovr_max_sectors_kb); install_keyword("ghost_delay", &ovr_ghost_delay_handler, &snprint_ovr_ghost_delay); install_keyword("all_tg_pt", &ovr_all_tg_pt_handler, &snprint_ovr_all_tg_pt); @@ -2400,6 +2413,7 @@ init_keywords(vector keywords) install_keyword("marginal_path_err_recheck_gap_time", &mp_marginal_path_err_recheck_gap_time_handler, &snprint_mp_marginal_path_err_recheck_gap_time); install_keyword("marginal_path_double_failed_time", &mp_marginal_path_double_failed_time_handler, &snprint_mp_marginal_path_double_failed_time); install_keyword("skip_kpartx", &mp_skip_kpartx_handler, &snprint_mp_skip_kpartx); + install_keyword("purge_disconnected", &mp_purge_disconnected_handler, &snprint_mp_purge_disconnected); install_keyword("max_sectors_kb", &mp_max_sectors_kb_handler, &snprint_mp_max_sectors_kb); install_keyword("ghost_delay", &mp_ghost_delay_handler, &snprint_mp_ghost_delay); install_sublevel_end(); diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c index 31db8758..46274f3f 100644 --- a/libmultipath/discovery.c +++ b/libmultipath/discovery.c @@ -2482,7 +2482,8 @@ int pathinfo(struct path *pp, struct config *conf, int mask) pp->state == PATH_UNCHECKED || pp->state == PATH_WILD) pp->chkrstate = pp->state = newstate; - if (pp->state == PATH_TIMEOUT) + if (pp->state == PATH_TIMEOUT || + pp->state == PATH_DISCONNECTED) pp->state = PATH_DOWN; if (pp->state == PATH_UP && !pp->size) { condlog(3, "%s: device size is 0, " diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c index 64054c18..acb087b7 100644 --- a/libmultipath/io_err_stat.c +++ b/libmultipath/io_err_stat.c @@ -397,6 +397,7 @@ static void account_async_io_state(struct io_err_stat_path *pp, int rc) switch (rc) { case PATH_DOWN: case PATH_TIMEOUT: + case PATH_DISCONNECTED: pp->io_err_nr++; break; case PATH_UNCHECKED: diff --git a/libmultipath/libmultipath.version b/libmultipath/libmultipath.version index 89ae2a3c..c435847f 100644 --- a/libmultipath/libmultipath.version +++ b/libmultipath/libmultipath.version @@ -239,6 +239,7 @@ global: snprint_tgt_wwnn; snprint_tgt_wwpn; sysfs_attr_set_value; + sysfs_attr_set_value_nonblock; sysfs_attr_get_value; sysfs_get_asymmetric_access_state; diff --git a/libmultipath/print.c b/libmultipath/print.c index 0d44a5a9..8b09b174 100644 --- a/libmultipath/print.c +++ b/libmultipath/print.c @@ -541,6 +541,8 @@ snprint_chk_state (struct strbuf *buff, const struct path * pp) return append_strbuf_str(buff, "i/o timeout"); case PATH_DELAYED: return append_strbuf_str(buff, "delayed"); + case PATH_DISCONNECTED: + return append_strbuf_str(buff, "disconnected"); default: return append_strbuf_str(buff, "undef"); } diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c index 4c0fbcf3..9a17f1ed 100644 --- a/libmultipath/propsel.c +++ b/libmultipath/propsel.c @@ -1371,6 +1371,22 @@ out: return 0; } +int select_purge_disconnected (struct config *conf, struct multipath * mp) +{ + const char *origin; + + mp_set_mpe(purge_disconnected); + mp_set_ovr(purge_disconnected); + mp_set_hwe(purge_disconnected); + mp_set_conf(purge_disconnected); + mp_set_default(purge_disconnected, DEFAULT_PURGE_DISCONNECTED); +out: + condlog(3, "%s: purge_disconnected = %s %s", mp->alias, + (mp->purge_disconnected == PURGE_DISCONNECTED_ON)? "yes" : "no", + origin); + return 0; +} + int select_max_sectors_kb(struct config *conf, struct multipath * mp) { const char *origin; diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h index 55930050..c6c937c3 100644 --- a/libmultipath/propsel.h +++ b/libmultipath/propsel.h @@ -39,6 +39,7 @@ int select_marginal_path_err_rate_threshold(struct config *conf, struct multipat int select_marginal_path_err_recheck_gap_time(struct config *conf, struct multipath *mp); int select_marginal_path_double_failed_time(struct config *conf, struct multipath *mp); int select_ghost_delay(struct config *conf, struct multipath * mp); +int select_purge_disconnected (struct config *conf, struct multipath * mp); void reconcile_features_with_options(const char *id, char **features, int* no_path_retry, int *retain_hwhandler); diff --git a/libmultipath/structs.h b/libmultipath/structs.h index 9247e74f..81ac6be5 100644 --- a/libmultipath/structs.h +++ b/libmultipath/structs.h @@ -186,6 +186,12 @@ enum auto_resize_state { AUTO_RESIZE_GROW_SHRINK, }; +enum purge_disconnected_states { + PURGE_DISCONNECTED_UNDEF = YNU_UNDEF, + PURGE_DISCONNECTED_OFF = YNU_NO, + PURGE_DISCONNECTED_ON = YNU_YES, +}; + #define PROTOCOL_UNSET -1 enum scsi_protocol { @@ -382,6 +388,9 @@ struct path { int dmstate; int chkrstate; int oldstate; + bool purge_path; + int purge_retries; /* number of purge attempts for this path */ + unsigned int purge_cycle; /* last purge cycle that checked this path */ int failcount; int priority; int pgindex; @@ -483,6 +492,7 @@ struct multipath { int ghost_delay; int ghost_delay_tick; int queue_mode; + int purge_disconnected; unsigned int sync_tick; int checker_count; enum prio_update_type prio_update; diff --git a/libmultipath/sysfs.c b/libmultipath/sysfs.c index af27d107..2160d7fd 100644 --- a/libmultipath/sysfs.c +++ b/libmultipath/sysfs.c @@ -134,7 +134,7 @@ ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, size = write(fd, value, value_len); if (size < 0) { size = -errno; - condlog(3, "%s: write to %s failed: %s", __func__, + condlog(3, "%s: write to %s failed: %s", __func__, devpath, strerror(errno)); } else if (size < (ssize_t)value_len) condlog(3, "%s: underflow writing %zu bytes to %s. Wrote %zd bytes", @@ -144,6 +144,62 @@ ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, return size; } +/* + * Non-blocking version of sysfs_attr_set_value for use with potentially + * blocking sysfs attributes (like SCSI "delete"). + * + * Returns: + * > 0: number of bytes written (success) + * -EAGAIN: write would block, caller should retry or wait for completion + * < 0: other error (negative errno) + */ +ssize_t sysfs_attr_set_value_nonblock(struct udev_device *dev, const char *attr_name, + const char * value, size_t value_len) +{ + const char *syspath; + char devpath[PATH_MAX]; + int fd = -1; + ssize_t size = -1; + + if (!dev || !attr_name || !value || !value_len) { + condlog(1, "%s: invalid parameters", __func__); + return -EINVAL; + } + + syspath = udev_device_get_syspath(dev); + if (!syspath) { + condlog(3, "%s: invalid udevice", __func__); + return -EINVAL; + } + if (safe_sprintf(devpath, "%s/%s", syspath, attr_name)) { + condlog(3, "%s: devpath overflow", __func__); + return -EOVERFLOW; + } + + condlog(4, "open '%s' (non-blocking)", devpath); + /* Open with O_NONBLOCK to avoid blocking in open() or write() */ + fd = open(devpath, O_WRONLY | O_NONBLOCK); + if (fd < 0) { + condlog(3, "%s: attribute '%s' cannot be opened: %s", + __func__, devpath, strerror(errno)); + return -errno; + } + pthread_cleanup_push(cleanup_fd_ptr, &fd); + + size = write(fd, value, value_len); + if (size < 0) { + size = -errno; + if (errno != EAGAIN) + condlog(3, "%s: write to %s failed: %s", __func__, + devpath, strerror(errno)); + } else if (size < (ssize_t)value_len) + condlog(3, "%s: underflow writing %zu bytes to %s. Wrote %zd bytes", + __func__, value_len, devpath, size); + + pthread_cleanup_pop(1); + return size; +} + int sysfs_get_size (struct path *pp, unsigned long long * size) { diff --git a/libmultipath/sysfs.h b/libmultipath/sysfs.h index 45f24c37..da2deaa3 100644 --- a/libmultipath/sysfs.h +++ b/libmultipath/sysfs.h @@ -10,6 +10,8 @@ int devt2devname (char *, int, const char *); ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, const char * value, size_t value_len); +ssize_t sysfs_attr_set_value_nonblock(struct udev_device *dev, const char *attr_name, + const char * value, size_t value_len); ssize_t sysfs_attr_get_value(struct udev_device *dev, const char *attr_name, char * value, size_t value_len); ssize_t sysfs_bin_attr_get_value(struct udev_device *dev, const char *attr_name, diff --git a/multipath/multipath.conf.5.in b/multipath/multipath.conf.5.in index 3c9ae097..84cd1a0a 100644 --- a/multipath/multipath.conf.5.in +++ b/multipath/multipath.conf.5.in @@ -1306,6 +1306,22 @@ The default is: \fBno\fR . . .TP +.B purge_disconnected +If set to +.I yes +, multipathd will automatically remove devices that are in a disconnected state. +A path is considered disconnected when the TUR (Test Unit Ready) path checker +receives the SCSI sense code "LOGICAL UNIT NOT SUPPORTED" (sense key 0x5, +ASC/ASCQ 0x25/0x00). This typically indicates that the LUN has been unmapped +or is no longer presented by the storage array. This option helps clean up +stale device entries that would otherwise remain in the system. +.RS +.TP +The default is: \fBno\fR +.RE +. +. +.TP .B disable_changed_wwids (Deprecated) This option is not supported anymore, and will be ignored. .RE @@ -1602,6 +1618,8 @@ section: .TP .B skip_kpartx .TP +.B purge_disconnected +.TP .B max_sectors_kb .TP .B ghost_delay @@ -1797,6 +1815,8 @@ section: .TP .B skip_kpartx .TP +.B purge_disconnected +.TP .B max_sectors_kb .TP .B ghost_delay @@ -1881,6 +1901,8 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections: .TP .B skip_kpartx .TP +.B purge_disconnected +.TP .B max_sectors_kb .TP .B ghost_delay diff --git a/multipathd/main.c b/multipathd/main.c index d11a8576..b105cd56 100644 --- a/multipathd/main.c +++ b/multipathd/main.c @@ -137,11 +137,14 @@ static enum force_reload_types reconfigure_pending = FORCE_RELOAD_NONE; pid_t daemon_pid; static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER; static pthread_cond_t config_cond; -static pthread_t check_thr, uevent_thr, uxlsnr_thr, uevq_thr, dmevent_thr, - fpin_thr, fpin_consumer_thr; -static bool check_thr_started, uevent_thr_started, uxlsnr_thr_started, - uevq_thr_started, dmevent_thr_started, fpin_thr_started, - fpin_consumer_thr_started; +static pthread_mutex_t purge_mutex = PTHREAD_MUTEX_INITIALIZER; +static pthread_cond_t purge_cond = PTHREAD_COND_INITIALIZER; +int devices_to_purge = 0; +static pthread_t check_thr, purge_thr, uevent_thr, uxlsnr_thr, uevq_thr, + dmevent_thr, fpin_thr, fpin_consumer_thr; +static bool check_thr_started, purge_thr_started, uevent_thr_started, + uxlsnr_thr_started, uevq_thr_started, dmevent_thr_started, + fpin_thr_started, fpin_consumer_thr_started; static int pid_fd = -1; static inline enum daemon_status get_running_state(bool *pending_reconfig) @@ -1089,6 +1092,109 @@ check_path_wwid_change(struct path *pp) return false; } +/* + * Information needed to delete a path via sysfs, copied while holding the lock + * so that the actual sysfs write can be done without holding vecs->lock. + */ +struct purge_path_info { + struct udev_device *udev; /* Reference to udev device (refcounted) */ + char dev_t[BLK_DEV_SIZE]; /* For logging */ + char alias[WWID_SIZE]; /* mpp alias for logging */ + struct path *pp; /* Path pointer for updating purge_path flag */ +}; + +/* + * Attempt to delete a path by writing to the SCSI device's sysfs delete attribute. + * This triggers kernel-level device removal. The actual cleanup of the path structure + * from pathvec happens later when a uevent arrives (handled by uev_remove_path). + * + * This function does NOT require vecs->lock to be held, as it operates on copied data. + * Uses non-blocking I/O to avoid blocking in scsi_remove_device(). + * + * Returns: + * true - sysfs write succeeded or device already being deleted + * false - sysfs write would block (EAGAIN), caller should retry + * + * Note: Errors other than EAGAIN are treated as success, because they typically + * indicate the device is already in SDEV_DEL or SDEV_CANCEL state, or the sysfs + * file doesn't exist (device already removed). + */ +static bool +delete_path_sysfs(struct purge_path_info *info) +{ + struct udev_device *ud; + ssize_t ret; + + if (!info->udev) + return true; /* No device, treat as success */ + + ud = udev_device_get_parent_with_subsystem_devtype(info->udev, + "scsi", "scsi_device"); + + if (!ud) { + /* No SCSI parent device, likely already removed */ + condlog(4, "%s: no SCSI parent device found, likely already removed", + info->dev_t); + return true; + } + + ret = sysfs_attr_set_value_nonblock(ud, "delete", "1", strlen("1")); + + if (ret == -EAGAIN) { + /* Write would block, caller should retry */ + condlog(4, "%s: delete would block, will retry", info->dev_t); + return false; + } + + if (ret < 0) { + /* + * Other errors (ENOENT, EINVAL, etc.) typically mean the device + * is already being deleted or is in SDEV_DEL/SDEV_CANCEL state. + * Treat as success since the end result is what we want. + */ + condlog(3, "%s: sysfs delete returned %zd (%s), treating as success", + info->dev_t, ret, strerror((int)-ret)); + return true; + } + + /* Success */ + condlog(2, "%s: initiated removal from %s", info->dev_t, info->alias); + return true; +} + +/* + * Prepare purge info for a path while holding vecs->lock. + * Takes a reference on the udev device so it remains valid after unlocking. + * Returns true if info was successfully prepared, false otherwise. + */ +static bool +prepare_purge_path_info(struct path *pp, struct purge_path_info *info) +{ + if (!pp->udev || !pp->mpp) + return false; + + info->udev = udev_device_ref(pp->udev); + if (!info->udev) + return false; + + strlcpy(info->dev_t, pp->dev_t, sizeof(info->dev_t)); + strlcpy(info->alias, pp->mpp->alias, sizeof(info->alias)); + info->pp = pp; + return true; +} + +/* + * Clean up purge info after use. + */ +static void +cleanup_purge_path_info(struct purge_path_info *info) +{ + if (info->udev) { + udev_device_unref(info->udev); + info->udev = NULL; + } +} + /* * uev_add_path can call uev_update_path, and uev_update_path can call * uev_add_path @@ -2031,6 +2137,10 @@ reinstate_path (struct path * pp) condlog(0, "%s: reinstate failed", pp->dev_t); else { condlog(2, "%s: reinstated", pp->dev_t); + if (pp->purge_path) { + pp->purge_path = false; + pp->purge_retries = 0; + } update_queue_mode_add_path(pp->mpp); } } @@ -2474,6 +2584,20 @@ get_new_state(struct path *pp) if (newstate == PATH_REMOVED) newstate = PATH_DOWN; + /* + * For PATH_DISOCONNECTED mark the path as OK to purge if that is + * enabled. Whether or not purge is enabled mark the path as down. + */ + if (newstate == PATH_DISCONNECTED) { + if (pp->mpp->purge_disconnected == PURGE_DISCONNECTED_ON && + !pp->purge_path) { + condlog(2, "%s: mark (%s) path for purge", + pp->dev, checker_state_name(newstate)); + pp->purge_path = true; + } + newstate = PATH_DOWN; + } + if (newstate == PATH_WILD || newstate == PATH_UNCHECKED) { condlog(2, "%s: unusable path (%s) - checker failed", pp->dev, checker_state_name(newstate)); @@ -3022,6 +3146,25 @@ update_paths(struct vectors *vecs, int *num_paths_p, time_t start_secs) return CHECKER_FINISHED; } +/* + * Check if there are newly marked paths to purge. + * Only returns true for paths with purge_path=true and purge_retries=0, + * which indicates they were just marked for purging and haven't been + * attempted yet. This prevents signaling the purge thread unnecessarily + * for paths that are already being retried. + */ +static bool +has_paths_to_purge(struct vectors *vecs) +{ + int i; + struct path *pp; + + vector_foreach_slot(vecs->pathvec, pp, i) + if (pp->purge_path && pp->purge_retries == 0) + return true; + return false; +} + static void enable_pathgroups(struct multipath *mpp) { struct pathgroup *pgp; @@ -3125,6 +3268,7 @@ checkerloop (void *ap) int num_paths = 0, strict_timing; unsigned int ticks = 0; enum checker_state checker_state = CHECKER_STARTING; + bool devs_to_purge = false; if (set_config_state(DAEMON_RUNNING) != DAEMON_RUNNING) /* daemon shutdown */ @@ -3168,8 +3312,10 @@ checkerloop (void *ap) if (checker_state == CHECKER_UPDATING_PATHS) checker_state = update_paths(vecs, &num_paths, start_time.tv_sec); - if (checker_state == CHECKER_FINISHED) + if (checker_state == CHECKER_FINISHED) { + devs_to_purge = has_paths_to_purge(vecs); checker_finished(vecs, ticks); + } lock_cleanup_pop(vecs->lock); } @@ -3192,6 +3338,13 @@ checkerloop (void *ap) (long)diff_time.tv_sec); } + if (devs_to_purge) { + pthread_mutex_lock(&purge_mutex); + devices_to_purge = 1; + pthread_cond_signal(&purge_cond); + pthread_mutex_unlock(&purge_mutex); + } + if (foreign_tick == 0) { conf = get_multipath_config(); foreign_tick = conf->max_checkint; @@ -3229,6 +3382,234 @@ checkerloop (void *ap) return NULL; } +/* + * Purge thread: removes disconnected paths by writing to sysfs. + * + * This thread is signaled by the checker thread when paths marked with + * purge_path=true are detected. It attempts to delete these paths by + * writing "1" to their SCSI device's sysfs "delete" attribute. + * + * The actual cleanup of path structures from pathvec happens asynchronously + * when the kernel sends a remove uevent (handled by uev_remove_path). + * If the uevent doesn't arrive or the sysfs write fails, paths will be + * retried on subsequent purge cycles with a timeout mechanism. + * + * To avoid holding vecs->lock for extended periods and starving other threads, + * this function processes one path at a time: lock -> handle one path -> unlock. + * Paths track which cycle they were checked in to avoid reprocessing. + */ +#define PURGE_TIMEOUT_CYCLES 10 /* ~50 seconds at 5 second check intervals */ + +static void +reset_purge_cycle_stats(unsigned int *purged, unsigned int *failed, + unsigned int *pending, bool *found) +{ + *purged = 0; + *failed = 0; + *pending = 0; + *found = false; +} + +static void * +purgeloop (void *ap) +{ + struct vectors *vecs; + unsigned int i; + struct path *pp; + unsigned int current_cycle = 1; + unsigned int devices_purged_total = 0; + unsigned int devices_failed_total = 0; + unsigned int devices_pending_total = 0; + bool found_path_this_cycle = false; + bool cycle_complete_logged = false; + vecs = (struct vectors *)ap; + + pthread_cleanup_push(rcu_unregister, NULL); + rcu_register_thread(); + mlockall(MCL_CURRENT | MCL_FUTURE); + + while (1) { + bool path_handled = false; + struct purge_path_info purge_info = { .udev = NULL }; + bool do_purge = false; + char dev_name[FILE_NAME_SIZE]; + + /* + * Lock and search for one path that needs processing. + * Copy the necessary info while holding the lock, then + * release the lock before doing any blocking sysfs operations. + */ + pthread_cleanup_push(cleanup_lock, &vecs->lock); + lock(&vecs->lock); + pthread_testcancel(); + + vector_foreach_slot (vecs->pathvec, pp, i) { + if (!pp->purge_path || pp->purge_cycle == current_cycle) + continue; + + pp->purge_cycle = current_cycle; + path_handled = true; + + /* Increment timeout counter for this purge attempt */ + pp->purge_retries++; + + if (pp->purge_retries >= PURGE_TIMEOUT_CYCLES) { + /* + * Timeout: path couldn't be deleted after multiple attempts. + * This likely means the device still exists in the kernel + * (perhaps it came back online, or the delete is stuck). + * + * We should NOT force-remove it from multipathd's state, + * as that would create a mismatch between multipathd's view + * and the kernel's view of existing devices. + * + * Instead, clear the purge flag and stop trying. The path + * will remain in multipathd. If it's truly disconnected, + * it will be marked for purge again on the next check cycle. + */ + condlog(1, "%s: purge timeout after %d attempts, giving up. " + "Device may still exist in kernel.", + pp->dev, pp->purge_retries); + pp->purge_path = false; + pp->purge_retries = 0; + devices_failed_total++; + } else { + /* + * Prepare info for sysfs deletion. + * Copy all needed data while holding the lock. + */ + if (prepare_purge_path_info(pp, &purge_info)) { + do_purge = true; + strlcpy(dev_name, pp->dev, sizeof(dev_name)); + } else { + /* Can't prepare info, will retry next cycle */ + devices_pending_total++; + condlog(4, "%s: failed to prepare purge info, will retry", + pp->dev); + } + } + break; + } + + lock_cleanup_pop(vecs->lock); + + /* + * Now we've released the lock. Do the non-blocking sysfs operation + * without holding vecs->lock. + */ + if (do_purge) { + bool success = delete_path_sysfs(&purge_info); + + /* + * Re-acquire lock to update path state. + * The path might have been removed while we didn't hold the lock, + * so we need to verify it still exists. + */ + pthread_cleanup_push(cleanup_lock, &vecs->lock); + lock(&vecs->lock); + pthread_testcancel(); + + /* Verify the path still exists in pathvec */ + if (find_slot(vecs->pathvec, purge_info.pp) != -1) { + if (success) { + /* Success: sysfs write succeeded, uevent expected */ + devices_purged_total++; + purge_info.pp->purge_path = false; + purge_info.pp->purge_retries = 0; + condlog(3, "%s: path purge successful", dev_name); + } else { + /* Will retry on next cycle */ + devices_pending_total++; + condlog(4, "%s: path purge failed, will retry (attempt %d/%d)", + dev_name, purge_info.pp->purge_retries, + PURGE_TIMEOUT_CYCLES); + } + } else { + /* + * Path was removed while we were doing sysfs write. + * This is actually success - the path is gone, which is what we wanted. + */ + if (success) { + devices_purged_total++; + condlog(3, "%s: path removed during purge operation (counted as success)", + dev_name); + } else { + /* Path removed but our sysfs write failed - don't count it */ + condlog(4, "%s: path removed during failed purge operation", dev_name); + } + } + + lock_cleanup_pop(vecs->lock); + cleanup_purge_path_info(&purge_info); + } + + /* + * If we didn't find any path to process, we've completed a cycle. + * Check if we should wait for more work or start a new cycle. + */ + if (path_handled) + found_path_this_cycle = true; + else { + if (found_path_this_cycle && !cycle_complete_logged) { + /* Completed a cycle, log statistics */ + condlog(2, "purge cycle %u complete: %u purged, %u failed, %u pending", + current_cycle, devices_purged_total, + devices_failed_total, devices_pending_total); + cycle_complete_logged = true; + + /* Check if we need to retry pending paths */ + if (devices_pending_total > 0) { + struct timespec wait_time; + struct config *conf; + + conf = get_multipath_config(); + wait_time.tv_sec = conf->checkint; + wait_time.tv_nsec = 0; + put_multipath_config(conf); + + condlog(3, "sleeping %ld seconds before next purge cycle", + (long)wait_time.tv_sec); + nanosleep(&wait_time, NULL); + + /* Start a new cycle for pending paths */ + current_cycle++; + reset_purge_cycle_stats(&devices_purged_total, + &devices_failed_total, + &devices_pending_total, + &found_path_this_cycle); + cycle_complete_logged = false; + condlog(3, "starting purge cycle %u", current_cycle); + continue; + } + + /* No pending paths, prepare for next cycle */ + current_cycle++; + reset_purge_cycle_stats(&devices_purged_total, + &devices_failed_total, + &devices_pending_total, + &found_path_this_cycle); + cycle_complete_logged = false; + } + + /* No paths to process, wait for signal */ + if (cycle_complete_logged) { + pthread_mutex_lock(&purge_mutex); + while (!devices_to_purge) { + condlog(3, "waiting on devices to purge"); + pthread_cond_wait(&purge_cond, &purge_mutex); + } + devices_to_purge = 0; + pthread_mutex_unlock(&purge_mutex); + + condlog(3, "starting purge cycle %u", current_cycle); + } + } + } + + pthread_cleanup_pop(1); + return NULL; +} + static int configure (struct vectors * vecs, enum force_reload_types reload_type) { @@ -3669,6 +4050,8 @@ static void cleanup_threads(void) if (check_thr_started) pthread_cancel(check_thr); + if (purge_thr_started) + pthread_cancel(purge_thr); if (uevent_thr_started) pthread_cancel(uevent_thr); if (uxlsnr_thr_started) @@ -3685,6 +4068,8 @@ static void cleanup_threads(void) if (check_thr_started) pthread_join(check_thr, NULL); + if (purge_thr_started) + pthread_join(purge_thr, NULL); if (uevent_thr_started) pthread_join(uevent_thr, NULL); if (uxlsnr_thr_started) @@ -3937,6 +4322,11 @@ child (__attribute__((unused)) void *param) goto failed; } else check_thr_started = true; + if ((rc = pthread_create(&purge_thr, &misc_attr, purgeloop, vecs))) { + condlog(0,"failed to create purge loop thread: %d", rc); + goto failed; + } else + purge_thr_started = true; if ((rc = pthread_create(&uevq_thr, &misc_attr, uevqloop, vecs))) { condlog(0, "failed to create uevent dispatcher: %d", rc); goto failed; -- 2.51.2 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v3] mulitpath-tools Add purge capability to multipath-tools 2025-12-03 16:19 [PATCH v3] mulitpath-tools Add purge capability to multipath-tools Brian Bunker @ 2025-12-04 19:33 ` Benjamin Marzinski 2025-12-04 20:08 ` Benjamin Marzinski 2025-12-08 12:27 ` Martin Wilck 0 siblings, 2 replies; 7+ messages in thread From: Benjamin Marzinski @ 2025-12-04 19:33 UTC (permalink / raw) To: Brian Bunker; +Cc: dm-devel, mwilck, Krishna Kant On Wed, Dec 03, 2025 at 08:19:33AM -0800, Brian Bunker wrote: > In the Linux kernel when volumes are disconnected, they are left behind > to be re-enabled when the connection comes back. There are many cases > where this behavior is not desirable. > > The goal is to have paths appear when volumes are connected. This is > handled by target initiated rescans and posting of unit attentions. > Secondarily when volumes are disconnected at the target, paths should > be removed. > > This was discussed here in a kernel patch: > https://marc.info/?l=linux-scsi&m=164426722617834&w=2 > > It was suggested there that userspace would be more advantageous in > handling the device removal. > > An option was added to multipath-tools to purge devices that are > disconnected at the target. A purge thread was implemented to handle > the device removal when the path returns sense data (0x5/0x25/0x0) from > the path checker. The option, purge_disconnected, is by default off but > can be turned on by setting it to true. > > When enabled multipathd will log: > Dec 3 09:10:05 init110-14 multipathd[54768]: 8:128: path removed from > map 3624a93708ebb224d4a5e45c400011a43 > ... > Dec 3 09:10:14 init110-14 multipathd[54768]: purge cycle 2 complete: 3 > purged, 0 failed, 0 pending > > Signed-off-by: Brian Bunker <brian@purestorage.com> > Signed-off-by: Krishna Kant <krishna.kant@purestorage.com> > --- > libmultipath/checkers.c | 2 + > libmultipath/checkers.h | 2 + > libmultipath/checkers/tur.c | 10 + > libmultipath/config.c | 2 + > libmultipath/config.h | 3 + > libmultipath/configure.c | 1 + > libmultipath/defaults.h | 1 + > libmultipath/dict.c | 14 ++ > libmultipath/discovery.c | 3 +- > libmultipath/io_err_stat.c | 1 + > libmultipath/libmultipath.version | 1 + > libmultipath/print.c | 2 + > libmultipath/propsel.c | 16 ++ > libmultipath/propsel.h | 1 + > libmultipath/structs.h | 10 + > libmultipath/sysfs.c | 58 ++++- > libmultipath/sysfs.h | 2 + > multipath/multipath.conf.5.in | 22 ++ > multipathd/main.c | 402 +++++++++++++++++++++++++++++- > 19 files changed, 545 insertions(+), 8 deletions(-) > > diff --git a/libmultipath/checkers.c b/libmultipath/checkers.c > index e2eda58d..bb6ad1ee 100644 > --- a/libmultipath/checkers.c > +++ b/libmultipath/checkers.c > @@ -43,6 +43,7 @@ static const char *checker_state_names[PATH_MAX_STATE] = { > [PATH_TIMEOUT] = "timeout", > [PATH_REMOVED] = "removed", > [PATH_DELAYED] = "delayed", > + [PATH_DISCONNECTED] = "disconnected", > }; > > static LIST_HEAD(checkers); > @@ -363,6 +364,7 @@ static const char *generic_msg[CHECKER_GENERIC_MSGTABLE_SIZE] = { > [CHECKER_MSGID_DOWN] = " reports path is down", > [CHECKER_MSGID_GHOST] = " reports path is ghost", > [CHECKER_MSGID_UNSUPPORTED] = " doesn't support this device", > + [CHECKER_MSGID_DISCONNECTED] = " no access to this device", > }; > > const char *checker_message(const struct checker *c) > diff --git a/libmultipath/checkers.h b/libmultipath/checkers.h > index da91f499..903c3ebe 100644 > --- a/libmultipath/checkers.h > +++ b/libmultipath/checkers.h > @@ -78,6 +78,7 @@ enum path_check_state { > PATH_TIMEOUT, > PATH_REMOVED, > PATH_DELAYED, > + PATH_DISCONNECTED, > PATH_MAX_STATE > }; > > @@ -113,6 +114,7 @@ enum { > CHECKER_MSGID_DOWN, > CHECKER_MSGID_GHOST, > CHECKER_MSGID_UNSUPPORTED, > + CHECKER_MSGID_DISCONNECTED, > CHECKER_GENERIC_MSGTABLE_SIZE, > CHECKER_FIRST_MSGID = 100, /* lowest msgid for checkers */ > CHECKER_MSGTABLE_SIZE = 100, /* max msg table size for checkers */ > diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c > index 0010acf8..1bb9a992 100644 > --- a/libmultipath/checkers/tur.c > +++ b/libmultipath/checkers/tur.c > @@ -199,6 +199,16 @@ retry: > return PATH_PENDING; > } > } > + else if (key == 0x5) { > + /* Illegal request */ > + if (asc == 0x25 && ascq == 0x00) { > + /* > + * LOGICAL UNIT NOT SUPPORTED > + */ > + *msgid = CHECKER_MSGID_DISCONNECTED; > + return PATH_DISCONNECTED; > + } > + } > *msgid = CHECKER_MSGID_DOWN; > return PATH_DOWN; > } > diff --git a/libmultipath/config.c b/libmultipath/config.c > index 8b424d18..2c66a788 100644 > --- a/libmultipath/config.c > +++ b/libmultipath/config.c > @@ -470,6 +470,7 @@ merge_hwe (struct hwentry * dst, struct hwentry * src) > merge_num(marginal_path_err_rate_threshold); > merge_num(marginal_path_err_recheck_gap_time); > merge_num(marginal_path_double_failed_time); > + merge_num(purge_disconnected); > > snprintf(id, sizeof(id), "%s/%s", dst->vendor, dst->product); > reconcile_features_with_options(id, &dst->features, > @@ -517,6 +518,7 @@ merge_mpe(struct mpentry *dst, struct mpentry *src) > merge_num(skip_kpartx); > merge_num(max_sectors_kb); > merge_num(ghost_delay); > + merge_num(purge_disconnected); > merge_num(uid); > merge_num(gid); > merge_num(mode); > diff --git a/libmultipath/config.h b/libmultipath/config.h > index 5b4ebf8c..618fa572 100644 > --- a/libmultipath/config.h > +++ b/libmultipath/config.h > @@ -88,6 +88,7 @@ struct hwentry { > int marginal_path_err_rate_threshold; > int marginal_path_err_recheck_gap_time; > int marginal_path_double_failed_time; > + int purge_disconnected; > int skip_kpartx; > int max_sectors_kb; > int ghost_delay; > @@ -130,6 +131,7 @@ struct mpentry { > int marginal_path_err_rate_threshold; > int marginal_path_err_recheck_gap_time; > int marginal_path_double_failed_time; > + int purge_disconnected; > int skip_kpartx; > int max_sectors_kb; > int ghost_delay; > @@ -188,6 +190,7 @@ struct config { > int marginal_path_err_rate_threshold; > int marginal_path_err_recheck_gap_time; > int marginal_path_double_failed_time; > + int purge_disconnected; > int uxsock_timeout; > int strict_timing; > int retrigger_tries; > diff --git a/libmultipath/configure.c b/libmultipath/configure.c > index baa13573..e4431d2f 100644 > --- a/libmultipath/configure.c > +++ b/libmultipath/configure.c > @@ -355,6 +355,7 @@ int setup_map(struct multipath *mpp, char **params, struct vectors *vecs) > select_max_sectors_kb(conf, mpp); > select_ghost_delay(conf, mpp); > select_flush_on_last_del(conf, mpp); > + select_purge_disconnected(conf, mpp); > > sysfs_set_scsi_tmo(conf, mpp); > marginal_pathgroups = conf->marginal_pathgroups; > diff --git a/libmultipath/defaults.h b/libmultipath/defaults.h > index 134b690a..d80dd7d4 100644 > --- a/libmultipath/defaults.h > +++ b/libmultipath/defaults.h > @@ -58,6 +58,7 @@ > #define DEFAULT_ALL_TG_PT ALL_TG_PT_OFF > #define DEFAULT_RECHECK_WWID RECHECK_WWID_OFF > #define DEFAULT_AUTO_RESIZE AUTO_RESIZE_NEVER > +#define DEFAULT_PURGE_DISCONNECTED PURGE_DISCONNECTED_OFF > /* Enable no foreign libraries by default */ > #define DEFAULT_ENABLE_FOREIGN "NONE" > > diff --git a/libmultipath/dict.c b/libmultipath/dict.c > index a06a6138..504b28b5 100644 > --- a/libmultipath/dict.c > +++ b/libmultipath/dict.c > @@ -941,6 +941,16 @@ declare_hw_snprint(skip_kpartx, print_yes_no_undef) > declare_mp_handler(skip_kpartx, set_yes_no_undef) > declare_mp_snprint(skip_kpartx, print_yes_no_undef) > > +declare_def_handler(purge_disconnected, set_yes_no_undef) > +declare_def_snprint_defint(purge_disconnected, print_yes_no_undef, > + DEFAULT_PURGE_DISCONNECTED) > +declare_ovr_handler(purge_disconnected, set_yes_no_undef) > +declare_ovr_snprint(purge_disconnected, print_yes_no_undef) > +declare_hw_handler(purge_disconnected, set_yes_no_undef) > +declare_hw_snprint(purge_disconnected, print_yes_no_undef) > +declare_mp_handler(purge_disconnected, set_yes_no_undef) > +declare_mp_snprint(purge_disconnected, print_yes_no_undef) > + > declare_def_range_handler(remove_retries, 0, INT_MAX) > declare_def_snprint(remove_retries, print_int) > > @@ -2227,6 +2237,7 @@ init_keywords(vector keywords) > install_keyword("retrigger_delay", &def_retrigger_delay_handler, &snprint_def_retrigger_delay); > install_keyword("missing_uev_wait_timeout", &def_uev_wait_timeout_handler, &snprint_def_uev_wait_timeout); > install_keyword("skip_kpartx", &def_skip_kpartx_handler, &snprint_def_skip_kpartx); > + install_keyword("purge_disconnected", &def_purge_disconnected_handler, &snprint_def_purge_disconnected); > install_keyword("disable_changed_wwids", &deprecated_disable_changed_wwids_handler, &snprint_deprecated); > install_keyword("remove_retries", &def_remove_retries_handler, &snprint_def_remove_retries); > install_keyword("max_sectors_kb", &def_max_sectors_kb_handler, &snprint_def_max_sectors_kb); > @@ -2310,6 +2321,7 @@ init_keywords(vector keywords) > install_keyword("marginal_path_err_recheck_gap_time", &hw_marginal_path_err_recheck_gap_time_handler, &snprint_hw_marginal_path_err_recheck_gap_time); > install_keyword("marginal_path_double_failed_time", &hw_marginal_path_double_failed_time_handler, &snprint_hw_marginal_path_double_failed_time); > install_keyword("skip_kpartx", &hw_skip_kpartx_handler, &snprint_hw_skip_kpartx); > + install_keyword("purge_disconnected", &hw_purge_disconnected_handler, &snprint_hw_purge_disconnected); > install_keyword("max_sectors_kb", &hw_max_sectors_kb_handler, &snprint_hw_max_sectors_kb); > install_keyword("ghost_delay", &hw_ghost_delay_handler, &snprint_hw_ghost_delay); > install_keyword("all_tg_pt", &hw_all_tg_pt_handler, &snprint_hw_all_tg_pt); > @@ -2355,6 +2367,7 @@ init_keywords(vector keywords) > install_keyword("marginal_path_double_failed_time", &ovr_marginal_path_double_failed_time_handler, &snprint_ovr_marginal_path_double_failed_time); > > install_keyword("skip_kpartx", &ovr_skip_kpartx_handler, &snprint_ovr_skip_kpartx); > + install_keyword("purge_disconnected", &ovr_purge_disconnected_handler, &snprint_ovr_purge_disconnected); > install_keyword("max_sectors_kb", &ovr_max_sectors_kb_handler, &snprint_ovr_max_sectors_kb); > install_keyword("ghost_delay", &ovr_ghost_delay_handler, &snprint_ovr_ghost_delay); > install_keyword("all_tg_pt", &ovr_all_tg_pt_handler, &snprint_ovr_all_tg_pt); > @@ -2400,6 +2413,7 @@ init_keywords(vector keywords) > install_keyword("marginal_path_err_recheck_gap_time", &mp_marginal_path_err_recheck_gap_time_handler, &snprint_mp_marginal_path_err_recheck_gap_time); > install_keyword("marginal_path_double_failed_time", &mp_marginal_path_double_failed_time_handler, &snprint_mp_marginal_path_double_failed_time); > install_keyword("skip_kpartx", &mp_skip_kpartx_handler, &snprint_mp_skip_kpartx); > + install_keyword("purge_disconnected", &mp_purge_disconnected_handler, &snprint_mp_purge_disconnected); > install_keyword("max_sectors_kb", &mp_max_sectors_kb_handler, &snprint_mp_max_sectors_kb); > install_keyword("ghost_delay", &mp_ghost_delay_handler, &snprint_mp_ghost_delay); > install_sublevel_end(); > diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c > index 31db8758..46274f3f 100644 > --- a/libmultipath/discovery.c > +++ b/libmultipath/discovery.c > @@ -2482,7 +2482,8 @@ int pathinfo(struct path *pp, struct config *conf, int mask) > pp->state == PATH_UNCHECKED || > pp->state == PATH_WILD) > pp->chkrstate = pp->state = newstate; > - if (pp->state == PATH_TIMEOUT) > + if (pp->state == PATH_TIMEOUT || > + pp->state == PATH_DISCONNECTED) > pp->state = PATH_DOWN; > if (pp->state == PATH_UP && !pp->size) { > condlog(3, "%s: device size is 0, " > diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c > index 64054c18..acb087b7 100644 > --- a/libmultipath/io_err_stat.c > +++ b/libmultipath/io_err_stat.c > @@ -397,6 +397,7 @@ static void account_async_io_state(struct io_err_stat_path *pp, int rc) > switch (rc) { > case PATH_DOWN: > case PATH_TIMEOUT: > + case PATH_DISCONNECTED: > pp->io_err_nr++; > break; > case PATH_UNCHECKED: > diff --git a/libmultipath/libmultipath.version b/libmultipath/libmultipath.version > index 89ae2a3c..c435847f 100644 > --- a/libmultipath/libmultipath.version > +++ b/libmultipath/libmultipath.version > @@ -239,6 +239,7 @@ global: > snprint_tgt_wwnn; > snprint_tgt_wwpn; > sysfs_attr_set_value; > + sysfs_attr_set_value_nonblock; > sysfs_attr_get_value; > sysfs_get_asymmetric_access_state; > > diff --git a/libmultipath/print.c b/libmultipath/print.c > index 0d44a5a9..8b09b174 100644 > --- a/libmultipath/print.c > +++ b/libmultipath/print.c > @@ -541,6 +541,8 @@ snprint_chk_state (struct strbuf *buff, const struct path * pp) > return append_strbuf_str(buff, "i/o timeout"); > case PATH_DELAYED: > return append_strbuf_str(buff, "delayed"); > + case PATH_DISCONNECTED: > + return append_strbuf_str(buff, "disconnected"); > default: > return append_strbuf_str(buff, "undef"); > } > diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c > index 4c0fbcf3..9a17f1ed 100644 > --- a/libmultipath/propsel.c > +++ b/libmultipath/propsel.c > @@ -1371,6 +1371,22 @@ out: > return 0; > } > > +int select_purge_disconnected (struct config *conf, struct multipath * mp) > +{ > + const char *origin; > + > + mp_set_mpe(purge_disconnected); > + mp_set_ovr(purge_disconnected); > + mp_set_hwe(purge_disconnected); > + mp_set_conf(purge_disconnected); > + mp_set_default(purge_disconnected, DEFAULT_PURGE_DISCONNECTED); > +out: > + condlog(3, "%s: purge_disconnected = %s %s", mp->alias, > + (mp->purge_disconnected == PURGE_DISCONNECTED_ON)? "yes" : "no", > + origin); > + return 0; > +} > + > int select_max_sectors_kb(struct config *conf, struct multipath * mp) > { > const char *origin; > diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h > index 55930050..c6c937c3 100644 > --- a/libmultipath/propsel.h > +++ b/libmultipath/propsel.h > @@ -39,6 +39,7 @@ int select_marginal_path_err_rate_threshold(struct config *conf, struct multipat > int select_marginal_path_err_recheck_gap_time(struct config *conf, struct multipath *mp); > int select_marginal_path_double_failed_time(struct config *conf, struct multipath *mp); > int select_ghost_delay(struct config *conf, struct multipath * mp); > +int select_purge_disconnected (struct config *conf, struct multipath * mp); > void reconcile_features_with_options(const char *id, char **features, > int* no_path_retry, > int *retain_hwhandler); > diff --git a/libmultipath/structs.h b/libmultipath/structs.h > index 9247e74f..81ac6be5 100644 > --- a/libmultipath/structs.h > +++ b/libmultipath/structs.h > @@ -186,6 +186,12 @@ enum auto_resize_state { > AUTO_RESIZE_GROW_SHRINK, > }; > > +enum purge_disconnected_states { > + PURGE_DISCONNECTED_UNDEF = YNU_UNDEF, > + PURGE_DISCONNECTED_OFF = YNU_NO, > + PURGE_DISCONNECTED_ON = YNU_YES, > +}; > + > #define PROTOCOL_UNSET -1 > > enum scsi_protocol { > @@ -382,6 +388,9 @@ struct path { > int dmstate; > int chkrstate; > int oldstate; > + bool purge_path; > + int purge_retries; /* number of purge attempts for this path */ > + unsigned int purge_cycle; /* last purge cycle that checked this path */ > int failcount; > int priority; > int pgindex; > @@ -483,6 +492,7 @@ struct multipath { > int ghost_delay; > int ghost_delay_tick; > int queue_mode; > + int purge_disconnected; > unsigned int sync_tick; > int checker_count; > enum prio_update_type prio_update; > diff --git a/libmultipath/sysfs.c b/libmultipath/sysfs.c > index af27d107..2160d7fd 100644 > --- a/libmultipath/sysfs.c > +++ b/libmultipath/sysfs.c > @@ -134,7 +134,7 @@ ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, > size = write(fd, value, value_len); > if (size < 0) { > size = -errno; > - condlog(3, "%s: write to %s failed: %s", __func__, > + condlog(3, "%s: write to %s failed: %s", __func__, > devpath, strerror(errno)); > } else if (size < (ssize_t)value_len) > condlog(3, "%s: underflow writing %zu bytes to %s. Wrote %zd bytes", > @@ -144,6 +144,62 @@ ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, > return size; > } Unfortunately. This doesn't work at all. O_NONBLOCK doesn't mean anything for sysfs writes. The only way to make the write non-blocking would be to use libaio or io_uring. But we're already doing the write in a separate thread without holding any locks, so we're fine with it blocking. > > +/* > + * Non-blocking version of sysfs_attr_set_value for use with potentially > + * blocking sysfs attributes (like SCSI "delete"). > + * > + * Returns: > + * > 0: number of bytes written (success) > + * -EAGAIN: write would block, caller should retry or wait for completion > + * < 0: other error (negative errno) > + */ > +ssize_t sysfs_attr_set_value_nonblock(struct udev_device *dev, const char *attr_name, > + const char * value, size_t value_len) > +{ > + const char *syspath; > + char devpath[PATH_MAX]; > + int fd = -1; > + ssize_t size = -1; > + > + if (!dev || !attr_name || !value || !value_len) { > + condlog(1, "%s: invalid parameters", __func__); > + return -EINVAL; > + } > + > + syspath = udev_device_get_syspath(dev); > + if (!syspath) { > + condlog(3, "%s: invalid udevice", __func__); > + return -EINVAL; > + } > + if (safe_sprintf(devpath, "%s/%s", syspath, attr_name)) { > + condlog(3, "%s: devpath overflow", __func__); > + return -EOVERFLOW; > + } > + > + condlog(4, "open '%s' (non-blocking)", devpath); > + /* Open with O_NONBLOCK to avoid blocking in open() or write() */ > + fd = open(devpath, O_WRONLY | O_NONBLOCK); > + if (fd < 0) { > + condlog(3, "%s: attribute '%s' cannot be opened: %s", > + __func__, devpath, strerror(errno)); > + return -errno; > + } > + pthread_cleanup_push(cleanup_fd_ptr, &fd); > + > + size = write(fd, value, value_len); > + if (size < 0) { > + size = -errno; > + if (errno != EAGAIN) > + condlog(3, "%s: write to %s failed: %s", __func__, > + devpath, strerror(errno)); > + } else if (size < (ssize_t)value_len) > + condlog(3, "%s: underflow writing %zu bytes to %s. Wrote %zd bytes", > + __func__, value_len, devpath, size); > + > + pthread_cleanup_pop(1); > + return size; > +} > + > int > sysfs_get_size (struct path *pp, unsigned long long * size) > { > diff --git a/libmultipath/sysfs.h b/libmultipath/sysfs.h > index 45f24c37..da2deaa3 100644 > --- a/libmultipath/sysfs.h > +++ b/libmultipath/sysfs.h > @@ -10,6 +10,8 @@ > int devt2devname (char *, int, const char *); > ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, > const char * value, size_t value_len); > +ssize_t sysfs_attr_set_value_nonblock(struct udev_device *dev, const char *attr_name, > + const char * value, size_t value_len); > ssize_t sysfs_attr_get_value(struct udev_device *dev, const char *attr_name, > char * value, size_t value_len); > ssize_t sysfs_bin_attr_get_value(struct udev_device *dev, const char *attr_name, > diff --git a/multipath/multipath.conf.5.in b/multipath/multipath.conf.5.in > index 3c9ae097..84cd1a0a 100644 > --- a/multipath/multipath.conf.5.in > +++ b/multipath/multipath.conf.5.in > @@ -1306,6 +1306,22 @@ The default is: \fBno\fR > . > . > .TP > +.B purge_disconnected > +If set to > +.I yes > +, multipathd will automatically remove devices that are in a disconnected state. > +A path is considered disconnected when the TUR (Test Unit Ready) path checker > +receives the SCSI sense code "LOGICAL UNIT NOT SUPPORTED" (sense key 0x5, > +ASC/ASCQ 0x25/0x00). This typically indicates that the LUN has been unmapped > +or is no longer presented by the storage array. This option helps clean up > +stale device entries that would otherwise remain in the system. > +.RS > +.TP > +The default is: \fBno\fR > +.RE > +. > +. > +.TP > .B disable_changed_wwids > (Deprecated) This option is not supported anymore, and will be ignored. > .RE > @@ -1602,6 +1618,8 @@ section: > .TP > .B skip_kpartx > .TP > +.B purge_disconnected > +.TP > .B max_sectors_kb > .TP > .B ghost_delay > @@ -1797,6 +1815,8 @@ section: > .TP > .B skip_kpartx > .TP > +.B purge_disconnected > +.TP > .B max_sectors_kb > .TP > .B ghost_delay > @@ -1881,6 +1901,8 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections: > .TP > .B skip_kpartx > .TP > +.B purge_disconnected > +.TP > .B max_sectors_kb > .TP > .B ghost_delay > diff --git a/multipathd/main.c b/multipathd/main.c > index d11a8576..b105cd56 100644 > --- a/multipathd/main.c > +++ b/multipathd/main.c > @@ -137,11 +137,14 @@ static enum force_reload_types reconfigure_pending = FORCE_RELOAD_NONE; > pid_t daemon_pid; > static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER; > static pthread_cond_t config_cond; > -static pthread_t check_thr, uevent_thr, uxlsnr_thr, uevq_thr, dmevent_thr, > - fpin_thr, fpin_consumer_thr; > -static bool check_thr_started, uevent_thr_started, uxlsnr_thr_started, > - uevq_thr_started, dmevent_thr_started, fpin_thr_started, > - fpin_consumer_thr_started; > +static pthread_mutex_t purge_mutex = PTHREAD_MUTEX_INITIALIZER; > +static pthread_cond_t purge_cond = PTHREAD_COND_INITIALIZER; > +int devices_to_purge = 0; > +static pthread_t check_thr, purge_thr, uevent_thr, uxlsnr_thr, uevq_thr, > + dmevent_thr, fpin_thr, fpin_consumer_thr; > +static bool check_thr_started, purge_thr_started, uevent_thr_started, > + uxlsnr_thr_started, uevq_thr_started, dmevent_thr_started, > + fpin_thr_started, fpin_consumer_thr_started; > static int pid_fd = -1; > > static inline enum daemon_status get_running_state(bool *pending_reconfig) > @@ -1089,6 +1092,109 @@ check_path_wwid_change(struct path *pp) > return false; > } This is problematic. Without holding the vecs lock, we cannot trust the path to continue to exist. It could go away at any time, and we would be pointing at freed memory. > +/* > + * Information needed to delete a path via sysfs, copied while holding the lock > + * so that the actual sysfs write can be done without holding vecs->lock. > + */ > +struct purge_path_info { > + struct udev_device *udev; /* Reference to udev device (refcounted) */ > + char dev_t[BLK_DEV_SIZE]; /* For logging */ > + char alias[WWID_SIZE]; /* mpp alias for logging */ > + struct path *pp; /* Path pointer for updating purge_path flag */ > +}; > + > +/* > + * Attempt to delete a path by writing to the SCSI device's sysfs delete attribute. > + * This triggers kernel-level device removal. The actual cleanup of the path structure > + * from pathvec happens later when a uevent arrives (handled by uev_remove_path). > + * > + * This function does NOT require vecs->lock to be held, as it operates on copied data. > + * Uses non-blocking I/O to avoid blocking in scsi_remove_device(). > + * > + * Returns: > + * true - sysfs write succeeded or device already being deleted > + * false - sysfs write would block (EAGAIN), caller should retry > + * > + * Note: Errors other than EAGAIN are treated as success, because they typically > + * indicate the device is already in SDEV_DEL or SDEV_CANCEL state, or the sysfs > + * file doesn't exist (device already removed). > + */ Without the vecs lock being held there nothing to make sure that the path still exists. This means that it could get removed, at which point the devname could get reused. Having a udev reference will not prevent that. So, for instance sdc may now be a new device. Whether or not you would delete this new device depends on whether it's sysfs path remains the same as for the old device. The only real guarantee that the device won't get reused is if someone has it open. > +static bool > +delete_path_sysfs(struct purge_path_info *info) > +{ > + struct udev_device *ud; > + ssize_t ret; > + > + if (!info->udev) > + return true; /* No device, treat as success */ > + > + ud = udev_device_get_parent_with_subsystem_devtype(info->udev, > + "scsi", "scsi_device"); > + > + if (!ud) { > + /* No SCSI parent device, likely already removed */ > + condlog(4, "%s: no SCSI parent device found, likely already removed", > + info->dev_t); > + return true; > + } > + > + ret = sysfs_attr_set_value_nonblock(ud, "delete", "1", strlen("1")); > + > + if (ret == -EAGAIN) { > + /* Write would block, caller should retry */ > + condlog(4, "%s: delete would block, will retry", info->dev_t); > + return false; > + } > + > + if (ret < 0) { > + /* > + * Other errors (ENOENT, EINVAL, etc.) typically mean the device > + * is already being deleted or is in SDEV_DEL/SDEV_CANCEL state. > + * Treat as success since the end result is what we want. > + */ > + condlog(3, "%s: sysfs delete returned %zd (%s), treating as success", > + info->dev_t, ret, strerror((int)-ret)); > + return true; > + } > + > + /* Success */ > + condlog(2, "%s: initiated removal from %s", info->dev_t, info->alias); > + return true; > +} > + > +/* > + * Prepare purge info for a path while holding vecs->lock. > + * Takes a reference on the udev device so it remains valid after unlocking. > + * Returns true if info was successfully prepared, false otherwise. > + */ > +static bool > +prepare_purge_path_info(struct path *pp, struct purge_path_info *info) > +{ > + if (!pp->udev || !pp->mpp) > + return false; > + > + info->udev = udev_device_ref(pp->udev); > + if (!info->udev) > + return false; > + > + strlcpy(info->dev_t, pp->dev_t, sizeof(info->dev_t)); > + strlcpy(info->alias, pp->mpp->alias, sizeof(info->alias)); > + info->pp = pp; > + return true; > +} This has to be implemented as function that can get run by a cleanup handler. All multipathd threads must cleanup any state (references, open fd, locks) if they are cancelled. > +/* > + * Clean up purge info after use. > + */ > +static void > +cleanup_purge_path_info(struct purge_path_info *info) > +{ > + if (info->udev) { > + udev_device_unref(info->udev); > + info->udev = NULL; > + } > +} > + > /* > * uev_add_path can call uev_update_path, and uev_update_path can call > * uev_add_path > @@ -2031,6 +2137,10 @@ reinstate_path (struct path * pp) > condlog(0, "%s: reinstate failed", pp->dev_t); > else { > condlog(2, "%s: reinstated", pp->dev_t); > + if (pp->purge_path) { On the off chance that a path gets disconnected, not successfully purged, and then reconnected, we should probably clear pp->purge_cycle here as well. There is an admittedly very rare chance that the path could later be again disconnected after purgeloop's current_cycle rolled over and got back to the old value of pp->purge_cycle. This would keep it from getting handled, since it would appear to have already been handled in the current cycle. > + pp->purge_path = false; > + pp->purge_retries = 0; > + } > update_queue_mode_add_path(pp->mpp); > } > } > @@ -2474,6 +2584,20 @@ get_new_state(struct path *pp) > if (newstate == PATH_REMOVED) > newstate = PATH_DOWN; > > + /* > + * For PATH_DISOCONNECTED mark the path as OK to purge if that is > + * enabled. Whether or not purge is enabled mark the path as down. > + */ > + if (newstate == PATH_DISCONNECTED) { > + if (pp->mpp->purge_disconnected == PURGE_DISCONNECTED_ON && > + !pp->purge_path) { > + condlog(2, "%s: mark (%s) path for purge", > + pp->dev, checker_state_name(newstate)); > + pp->purge_path = true; > + } > + newstate = PATH_DOWN; > + } > + > if (newstate == PATH_WILD || newstate == PATH_UNCHECKED) { > condlog(2, "%s: unusable path (%s) - checker failed", > pp->dev, checker_state_name(newstate)); > @@ -3022,6 +3146,25 @@ update_paths(struct vectors *vecs, int *num_paths_p, time_t start_secs) > return CHECKER_FINISHED; > } > > +/* > + * Check if there are newly marked paths to purge. > + * Only returns true for paths with purge_path=true and purge_retries=0, > + * which indicates they were just marked for purging and haven't been > + * attempted yet. This prevents signaling the purge thread unnecessarily > + * for paths that are already being retried. > + */ > +static bool > +has_paths_to_purge(struct vectors *vecs) > +{ > + int i; > + struct path *pp; > + > + vector_foreach_slot(vecs->pathvec, pp, i) > + if (pp->purge_path && pp->purge_retries == 0) > + return true; > + return false; > +} > + > static void enable_pathgroups(struct multipath *mpp) > { > struct pathgroup *pgp; > @@ -3125,6 +3268,7 @@ checkerloop (void *ap) > int num_paths = 0, strict_timing; > unsigned int ticks = 0; > enum checker_state checker_state = CHECKER_STARTING; > + bool devs_to_purge = false; > > if (set_config_state(DAEMON_RUNNING) != DAEMON_RUNNING) > /* daemon shutdown */ > @@ -3168,8 +3312,10 @@ checkerloop (void *ap) > if (checker_state == CHECKER_UPDATING_PATHS) > checker_state = update_paths(vecs, &num_paths, > start_time.tv_sec); > - if (checker_state == CHECKER_FINISHED) > + if (checker_state == CHECKER_FINISHED) { > + devs_to_purge = has_paths_to_purge(vecs); > checker_finished(vecs, ticks); > + } > lock_cleanup_pop(vecs->lock); > } > > @@ -3192,6 +3338,13 @@ checkerloop (void *ap) > (long)diff_time.tv_sec); > } > > + if (devs_to_purge) { > + pthread_mutex_lock(&purge_mutex); > + devices_to_purge = 1; > + pthread_cond_signal(&purge_cond); > + pthread_mutex_unlock(&purge_mutex); > + } > + > if (foreign_tick == 0) { > conf = get_multipath_config(); > foreign_tick = conf->max_checkint; > @@ -3229,6 +3382,234 @@ checkerloop (void *ap) > return NULL; > } > > +/* > + * Purge thread: removes disconnected paths by writing to sysfs. > + * > + * This thread is signaled by the checker thread when paths marked with > + * purge_path=true are detected. It attempts to delete these paths by > + * writing "1" to their SCSI device's sysfs "delete" attribute. > + * > + * The actual cleanup of path structures from pathvec happens asynchronously > + * when the kernel sends a remove uevent (handled by uev_remove_path). > + * If the uevent doesn't arrive or the sysfs write fails, paths will be > + * retried on subsequent purge cycles with a timeout mechanism. > + * > + * To avoid holding vecs->lock for extended periods and starving other threads, > + * this function processes one path at a time: lock -> handle one path -> unlock. > + * Paths track which cycle they were checked in to avoid reprocessing. > + */ > +#define PURGE_TIMEOUT_CYCLES 10 /* ~50 seconds at 5 second check intervals */ > + > +static void > +reset_purge_cycle_stats(unsigned int *purged, unsigned int *failed, > + unsigned int *pending, bool *found) > +{ > + *purged = 0; > + *failed = 0; > + *pending = 0; > + *found = false; > +} How about a simplifying the purge thread slightly. When the checker gets a path it is supposed to purge, it sets purge_path, just like now. I don't think we need to track purge_retries at all. When the purge loop is triggered, it locks the vecs lock, and finds a single path to purge, like now. The only info it needs to save is the udev device (with a new refcount taken like now) and a dup of pp->fd (You need this to guarantee that the devnode won't get reused if the device is deleted and a new one is re-added). You could also save the name, but udev_device_get_devnode() will get it for you. Like I mentioned above, you need to have a pthread cleanup handler that gets run to clean these up if they are set and the thread is cancelled, just like how the vecs lock is handled. Then purgeloop() clears pp->purge_path, drops the vecs lock, does the sysfs delete, and cleans up the purge_info by calling pthread_cleanup_pop(). Finally it loops back to reacquiring the vecs lock and finding the next path to delete. I don't really see a value in all the stats tracking. If it goes through all the paths, and doesn't find one in need of purging, it waits, just like now. I'm not even sure we need to keep the cycle checking code. If we really think the sysfs call can fail, and leave the path around, then it may be necessary. In that case, if the purge thread took a long time (longer than 5 seconds) the checker could set pp->pruge_path again before the purge thread got to processing the later paths in need of purging. We could get stuck processing the same path over and over. But this only really happens if purging takes a long time and then fails. I also don't think the sleeping code is necessary. If there are paths to purge, they are either newly added, or the purge thread has taken long that the checker has run again and re-added them. Either way, it seems like we should handle them right away. Martin's point (that this could be done by an entirely separate program that get's called with a scsi device, opens it (to make sure it doesn't get removed), verifys that it is disconnected, and then removes it via sysfs) is reasonable. But if we want to solve this in multipathd. limiting the vecs lock holding to just when we set up the > +static void * > +purgeloop (void *ap) > +{ > + struct vectors *vecs; > + unsigned int i; > + struct path *pp; > + unsigned int current_cycle = 1; > + unsigned int devices_purged_total = 0; > + unsigned int devices_failed_total = 0; > + unsigned int devices_pending_total = 0; > + bool found_path_this_cycle = false; > + bool cycle_complete_logged = false; > + vecs = (struct vectors *)ap; > + > + pthread_cleanup_push(rcu_unregister, NULL); > + rcu_register_thread(); > + mlockall(MCL_CURRENT | MCL_FUTURE); > + > + while (1) { > + bool path_handled = false; > + struct purge_path_info purge_info = { .udev = NULL }; > + bool do_purge = false; > + char dev_name[FILE_NAME_SIZE]; > + > + /* > + * Lock and search for one path that needs processing. > + * Copy the necessary info while holding the lock, then > + * release the lock before doing any blocking sysfs operations. > + */ we need to pthread_cleanup_push a handle that cleans up purge_info (if it's set) here. > + pthread_cleanup_push(cleanup_lock, &vecs->lock); > + lock(&vecs->lock); > + pthread_testcancel(); > + > + vector_foreach_slot (vecs->pathvec, pp, i) { > + if (!pp->purge_path || pp->purge_cycle == current_cycle) > + continue; > + > + pp->purge_cycle = current_cycle; > + path_handled = true; > + I think we can remove all this retries code. > + /* Increment timeout counter for this purge attempt */ > + pp->purge_retries++; > + > + if (pp->purge_retries >= PURGE_TIMEOUT_CYCLES) { > + /* > + * Timeout: path couldn't be deleted after multiple attempts. > + * This likely means the device still exists in the kernel > + * (perhaps it came back online, or the delete is stuck). > + * > + * We should NOT force-remove it from multipathd's state, > + * as that would create a mismatch between multipathd's view > + * and the kernel's view of existing devices. > + * > + * Instead, clear the purge flag and stop trying. The path > + * will remain in multipathd. If it's truly disconnected, > + * it will be marked for purge again on the next check cycle. > + */ > + condlog(1, "%s: purge timeout after %d attempts, giving up. " > + "Device may still exist in kernel.", > + pp->dev, pp->purge_retries); > + pp->purge_path = false; > + pp->purge_retries = 0; > + devices_failed_total++; > + } else { > + /* > + * Prepare info for sysfs deletion. > + * Copy all needed data while holding the lock. > + */ > + if (prepare_purge_path_info(pp, &purge_info)) { > + do_purge = true; > + strlcpy(dev_name, pp->dev, sizeof(dev_name)); > + } else { > + /* Can't prepare info, will retry next cycle */ > + devices_pending_total++; > + condlog(4, "%s: failed to prepare purge info, will retry", > + pp->dev); > + } > + } > + break; > + } > + > + lock_cleanup_pop(vecs->lock); > + > + /* > + * Now we've released the lock. Do the non-blocking sysfs operation > + * without holding vecs->lock. > + */ > + if (do_purge) { > + bool success = delete_path_sysfs(&purge_info); I don't really see the benefit of all this work done with the lock reacquired. > + /* > + * Re-acquire lock to update path state. > + * The path might have been removed while we didn't hold the lock, > + * so we need to verify it still exists. > + */ > + pthread_cleanup_push(cleanup_lock, &vecs->lock); > + lock(&vecs->lock); > + pthread_testcancel(); > + > + /* Verify the path still exists in pathvec */ > + if (find_slot(vecs->pathvec, purge_info.pp) != -1) { > + if (success) { > + /* Success: sysfs write succeeded, uevent expected */ > + devices_purged_total++; > + purge_info.pp->purge_path = false; > + purge_info.pp->purge_retries = 0; > + condlog(3, "%s: path purge successful", dev_name); > + } else { > + /* Will retry on next cycle */ > + devices_pending_total++; > + condlog(4, "%s: path purge failed, will retry (attempt %d/%d)", > + dev_name, purge_info.pp->purge_retries, > + PURGE_TIMEOUT_CYCLES); > + } > + } else { > + /* > + * Path was removed while we were doing sysfs write. > + * This is actually success - the path is gone, which is what we wanted. > + */ > + if (success) { > + devices_purged_total++; > + condlog(3, "%s: path removed during purge operation (counted as success)", > + dev_name); > + } else { > + /* Path removed but our sysfs write failed - don't count it */ > + condlog(4, "%s: path removed during failed purge operation", dev_name); > + } > + } > + > + lock_cleanup_pop(vecs->lock); > + cleanup_purge_path_info(&purge_info); > + } We clean up the purge_info() here with pthread_cleanup_pop() Again, I don't think we need all this logging and sleeping code. If we handled a path we just loop. If not we increment current_cycle (assuming we think it is necessary) and wait to get woken back up. > + /* > + * If we didn't find any path to process, we've completed a cycle. > + * Check if we should wait for more work or start a new cycle. > + */ > + if (path_handled) > + found_path_this_cycle = true; > + else { > + if (found_path_this_cycle && !cycle_complete_logged) { > + /* Completed a cycle, log statistics */ > + condlog(2, "purge cycle %u complete: %u purged, %u failed, %u pending", > + current_cycle, devices_purged_total, > + devices_failed_total, devices_pending_total); > + cycle_complete_logged = true; > + > + /* Check if we need to retry pending paths */ > + if (devices_pending_total > 0) { > + struct timespec wait_time; > + struct config *conf; > + > + conf = get_multipath_config(); > + wait_time.tv_sec = conf->checkint; > + wait_time.tv_nsec = 0; > + put_multipath_config(conf); > + > + condlog(3, "sleeping %ld seconds before next purge cycle", > + (long)wait_time.tv_sec); > + nanosleep(&wait_time, NULL); > + > + /* Start a new cycle for pending paths */ > + current_cycle++; > + reset_purge_cycle_stats(&devices_purged_total, > + &devices_failed_total, > + &devices_pending_total, > + &found_path_this_cycle); > + cycle_complete_logged = false; > + condlog(3, "starting purge cycle %u", current_cycle); > + continue; > + } > + > + /* No pending paths, prepare for next cycle */ Just to make sure that paths don't get skipped, we should make sure that if current_cycle rolls over to 0, it's incremented again. That way, paths with pp->purge_cycle = 0 will always get checked. -Ben > + current_cycle++; > + reset_purge_cycle_stats(&devices_purged_total, > + &devices_failed_total, > + &devices_pending_total, > + &found_path_this_cycle); > + cycle_complete_logged = false; > + } > + > + /* No paths to process, wait for signal */ > + if (cycle_complete_logged) { > + pthread_mutex_lock(&purge_mutex); > + while (!devices_to_purge) { > + condlog(3, "waiting on devices to purge"); > + pthread_cond_wait(&purge_cond, &purge_mutex); > + } > + devices_to_purge = 0; > + pthread_mutex_unlock(&purge_mutex); > + > + condlog(3, "starting purge cycle %u", current_cycle); > + } > + } > + } > + > + pthread_cleanup_pop(1); > + return NULL; > +} > + > static int > configure (struct vectors * vecs, enum force_reload_types reload_type) > { > @@ -3669,6 +4050,8 @@ static void cleanup_threads(void) > > if (check_thr_started) > pthread_cancel(check_thr); > + if (purge_thr_started) > + pthread_cancel(purge_thr); > if (uevent_thr_started) > pthread_cancel(uevent_thr); > if (uxlsnr_thr_started) > @@ -3685,6 +4068,8 @@ static void cleanup_threads(void) > > if (check_thr_started) > pthread_join(check_thr, NULL); > + if (purge_thr_started) > + pthread_join(purge_thr, NULL); > if (uevent_thr_started) > pthread_join(uevent_thr, NULL); > if (uxlsnr_thr_started) > @@ -3937,6 +4322,11 @@ child (__attribute__((unused)) void *param) > goto failed; > } else > check_thr_started = true; > + if ((rc = pthread_create(&purge_thr, &misc_attr, purgeloop, vecs))) { > + condlog(0,"failed to create purge loop thread: %d", rc); > + goto failed; > + } else > + purge_thr_started = true; > if ((rc = pthread_create(&uevq_thr, &misc_attr, uevqloop, vecs))) { > condlog(0, "failed to create uevent dispatcher: %d", rc); > goto failed; > -- > 2.51.2 ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] mulitpath-tools Add purge capability to multipath-tools 2025-12-04 19:33 ` Benjamin Marzinski @ 2025-12-04 20:08 ` Benjamin Marzinski 2025-12-08 12:27 ` Martin Wilck 1 sibling, 0 replies; 7+ messages in thread From: Benjamin Marzinski @ 2025-12-04 20:08 UTC (permalink / raw) To: Brian Bunker; +Cc: dm-devel, mwilck, Krishna Kant On Thu, Dec 04, 2025 at 02:33:34PM -0500, Benjamin Marzinski wrote: > On Wed, Dec 03, 2025 at 08:19:33AM -0800, Brian Bunker wrote: > > In the Linux kernel when volumes are disconnected, they are left behind > > to be re-enabled when the connection comes back. There are many cases > > where this behavior is not desirable. > > > > The goal is to have paths appear when volumes are connected. This is > > handled by target initiated rescans and posting of unit attentions. > > Secondarily when volumes are disconnected at the target, paths should > > be removed. > > > > This was discussed here in a kernel patch: > > https://marc.info/?l=linux-scsi&m=164426722617834&w=2 > > > > It was suggested there that userspace would be more advantageous in > > handling the device removal. > > > > An option was added to multipath-tools to purge devices that are > > disconnected at the target. A purge thread was implemented to handle > > the device removal when the path returns sense data (0x5/0x25/0x0) from > > the path checker. The option, purge_disconnected, is by default off but > > can be turned on by setting it to true. > > > > When enabled multipathd will log: > > Dec 3 09:10:05 init110-14 multipathd[54768]: 8:128: path removed from > > map 3624a93708ebb224d4a5e45c400011a43 > > ... > > Dec 3 09:10:14 init110-14 multipathd[54768]: purge cycle 2 complete: 3 > > purged, 0 failed, 0 pending > > > > Signed-off-by: Brian Bunker <brian@purestorage.com> > > Signed-off-by: Krishna Kant <krishna.kant@purestorage.com> > > --- > > libmultipath/checkers.c | 2 + > > libmultipath/checkers.h | 2 + > > libmultipath/checkers/tur.c | 10 + > > libmultipath/config.c | 2 + > > libmultipath/config.h | 3 + > > libmultipath/configure.c | 1 + > > libmultipath/defaults.h | 1 + > > libmultipath/dict.c | 14 ++ > > libmultipath/discovery.c | 3 +- > > libmultipath/io_err_stat.c | 1 + > > libmultipath/libmultipath.version | 1 + > > libmultipath/print.c | 2 + > > libmultipath/propsel.c | 16 ++ > > libmultipath/propsel.h | 1 + > > libmultipath/structs.h | 10 + > > libmultipath/sysfs.c | 58 ++++- > > libmultipath/sysfs.h | 2 + > > multipath/multipath.conf.5.in | 22 ++ > > multipathd/main.c | 402 +++++++++++++++++++++++++++++- > > 19 files changed, 545 insertions(+), 8 deletions(-) > > > > diff --git a/libmultipath/checkers.c b/libmultipath/checkers.c > > index e2eda58d..bb6ad1ee 100644 > > --- a/libmultipath/checkers.c > > +++ b/libmultipath/checkers.c > > @@ -43,6 +43,7 @@ static const char *checker_state_names[PATH_MAX_STATE] = { > > [PATH_TIMEOUT] = "timeout", > > [PATH_REMOVED] = "removed", > > [PATH_DELAYED] = "delayed", > > + [PATH_DISCONNECTED] = "disconnected", > > }; > > > > static LIST_HEAD(checkers); > > @@ -363,6 +364,7 @@ static const char *generic_msg[CHECKER_GENERIC_MSGTABLE_SIZE] = { > > [CHECKER_MSGID_DOWN] = " reports path is down", > > [CHECKER_MSGID_GHOST] = " reports path is ghost", > > [CHECKER_MSGID_UNSUPPORTED] = " doesn't support this device", > > + [CHECKER_MSGID_DISCONNECTED] = " no access to this device", > > }; > > > > const char *checker_message(const struct checker *c) > > diff --git a/libmultipath/checkers.h b/libmultipath/checkers.h > > index da91f499..903c3ebe 100644 > > --- a/libmultipath/checkers.h > > +++ b/libmultipath/checkers.h > > @@ -78,6 +78,7 @@ enum path_check_state { > > PATH_TIMEOUT, > > PATH_REMOVED, > > PATH_DELAYED, > > + PATH_DISCONNECTED, > > PATH_MAX_STATE > > }; > > > > @@ -113,6 +114,7 @@ enum { > > CHECKER_MSGID_DOWN, > > CHECKER_MSGID_GHOST, > > CHECKER_MSGID_UNSUPPORTED, > > + CHECKER_MSGID_DISCONNECTED, > > CHECKER_GENERIC_MSGTABLE_SIZE, > > CHECKER_FIRST_MSGID = 100, /* lowest msgid for checkers */ > > CHECKER_MSGTABLE_SIZE = 100, /* max msg table size for checkers */ > > diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c > > index 0010acf8..1bb9a992 100644 > > --- a/libmultipath/checkers/tur.c > > +++ b/libmultipath/checkers/tur.c > > @@ -199,6 +199,16 @@ retry: > > return PATH_PENDING; > > } > > } > > + else if (key == 0x5) { > > + /* Illegal request */ > > + if (asc == 0x25 && ascq == 0x00) { > > + /* > > + * LOGICAL UNIT NOT SUPPORTED > > + */ > > + *msgid = CHECKER_MSGID_DISCONNECTED; > > + return PATH_DISCONNECTED; > > + } > > + } > > *msgid = CHECKER_MSGID_DOWN; > > return PATH_DOWN; > > } > > diff --git a/libmultipath/config.c b/libmultipath/config.c > > index 8b424d18..2c66a788 100644 > > --- a/libmultipath/config.c > > +++ b/libmultipath/config.c > > @@ -470,6 +470,7 @@ merge_hwe (struct hwentry * dst, struct hwentry * src) > > merge_num(marginal_path_err_rate_threshold); > > merge_num(marginal_path_err_recheck_gap_time); > > merge_num(marginal_path_double_failed_time); > > + merge_num(purge_disconnected); > > > > snprintf(id, sizeof(id), "%s/%s", dst->vendor, dst->product); > > reconcile_features_with_options(id, &dst->features, > > @@ -517,6 +518,7 @@ merge_mpe(struct mpentry *dst, struct mpentry *src) > > merge_num(skip_kpartx); > > merge_num(max_sectors_kb); > > merge_num(ghost_delay); > > + merge_num(purge_disconnected); > > merge_num(uid); > > merge_num(gid); > > merge_num(mode); > > diff --git a/libmultipath/config.h b/libmultipath/config.h > > index 5b4ebf8c..618fa572 100644 > > --- a/libmultipath/config.h > > +++ b/libmultipath/config.h > > @@ -88,6 +88,7 @@ struct hwentry { > > int marginal_path_err_rate_threshold; > > int marginal_path_err_recheck_gap_time; > > int marginal_path_double_failed_time; > > + int purge_disconnected; > > int skip_kpartx; > > int max_sectors_kb; > > int ghost_delay; > > @@ -130,6 +131,7 @@ struct mpentry { > > int marginal_path_err_rate_threshold; > > int marginal_path_err_recheck_gap_time; > > int marginal_path_double_failed_time; > > + int purge_disconnected; > > int skip_kpartx; > > int max_sectors_kb; > > int ghost_delay; > > @@ -188,6 +190,7 @@ struct config { > > int marginal_path_err_rate_threshold; > > int marginal_path_err_recheck_gap_time; > > int marginal_path_double_failed_time; > > + int purge_disconnected; > > int uxsock_timeout; > > int strict_timing; > > int retrigger_tries; > > diff --git a/libmultipath/configure.c b/libmultipath/configure.c > > index baa13573..e4431d2f 100644 > > --- a/libmultipath/configure.c > > +++ b/libmultipath/configure.c > > @@ -355,6 +355,7 @@ int setup_map(struct multipath *mpp, char **params, struct vectors *vecs) > > select_max_sectors_kb(conf, mpp); > > select_ghost_delay(conf, mpp); > > select_flush_on_last_del(conf, mpp); > > + select_purge_disconnected(conf, mpp); > > > > sysfs_set_scsi_tmo(conf, mpp); > > marginal_pathgroups = conf->marginal_pathgroups; > > diff --git a/libmultipath/defaults.h b/libmultipath/defaults.h > > index 134b690a..d80dd7d4 100644 > > --- a/libmultipath/defaults.h > > +++ b/libmultipath/defaults.h > > @@ -58,6 +58,7 @@ > > #define DEFAULT_ALL_TG_PT ALL_TG_PT_OFF > > #define DEFAULT_RECHECK_WWID RECHECK_WWID_OFF > > #define DEFAULT_AUTO_RESIZE AUTO_RESIZE_NEVER > > +#define DEFAULT_PURGE_DISCONNECTED PURGE_DISCONNECTED_OFF > > /* Enable no foreign libraries by default */ > > #define DEFAULT_ENABLE_FOREIGN "NONE" > > > > diff --git a/libmultipath/dict.c b/libmultipath/dict.c > > index a06a6138..504b28b5 100644 > > --- a/libmultipath/dict.c > > +++ b/libmultipath/dict.c > > @@ -941,6 +941,16 @@ declare_hw_snprint(skip_kpartx, print_yes_no_undef) > > declare_mp_handler(skip_kpartx, set_yes_no_undef) > > declare_mp_snprint(skip_kpartx, print_yes_no_undef) > > > > +declare_def_handler(purge_disconnected, set_yes_no_undef) > > +declare_def_snprint_defint(purge_disconnected, print_yes_no_undef, > > + DEFAULT_PURGE_DISCONNECTED) > > +declare_ovr_handler(purge_disconnected, set_yes_no_undef) > > +declare_ovr_snprint(purge_disconnected, print_yes_no_undef) > > +declare_hw_handler(purge_disconnected, set_yes_no_undef) > > +declare_hw_snprint(purge_disconnected, print_yes_no_undef) > > +declare_mp_handler(purge_disconnected, set_yes_no_undef) > > +declare_mp_snprint(purge_disconnected, print_yes_no_undef) > > + > > declare_def_range_handler(remove_retries, 0, INT_MAX) > > declare_def_snprint(remove_retries, print_int) > > > > @@ -2227,6 +2237,7 @@ init_keywords(vector keywords) > > install_keyword("retrigger_delay", &def_retrigger_delay_handler, &snprint_def_retrigger_delay); > > install_keyword("missing_uev_wait_timeout", &def_uev_wait_timeout_handler, &snprint_def_uev_wait_timeout); > > install_keyword("skip_kpartx", &def_skip_kpartx_handler, &snprint_def_skip_kpartx); > > + install_keyword("purge_disconnected", &def_purge_disconnected_handler, &snprint_def_purge_disconnected); > > install_keyword("disable_changed_wwids", &deprecated_disable_changed_wwids_handler, &snprint_deprecated); > > install_keyword("remove_retries", &def_remove_retries_handler, &snprint_def_remove_retries); > > install_keyword("max_sectors_kb", &def_max_sectors_kb_handler, &snprint_def_max_sectors_kb); > > @@ -2310,6 +2321,7 @@ init_keywords(vector keywords) > > install_keyword("marginal_path_err_recheck_gap_time", &hw_marginal_path_err_recheck_gap_time_handler, &snprint_hw_marginal_path_err_recheck_gap_time); > > install_keyword("marginal_path_double_failed_time", &hw_marginal_path_double_failed_time_handler, &snprint_hw_marginal_path_double_failed_time); > > install_keyword("skip_kpartx", &hw_skip_kpartx_handler, &snprint_hw_skip_kpartx); > > + install_keyword("purge_disconnected", &hw_purge_disconnected_handler, &snprint_hw_purge_disconnected); > > install_keyword("max_sectors_kb", &hw_max_sectors_kb_handler, &snprint_hw_max_sectors_kb); > > install_keyword("ghost_delay", &hw_ghost_delay_handler, &snprint_hw_ghost_delay); > > install_keyword("all_tg_pt", &hw_all_tg_pt_handler, &snprint_hw_all_tg_pt); > > @@ -2355,6 +2367,7 @@ init_keywords(vector keywords) > > install_keyword("marginal_path_double_failed_time", &ovr_marginal_path_double_failed_time_handler, &snprint_ovr_marginal_path_double_failed_time); > > > > install_keyword("skip_kpartx", &ovr_skip_kpartx_handler, &snprint_ovr_skip_kpartx); > > + install_keyword("purge_disconnected", &ovr_purge_disconnected_handler, &snprint_ovr_purge_disconnected); > > install_keyword("max_sectors_kb", &ovr_max_sectors_kb_handler, &snprint_ovr_max_sectors_kb); > > install_keyword("ghost_delay", &ovr_ghost_delay_handler, &snprint_ovr_ghost_delay); > > install_keyword("all_tg_pt", &ovr_all_tg_pt_handler, &snprint_ovr_all_tg_pt); > > @@ -2400,6 +2413,7 @@ init_keywords(vector keywords) > > install_keyword("marginal_path_err_recheck_gap_time", &mp_marginal_path_err_recheck_gap_time_handler, &snprint_mp_marginal_path_err_recheck_gap_time); > > install_keyword("marginal_path_double_failed_time", &mp_marginal_path_double_failed_time_handler, &snprint_mp_marginal_path_double_failed_time); > > install_keyword("skip_kpartx", &mp_skip_kpartx_handler, &snprint_mp_skip_kpartx); > > + install_keyword("purge_disconnected", &mp_purge_disconnected_handler, &snprint_mp_purge_disconnected); > > install_keyword("max_sectors_kb", &mp_max_sectors_kb_handler, &snprint_mp_max_sectors_kb); > > install_keyword("ghost_delay", &mp_ghost_delay_handler, &snprint_mp_ghost_delay); > > install_sublevel_end(); > > diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c > > index 31db8758..46274f3f 100644 > > --- a/libmultipath/discovery.c > > +++ b/libmultipath/discovery.c > > @@ -2482,7 +2482,8 @@ int pathinfo(struct path *pp, struct config *conf, int mask) > > pp->state == PATH_UNCHECKED || > > pp->state == PATH_WILD) > > pp->chkrstate = pp->state = newstate; > > - if (pp->state == PATH_TIMEOUT) > > + if (pp->state == PATH_TIMEOUT || > > + pp->state == PATH_DISCONNECTED) > > pp->state = PATH_DOWN; > > if (pp->state == PATH_UP && !pp->size) { > > condlog(3, "%s: device size is 0, " > > diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c > > index 64054c18..acb087b7 100644 > > --- a/libmultipath/io_err_stat.c > > +++ b/libmultipath/io_err_stat.c > > @@ -397,6 +397,7 @@ static void account_async_io_state(struct io_err_stat_path *pp, int rc) > > switch (rc) { > > case PATH_DOWN: > > case PATH_TIMEOUT: > > + case PATH_DISCONNECTED: > > pp->io_err_nr++; > > break; > > case PATH_UNCHECKED: > > diff --git a/libmultipath/libmultipath.version b/libmultipath/libmultipath.version > > index 89ae2a3c..c435847f 100644 > > --- a/libmultipath/libmultipath.version > > +++ b/libmultipath/libmultipath.version > > @@ -239,6 +239,7 @@ global: > > snprint_tgt_wwnn; > > snprint_tgt_wwpn; > > sysfs_attr_set_value; > > + sysfs_attr_set_value_nonblock; > > sysfs_attr_get_value; > > sysfs_get_asymmetric_access_state; > > > > diff --git a/libmultipath/print.c b/libmultipath/print.c > > index 0d44a5a9..8b09b174 100644 > > --- a/libmultipath/print.c > > +++ b/libmultipath/print.c > > @@ -541,6 +541,8 @@ snprint_chk_state (struct strbuf *buff, const struct path * pp) > > return append_strbuf_str(buff, "i/o timeout"); > > case PATH_DELAYED: > > return append_strbuf_str(buff, "delayed"); > > + case PATH_DISCONNECTED: > > + return append_strbuf_str(buff, "disconnected"); > > default: > > return append_strbuf_str(buff, "undef"); > > } > > diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c > > index 4c0fbcf3..9a17f1ed 100644 > > --- a/libmultipath/propsel.c > > +++ b/libmultipath/propsel.c > > @@ -1371,6 +1371,22 @@ out: > > return 0; > > } > > > > +int select_purge_disconnected (struct config *conf, struct multipath * mp) > > +{ > > + const char *origin; > > + > > + mp_set_mpe(purge_disconnected); > > + mp_set_ovr(purge_disconnected); > > + mp_set_hwe(purge_disconnected); > > + mp_set_conf(purge_disconnected); > > + mp_set_default(purge_disconnected, DEFAULT_PURGE_DISCONNECTED); > > +out: > > + condlog(3, "%s: purge_disconnected = %s %s", mp->alias, > > + (mp->purge_disconnected == PURGE_DISCONNECTED_ON)? "yes" : "no", > > + origin); > > + return 0; > > +} > > + > > int select_max_sectors_kb(struct config *conf, struct multipath * mp) > > { > > const char *origin; > > diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h > > index 55930050..c6c937c3 100644 > > --- a/libmultipath/propsel.h > > +++ b/libmultipath/propsel.h > > @@ -39,6 +39,7 @@ int select_marginal_path_err_rate_threshold(struct config *conf, struct multipat > > int select_marginal_path_err_recheck_gap_time(struct config *conf, struct multipath *mp); > > int select_marginal_path_double_failed_time(struct config *conf, struct multipath *mp); > > int select_ghost_delay(struct config *conf, struct multipath * mp); > > +int select_purge_disconnected (struct config *conf, struct multipath * mp); > > void reconcile_features_with_options(const char *id, char **features, > > int* no_path_retry, > > int *retain_hwhandler); > > diff --git a/libmultipath/structs.h b/libmultipath/structs.h > > index 9247e74f..81ac6be5 100644 > > --- a/libmultipath/structs.h > > +++ b/libmultipath/structs.h > > @@ -186,6 +186,12 @@ enum auto_resize_state { > > AUTO_RESIZE_GROW_SHRINK, > > }; > > > > +enum purge_disconnected_states { > > + PURGE_DISCONNECTED_UNDEF = YNU_UNDEF, > > + PURGE_DISCONNECTED_OFF = YNU_NO, > > + PURGE_DISCONNECTED_ON = YNU_YES, > > +}; > > + > > #define PROTOCOL_UNSET -1 > > > > enum scsi_protocol { > > @@ -382,6 +388,9 @@ struct path { > > int dmstate; > > int chkrstate; > > int oldstate; > > + bool purge_path; > > + int purge_retries; /* number of purge attempts for this path */ > > + unsigned int purge_cycle; /* last purge cycle that checked this path */ > > int failcount; > > int priority; > > int pgindex; > > @@ -483,6 +492,7 @@ struct multipath { > > int ghost_delay; > > int ghost_delay_tick; > > int queue_mode; > > + int purge_disconnected; > > unsigned int sync_tick; > > int checker_count; > > enum prio_update_type prio_update; > > diff --git a/libmultipath/sysfs.c b/libmultipath/sysfs.c > > index af27d107..2160d7fd 100644 > > --- a/libmultipath/sysfs.c > > +++ b/libmultipath/sysfs.c > > @@ -134,7 +134,7 @@ ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, > > size = write(fd, value, value_len); > > if (size < 0) { > > size = -errno; > > - condlog(3, "%s: write to %s failed: %s", __func__, > > + condlog(3, "%s: write to %s failed: %s", __func__, > > devpath, strerror(errno)); > > } else if (size < (ssize_t)value_len) > > condlog(3, "%s: underflow writing %zu bytes to %s. Wrote %zd bytes", > > @@ -144,6 +144,62 @@ ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, > > return size; > > } > > Unfortunately. This doesn't work at all. O_NONBLOCK doesn't mean > anything for sysfs writes. The only way to make the write non-blocking > would be to use libaio or io_uring. But we're already doing the write in > a separate thread without holding any locks, so we're fine with it > blocking. > > > > > +/* > > + * Non-blocking version of sysfs_attr_set_value for use with potentially > > + * blocking sysfs attributes (like SCSI "delete"). > > + * > > + * Returns: > > + * > 0: number of bytes written (success) > > + * -EAGAIN: write would block, caller should retry or wait for completion > > + * < 0: other error (negative errno) > > + */ > > +ssize_t sysfs_attr_set_value_nonblock(struct udev_device *dev, const char *attr_name, > > + const char * value, size_t value_len) > > +{ > > + const char *syspath; > > + char devpath[PATH_MAX]; > > + int fd = -1; > > + ssize_t size = -1; > > + > > + if (!dev || !attr_name || !value || !value_len) { > > + condlog(1, "%s: invalid parameters", __func__); > > + return -EINVAL; > > + } > > + > > + syspath = udev_device_get_syspath(dev); > > + if (!syspath) { > > + condlog(3, "%s: invalid udevice", __func__); > > + return -EINVAL; > > + } > > + if (safe_sprintf(devpath, "%s/%s", syspath, attr_name)) { > > + condlog(3, "%s: devpath overflow", __func__); > > + return -EOVERFLOW; > > + } > > + > > + condlog(4, "open '%s' (non-blocking)", devpath); > > + /* Open with O_NONBLOCK to avoid blocking in open() or write() */ > > + fd = open(devpath, O_WRONLY | O_NONBLOCK); > > + if (fd < 0) { > > + condlog(3, "%s: attribute '%s' cannot be opened: %s", > > + __func__, devpath, strerror(errno)); > > + return -errno; > > + } > > + pthread_cleanup_push(cleanup_fd_ptr, &fd); > > + > > + size = write(fd, value, value_len); > > + if (size < 0) { > > + size = -errno; > > + if (errno != EAGAIN) > > + condlog(3, "%s: write to %s failed: %s", __func__, > > + devpath, strerror(errno)); > > + } else if (size < (ssize_t)value_len) > > + condlog(3, "%s: underflow writing %zu bytes to %s. Wrote %zd bytes", > > + __func__, value_len, devpath, size); > > + > > + pthread_cleanup_pop(1); > > + return size; > > +} > > + > > int > > sysfs_get_size (struct path *pp, unsigned long long * size) > > { > > diff --git a/libmultipath/sysfs.h b/libmultipath/sysfs.h > > index 45f24c37..da2deaa3 100644 > > --- a/libmultipath/sysfs.h > > +++ b/libmultipath/sysfs.h > > @@ -10,6 +10,8 @@ > > int devt2devname (char *, int, const char *); > > ssize_t sysfs_attr_set_value(struct udev_device *dev, const char *attr_name, > > const char * value, size_t value_len); > > +ssize_t sysfs_attr_set_value_nonblock(struct udev_device *dev, const char *attr_name, > > + const char * value, size_t value_len); > > ssize_t sysfs_attr_get_value(struct udev_device *dev, const char *attr_name, > > char * value, size_t value_len); > > ssize_t sysfs_bin_attr_get_value(struct udev_device *dev, const char *attr_name, > > diff --git a/multipath/multipath.conf.5.in b/multipath/multipath.conf.5.in > > index 3c9ae097..84cd1a0a 100644 > > --- a/multipath/multipath.conf.5.in > > +++ b/multipath/multipath.conf.5.in > > @@ -1306,6 +1306,22 @@ The default is: \fBno\fR > > . > > . > > .TP > > +.B purge_disconnected > > +If set to > > +.I yes > > +, multipathd will automatically remove devices that are in a disconnected state. > > +A path is considered disconnected when the TUR (Test Unit Ready) path checker > > +receives the SCSI sense code "LOGICAL UNIT NOT SUPPORTED" (sense key 0x5, > > +ASC/ASCQ 0x25/0x00). This typically indicates that the LUN has been unmapped > > +or is no longer presented by the storage array. This option helps clean up > > +stale device entries that would otherwise remain in the system. > > +.RS > > +.TP > > +The default is: \fBno\fR > > +.RE > > +. > > +. > > +.TP > > .B disable_changed_wwids > > (Deprecated) This option is not supported anymore, and will be ignored. > > .RE > > @@ -1602,6 +1618,8 @@ section: > > .TP > > .B skip_kpartx > > .TP > > +.B purge_disconnected > > +.TP > > .B max_sectors_kb > > .TP > > .B ghost_delay > > @@ -1797,6 +1815,8 @@ section: > > .TP > > .B skip_kpartx > > .TP > > +.B purge_disconnected > > +.TP > > .B max_sectors_kb > > .TP > > .B ghost_delay > > @@ -1881,6 +1901,8 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections: > > .TP > > .B skip_kpartx > > .TP > > +.B purge_disconnected > > +.TP > > .B max_sectors_kb > > .TP > > .B ghost_delay > > diff --git a/multipathd/main.c b/multipathd/main.c > > index d11a8576..b105cd56 100644 > > --- a/multipathd/main.c > > +++ b/multipathd/main.c > > @@ -137,11 +137,14 @@ static enum force_reload_types reconfigure_pending = FORCE_RELOAD_NONE; > > pid_t daemon_pid; > > static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER; > > static pthread_cond_t config_cond; > > -static pthread_t check_thr, uevent_thr, uxlsnr_thr, uevq_thr, dmevent_thr, > > - fpin_thr, fpin_consumer_thr; > > -static bool check_thr_started, uevent_thr_started, uxlsnr_thr_started, > > - uevq_thr_started, dmevent_thr_started, fpin_thr_started, > > - fpin_consumer_thr_started; > > +static pthread_mutex_t purge_mutex = PTHREAD_MUTEX_INITIALIZER; > > +static pthread_cond_t purge_cond = PTHREAD_COND_INITIALIZER; > > +int devices_to_purge = 0; > > +static pthread_t check_thr, purge_thr, uevent_thr, uxlsnr_thr, uevq_thr, > > + dmevent_thr, fpin_thr, fpin_consumer_thr; > > +static bool check_thr_started, purge_thr_started, uevent_thr_started, > > + uxlsnr_thr_started, uevq_thr_started, dmevent_thr_started, > > + fpin_thr_started, fpin_consumer_thr_started; > > static int pid_fd = -1; > > > > static inline enum daemon_status get_running_state(bool *pending_reconfig) > > @@ -1089,6 +1092,109 @@ check_path_wwid_change(struct path *pp) > > return false; > > } > > This is problematic. Without holding the vecs lock, we cannot trust the > path to continue to exist. It could go away at any time, and we would be > pointing at freed memory. > > > +/* > > + * Information needed to delete a path via sysfs, copied while holding the lock > > + * so that the actual sysfs write can be done without holding vecs->lock. > > + */ > > +struct purge_path_info { > > + struct udev_device *udev; /* Reference to udev device (refcounted) */ > > + char dev_t[BLK_DEV_SIZE]; /* For logging */ > > + char alias[WWID_SIZE]; /* mpp alias for logging */ > > + struct path *pp; /* Path pointer for updating purge_path flag */ > > +}; > > + > > +/* > > + * Attempt to delete a path by writing to the SCSI device's sysfs delete attribute. > > + * This triggers kernel-level device removal. The actual cleanup of the path structure > > + * from pathvec happens later when a uevent arrives (handled by uev_remove_path). > > + * > > + * This function does NOT require vecs->lock to be held, as it operates on copied data. > > + * Uses non-blocking I/O to avoid blocking in scsi_remove_device(). > > + * > > + * Returns: > > + * true - sysfs write succeeded or device already being deleted > > + * false - sysfs write would block (EAGAIN), caller should retry > > + * > > + * Note: Errors other than EAGAIN are treated as success, because they typically > > + * indicate the device is already in SDEV_DEL or SDEV_CANCEL state, or the sysfs > > + * file doesn't exist (device already removed). > > + */ > > Without the vecs lock being held there nothing to make sure that the > path still exists. This means that it could get removed, at which point > the devname could get reused. Having a udev reference will not prevent > that. So, for instance sdc may now be a new device. Whether or not you > would delete this new device depends on whether it's sysfs path remains > the same as for the old device. The only real guarantee that the device > won't get reused is if someone has it open. > > > +static bool > > +delete_path_sysfs(struct purge_path_info *info) > > +{ > > + struct udev_device *ud; > > + ssize_t ret; > > + > > + if (!info->udev) > > + return true; /* No device, treat as success */ > > + > > + ud = udev_device_get_parent_with_subsystem_devtype(info->udev, > > + "scsi", "scsi_device"); > > + > > + if (!ud) { > > + /* No SCSI parent device, likely already removed */ > > + condlog(4, "%s: no SCSI parent device found, likely already removed", > > + info->dev_t); > > + return true; > > + } > > + > > + ret = sysfs_attr_set_value_nonblock(ud, "delete", "1", strlen("1")); > > + > > + if (ret == -EAGAIN) { > > + /* Write would block, caller should retry */ > > + condlog(4, "%s: delete would block, will retry", info->dev_t); > > + return false; > > + } > > + > > + if (ret < 0) { > > + /* > > + * Other errors (ENOENT, EINVAL, etc.) typically mean the device > > + * is already being deleted or is in SDEV_DEL/SDEV_CANCEL state. > > + * Treat as success since the end result is what we want. > > + */ > > + condlog(3, "%s: sysfs delete returned %zd (%s), treating as success", > > + info->dev_t, ret, strerror((int)-ret)); > > + return true; > > + } > > + > > + /* Success */ > > + condlog(2, "%s: initiated removal from %s", info->dev_t, info->alias); > > + return true; > > +} > > + > > +/* > > + * Prepare purge info for a path while holding vecs->lock. > > + * Takes a reference on the udev device so it remains valid after unlocking. > > + * Returns true if info was successfully prepared, false otherwise. > > + */ > > +static bool > > +prepare_purge_path_info(struct path *pp, struct purge_path_info *info) > > +{ > > + if (!pp->udev || !pp->mpp) > > + return false; > > + > > + info->udev = udev_device_ref(pp->udev); > > + if (!info->udev) > > + return false; > > + > > + strlcpy(info->dev_t, pp->dev_t, sizeof(info->dev_t)); > > + strlcpy(info->alias, pp->mpp->alias, sizeof(info->alias)); > > + info->pp = pp; > > + return true; > > +} > > This has to be implemented as function that can get run by a cleanup > handler. All multipathd threads must cleanup any state (references, > open fd, locks) if they are cancelled. > > > +/* > > + * Clean up purge info after use. > > + */ > > +static void > > +cleanup_purge_path_info(struct purge_path_info *info) > > +{ > > + if (info->udev) { > > + udev_device_unref(info->udev); > > + info->udev = NULL; > > + } > > +} > > + > > /* > > * uev_add_path can call uev_update_path, and uev_update_path can call > > * uev_add_path > > @@ -2031,6 +2137,10 @@ reinstate_path (struct path * pp) > > condlog(0, "%s: reinstate failed", pp->dev_t); > > else { > > condlog(2, "%s: reinstated", pp->dev_t); > > + if (pp->purge_path) { > > On the off chance that a path gets disconnected, not successfully > purged, and then reconnected, we should probably clear pp->purge_cycle > here as well. There is an admittedly very rare chance that the path > could later be again disconnected after purgeloop's current_cycle rolled > over and got back to the old value of pp->purge_cycle. This would keep > it from getting handled, since it would appear to have already been > handled in the current cycle. > > > + pp->purge_path = false; > > + pp->purge_retries = 0; > > + } > > update_queue_mode_add_path(pp->mpp); > > } > > } > > @@ -2474,6 +2584,20 @@ get_new_state(struct path *pp) > > if (newstate == PATH_REMOVED) > > newstate = PATH_DOWN; > > > > + /* > > + * For PATH_DISOCONNECTED mark the path as OK to purge if that is > > + * enabled. Whether or not purge is enabled mark the path as down. > > + */ > > + if (newstate == PATH_DISCONNECTED) { > > + if (pp->mpp->purge_disconnected == PURGE_DISCONNECTED_ON && > > + !pp->purge_path) { > > + condlog(2, "%s: mark (%s) path for purge", > > + pp->dev, checker_state_name(newstate)); > > + pp->purge_path = true; > > + } > > + newstate = PATH_DOWN; > > + } > > + > > if (newstate == PATH_WILD || newstate == PATH_UNCHECKED) { > > condlog(2, "%s: unusable path (%s) - checker failed", > > pp->dev, checker_state_name(newstate)); > > @@ -3022,6 +3146,25 @@ update_paths(struct vectors *vecs, int *num_paths_p, time_t start_secs) > > return CHECKER_FINISHED; > > } > > > > +/* > > + * Check if there are newly marked paths to purge. > > + * Only returns true for paths with purge_path=true and purge_retries=0, > > + * which indicates they were just marked for purging and haven't been > > + * attempted yet. This prevents signaling the purge thread unnecessarily > > + * for paths that are already being retried. > > + */ > > +static bool > > +has_paths_to_purge(struct vectors *vecs) > > +{ > > + int i; > > + struct path *pp; > > + > > + vector_foreach_slot(vecs->pathvec, pp, i) > > + if (pp->purge_path && pp->purge_retries == 0) > > + return true; > > + return false; > > +} > > + > > static void enable_pathgroups(struct multipath *mpp) > > { > > struct pathgroup *pgp; > > @@ -3125,6 +3268,7 @@ checkerloop (void *ap) > > int num_paths = 0, strict_timing; > > unsigned int ticks = 0; > > enum checker_state checker_state = CHECKER_STARTING; > > + bool devs_to_purge = false; > > > > if (set_config_state(DAEMON_RUNNING) != DAEMON_RUNNING) > > /* daemon shutdown */ > > @@ -3168,8 +3312,10 @@ checkerloop (void *ap) > > if (checker_state == CHECKER_UPDATING_PATHS) > > checker_state = update_paths(vecs, &num_paths, > > start_time.tv_sec); > > - if (checker_state == CHECKER_FINISHED) > > + if (checker_state == CHECKER_FINISHED) { > > + devs_to_purge = has_paths_to_purge(vecs); > > checker_finished(vecs, ticks); > > + } > > lock_cleanup_pop(vecs->lock); > > } > > > > @@ -3192,6 +3338,13 @@ checkerloop (void *ap) > > (long)diff_time.tv_sec); > > } > > > > + if (devs_to_purge) { > > + pthread_mutex_lock(&purge_mutex); > > + devices_to_purge = 1; > > + pthread_cond_signal(&purge_cond); > > + pthread_mutex_unlock(&purge_mutex); > > + } > > + > > if (foreign_tick == 0) { > > conf = get_multipath_config(); > > foreign_tick = conf->max_checkint; > > @@ -3229,6 +3382,234 @@ checkerloop (void *ap) > > return NULL; > > } > > > > +/* > > + * Purge thread: removes disconnected paths by writing to sysfs. > > + * > > + * This thread is signaled by the checker thread when paths marked with > > + * purge_path=true are detected. It attempts to delete these paths by > > + * writing "1" to their SCSI device's sysfs "delete" attribute. > > + * > > + * The actual cleanup of path structures from pathvec happens asynchronously > > + * when the kernel sends a remove uevent (handled by uev_remove_path). > > + * If the uevent doesn't arrive or the sysfs write fails, paths will be > > + * retried on subsequent purge cycles with a timeout mechanism. > > + * > > + * To avoid holding vecs->lock for extended periods and starving other threads, > > + * this function processes one path at a time: lock -> handle one path -> unlock. > > + * Paths track which cycle they were checked in to avoid reprocessing. > > + */ > > +#define PURGE_TIMEOUT_CYCLES 10 /* ~50 seconds at 5 second check intervals */ > > + > > +static void > > +reset_purge_cycle_stats(unsigned int *purged, unsigned int *failed, > > + unsigned int *pending, bool *found) > > +{ > > + *purged = 0; > > + *failed = 0; > > + *pending = 0; > > + *found = false; > > +} > > How about a simplifying the purge thread slightly. When the checker gets > a path it is supposed to purge, it sets purge_path, just like now. I > don't think we need to track purge_retries at all. When the purge loop > is triggered, it locks the vecs lock, and finds a single path to purge, > like now. The only info it needs to save is the udev device (with a new > refcount taken like now) and a dup of pp->fd (You need this to guarantee > that the devnode won't get reused if the device is deleted and a new one > is re-added). You could also save the name, but > udev_device_get_devnode() will get it for you. Like I mentioned above, > you need to have a pthread cleanup handler that gets run to clean these > up if they are set and the thread is cancelled, just like how the vecs > lock is handled. Then purgeloop() clears pp->purge_path, drops the vecs > lock, does the sysfs delete, and cleans up the purge_info by calling > pthread_cleanup_pop(). Finally it loops back to reacquiring the vecs > lock and finding the next path to delete. I don't really see a value in > all the stats tracking. If it goes through all the paths, and doesn't > find one in need of purging, it waits, just like now. > > I'm not even sure we need to keep the cycle checking code. If we really > think the sysfs call can fail, and leave the path around, then it may be > necessary. In that case, if the purge thread took a long time (longer > than 5 seconds) the checker could set pp->pruge_path again before the > purge thread got to processing the later paths in need of purging. We > could get stuck processing the same path over and over. But this only > really happens if purging takes a long time and then fails. > > I also don't think the sleeping code is necessary. If there are paths to > purge, they are either newly added, or the purge thread has taken long > that the checker has run again and re-added them. Either way, it seems > like we should handle them right away. > > Martin's point (that this could be done by an entirely separate program > that get's called with a scsi device, opens it (to make sure it doesn't > get removed), verifys that it is disconnected, and then removes it via > sysfs) is reasonable. But if we want to solve this in multipathd, > limiting the vecs lock holding to just when we set up the Oops. I'm not sure how the rest of this got deleted. It said something like: ... limiting the vecs lock holding to just when we set up the purge_info should be fine. -Ben > > > +static void * > > +purgeloop (void *ap) > > +{ > > + struct vectors *vecs; > > + unsigned int i; > > + struct path *pp; > > + unsigned int current_cycle = 1; > > + unsigned int devices_purged_total = 0; > > + unsigned int devices_failed_total = 0; > > + unsigned int devices_pending_total = 0; > > + bool found_path_this_cycle = false; > > + bool cycle_complete_logged = false; > > + vecs = (struct vectors *)ap; > > + > > + pthread_cleanup_push(rcu_unregister, NULL); > > + rcu_register_thread(); > > + mlockall(MCL_CURRENT | MCL_FUTURE); > > + > > + while (1) { > > + bool path_handled = false; > > + struct purge_path_info purge_info = { .udev = NULL }; > > + bool do_purge = false; > > + char dev_name[FILE_NAME_SIZE]; > > + > > + /* > > + * Lock and search for one path that needs processing. > > + * Copy the necessary info while holding the lock, then > > + * release the lock before doing any blocking sysfs operations. > > + */ > > we need to pthread_cleanup_push a handle that cleans up purge_info (if > it's set) here. > > > + pthread_cleanup_push(cleanup_lock, &vecs->lock); > > + lock(&vecs->lock); > > + pthread_testcancel(); > > + > > + vector_foreach_slot (vecs->pathvec, pp, i) { > > + if (!pp->purge_path || pp->purge_cycle == current_cycle) > > + continue; > > + > > + pp->purge_cycle = current_cycle; > > + path_handled = true; > > + > > I think we can remove all this retries code. > > > + /* Increment timeout counter for this purge attempt */ > > + pp->purge_retries++; > > + > > + if (pp->purge_retries >= PURGE_TIMEOUT_CYCLES) { > > + /* > > + * Timeout: path couldn't be deleted after multiple attempts. > > + * This likely means the device still exists in the kernel > > + * (perhaps it came back online, or the delete is stuck). > > + * > > + * We should NOT force-remove it from multipathd's state, > > + * as that would create a mismatch between multipathd's view > > + * and the kernel's view of existing devices. > > + * > > + * Instead, clear the purge flag and stop trying. The path > > + * will remain in multipathd. If it's truly disconnected, > > + * it will be marked for purge again on the next check cycle. > > + */ > > + condlog(1, "%s: purge timeout after %d attempts, giving up. " > > + "Device may still exist in kernel.", > > + pp->dev, pp->purge_retries); > > + pp->purge_path = false; > > + pp->purge_retries = 0; > > + devices_failed_total++; > > + } else { > > + /* > > + * Prepare info for sysfs deletion. > > + * Copy all needed data while holding the lock. > > + */ > > + if (prepare_purge_path_info(pp, &purge_info)) { > > + do_purge = true; > > + strlcpy(dev_name, pp->dev, sizeof(dev_name)); > > + } else { > > + /* Can't prepare info, will retry next cycle */ > > + devices_pending_total++; > > + condlog(4, "%s: failed to prepare purge info, will retry", > > + pp->dev); > > + } > > + } > > + break; > > + } > > + > > + lock_cleanup_pop(vecs->lock); > > + > > + /* > > + * Now we've released the lock. Do the non-blocking sysfs operation > > + * without holding vecs->lock. > > + */ > > + if (do_purge) { > > + bool success = delete_path_sysfs(&purge_info); > > I don't really see the benefit of all this work done with the lock > reacquired. > > > + /* > > + * Re-acquire lock to update path state. > > + * The path might have been removed while we didn't hold the lock, > > + * so we need to verify it still exists. > > + */ > > + pthread_cleanup_push(cleanup_lock, &vecs->lock); > > + lock(&vecs->lock); > > + pthread_testcancel(); > > + > > + /* Verify the path still exists in pathvec */ > > + if (find_slot(vecs->pathvec, purge_info.pp) != -1) { > > + if (success) { > > + /* Success: sysfs write succeeded, uevent expected */ > > + devices_purged_total++; > > + purge_info.pp->purge_path = false; > > + purge_info.pp->purge_retries = 0; > > + condlog(3, "%s: path purge successful", dev_name); > > + } else { > > + /* Will retry on next cycle */ > > + devices_pending_total++; > > + condlog(4, "%s: path purge failed, will retry (attempt %d/%d)", > > + dev_name, purge_info.pp->purge_retries, > > + PURGE_TIMEOUT_CYCLES); > > + } > > + } else { > > + /* > > + * Path was removed while we were doing sysfs write. > > + * This is actually success - the path is gone, which is what we wanted. > > + */ > > + if (success) { > > + devices_purged_total++; > > + condlog(3, "%s: path removed during purge operation (counted as success)", > > + dev_name); > > + } else { > > + /* Path removed but our sysfs write failed - don't count it */ > > + condlog(4, "%s: path removed during failed purge operation", dev_name); > > + } > > + } > > + > > + lock_cleanup_pop(vecs->lock); > > + cleanup_purge_path_info(&purge_info); > > + } > > We clean up the purge_info() here with pthread_cleanup_pop() > > > Again, I don't think we need all this logging and sleeping code. > If we handled a path we just loop. If not we increment current_cycle > (assuming we think it is necessary) and wait to get woken back up. > > + /* > > + * If we didn't find any path to process, we've completed a cycle. > > + * Check if we should wait for more work or start a new cycle. > > + */ > > + if (path_handled) > > + found_path_this_cycle = true; > > + else { > > + if (found_path_this_cycle && !cycle_complete_logged) { > > + /* Completed a cycle, log statistics */ > > + condlog(2, "purge cycle %u complete: %u purged, %u failed, %u pending", > > + current_cycle, devices_purged_total, > > + devices_failed_total, devices_pending_total); > > + cycle_complete_logged = true; > > + > > + /* Check if we need to retry pending paths */ > > + if (devices_pending_total > 0) { > > + struct timespec wait_time; > > + struct config *conf; > > + > > + conf = get_multipath_config(); > > + wait_time.tv_sec = conf->checkint; > > + wait_time.tv_nsec = 0; > > + put_multipath_config(conf); > > + > > + condlog(3, "sleeping %ld seconds before next purge cycle", > > + (long)wait_time.tv_sec); > > + nanosleep(&wait_time, NULL); > > + > > + /* Start a new cycle for pending paths */ > > + current_cycle++; > > + reset_purge_cycle_stats(&devices_purged_total, > > + &devices_failed_total, > > + &devices_pending_total, > > + &found_path_this_cycle); > > + cycle_complete_logged = false; > > + condlog(3, "starting purge cycle %u", current_cycle); > > + continue; > > + } > > + > > + /* No pending paths, prepare for next cycle */ > > Just to make sure that paths don't get skipped, we should make sure that > if current_cycle rolls over to 0, it's incremented again. That way, > paths with pp->purge_cycle = 0 will always get checked. > > -Ben > > > + current_cycle++; > > + reset_purge_cycle_stats(&devices_purged_total, > > + &devices_failed_total, > > + &devices_pending_total, > > + &found_path_this_cycle); > > + cycle_complete_logged = false; > > + } > > + > > + /* No paths to process, wait for signal */ > > + if (cycle_complete_logged) { > > + pthread_mutex_lock(&purge_mutex); > > + while (!devices_to_purge) { > > + condlog(3, "waiting on devices to purge"); > > + pthread_cond_wait(&purge_cond, &purge_mutex); > > + } > > + devices_to_purge = 0; > > + pthread_mutex_unlock(&purge_mutex); > > + > > + condlog(3, "starting purge cycle %u", current_cycle); > > + } > > + } > > + } > > + > > + pthread_cleanup_pop(1); > > + return NULL; > > +} > > + > > static int > > configure (struct vectors * vecs, enum force_reload_types reload_type) > > { > > @@ -3669,6 +4050,8 @@ static void cleanup_threads(void) > > > > if (check_thr_started) > > pthread_cancel(check_thr); > > + if (purge_thr_started) > > + pthread_cancel(purge_thr); > > if (uevent_thr_started) > > pthread_cancel(uevent_thr); > > if (uxlsnr_thr_started) > > @@ -3685,6 +4068,8 @@ static void cleanup_threads(void) > > > > if (check_thr_started) > > pthread_join(check_thr, NULL); > > + if (purge_thr_started) > > + pthread_join(purge_thr, NULL); > > if (uevent_thr_started) > > pthread_join(uevent_thr, NULL); > > if (uxlsnr_thr_started) > > @@ -3937,6 +4322,11 @@ child (__attribute__((unused)) void *param) > > goto failed; > > } else > > check_thr_started = true; > > + if ((rc = pthread_create(&purge_thr, &misc_attr, purgeloop, vecs))) { > > + condlog(0,"failed to create purge loop thread: %d", rc); > > + goto failed; > > + } else > > + purge_thr_started = true; > > if ((rc = pthread_create(&uevq_thr, &misc_attr, uevqloop, vecs))) { > > condlog(0, "failed to create uevent dispatcher: %d", rc); > > goto failed; > > -- > > 2.51.2 ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] mulitpath-tools Add purge capability to multipath-tools 2025-12-04 19:33 ` Benjamin Marzinski 2025-12-04 20:08 ` Benjamin Marzinski @ 2025-12-08 12:27 ` Martin Wilck 2025-12-08 19:30 ` Benjamin Marzinski 1 sibling, 1 reply; 7+ messages in thread From: Martin Wilck @ 2025-12-08 12:27 UTC (permalink / raw) To: Benjamin Marzinski, Brian Bunker; +Cc: dm-devel, Krishna Kant On Thu, 2025-12-04 at 14:33 -0500, Benjamin Marzinski wrote: > On Wed, Dec 03, 2025 at 08:19:33AM -0800, Brian Bunker wrote: > How about a simplifying the purge thread slightly. When the checker > gets > a path it is supposed to purge, it sets purge_path, just like now. I > don't think we need to track purge_retries at all. When the purge > loop > is triggered, it locks the vecs lock, and finds a single path to > purge, > like now. The only info it needs to save is the udev device (with a > new > refcount taken like now) and a dup of pp->fd (You need this to > guarantee > that the devnode won't get reused if the device is deleted and a new > one > is re-added). You could also save the name, but > udev_device_get_devnode() will get it for you. Like I mentioned > above, > you need to have a pthread cleanup handler that gets run to clean > these > up if they are set and the thread is cancelled, just like how the > vecs > lock is handled. Then purgeloop() clears pp->purge_path, drops the > vecs > lock, does the sysfs delete, and cleans up the purge_info by calling > pthread_cleanup_pop(). Finally it loops back to reacquiring the vecs > lock and finding the next path to delete. I don't really see a value > in > all the stats tracking. If it goes through all the paths, and doesn't > find one in need of purging, it waits, just like now. I'd go one step further and suggest an approach similar to what we do in the uevent handler, for which we have a separate list of uevents that shares no data with the main multipath threads. Roughly like this: enum purge_state { PURGE_REQUESTED = 1, PURGE_DONE, PURGE_ERROR }; struct path_to_purge { struct list_head queue; struct udev_device *udev; /* one ref taken while on the list */ int fd; /* dup of pp->fd */ enum purge_state state; }; // global list_head for the queue of paths to be purged. static LIST_HEAD(purge_queue); If multipathd detects a path to purge in the checker thread, it allocates and adds a path_to_purge entry with state PURGE_REQUESTED to the queue and triggers the purge thread e.g. with a pthreads condition variable. The purge thread then walks through the list (without modifying the list itself), sends a delete request to the kernel and atomically sets the state field to PURGE_DONE when successful. multipathd's checker thread walks the list next time and removes entries that have state != PURGE_REQUESTED, closing the fd and unref- ing the struct udev. IOW the resource allocation/freeing for this list and its elements would be handled by the checker alone. The checker would not immediately drop internal references to the path; this would be handled by the remove uevent handler as usual. The purge thread would never write to the list, except to the state attribute of list members (which we could protect from being freed via RCU). This would make the purge thread almost an independent process. Opinions? Martin ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] mulitpath-tools Add purge capability to multipath-tools 2025-12-08 12:27 ` Martin Wilck @ 2025-12-08 19:30 ` Benjamin Marzinski 2025-12-09 11:01 ` Martin Wilck 0 siblings, 1 reply; 7+ messages in thread From: Benjamin Marzinski @ 2025-12-08 19:30 UTC (permalink / raw) To: Martin Wilck; +Cc: Brian Bunker, dm-devel, Krishna Kant On Mon, Dec 08, 2025 at 01:27:32PM +0100, Martin Wilck wrote: > On Thu, 2025-12-04 at 14:33 -0500, Benjamin Marzinski wrote: > > On Wed, Dec 03, 2025 at 08:19:33AM -0800, Brian Bunker wrote: > > > How about a simplifying the purge thread slightly. When the checker > > gets > > a path it is supposed to purge, it sets purge_path, just like now. I > > don't think we need to track purge_retries at all. When the purge > > loop > > is triggered, it locks the vecs lock, and finds a single path to > > purge, > > like now. The only info it needs to save is the udev device (with a > > new > > refcount taken like now) and a dup of pp->fd (You need this to > > guarantee > > that the devnode won't get reused if the device is deleted and a new > > one > > is re-added). You could also save the name, but > > udev_device_get_devnode() will get it for you. Like I mentioned > > above, > > you need to have a pthread cleanup handler that gets run to clean > > these > > up if they are set and the thread is cancelled, just like how the > > vecs > > lock is handled. Then purgeloop() clears pp->purge_path, drops the > > vecs > > lock, does the sysfs delete, and cleans up the purge_info by calling > > pthread_cleanup_pop(). Finally it loops back to reacquiring the vecs > > lock and finding the next path to delete. I don't really see a value > > in > > all the stats tracking. If it goes through all the paths, and doesn't > > find one in need of purging, it waits, just like now. > > I'd go one step further and suggest an approach similar to what we do > in the uevent handler, for which we have a separate list of uevents > that shares no data with the main multipath threads. > > Roughly like this: > > enum purge_state { > PURGE_REQUESTED = 1, > PURGE_DONE, > PURGE_ERROR > }; > > struct path_to_purge { > struct list_head queue; > struct udev_device *udev; /* one ref taken while on the list > */ > int fd; /* dup of pp->fd */ > enum purge_state state; > }; > > // global list_head for the queue of paths to be purged. > static LIST_HEAD(purge_queue); > > If multipathd detects a path to purge in the checker thread, it > allocates and adds a path_to_purge entry with state PURGE_REQUESTED to > the queue and triggers the purge thread e.g. with a pthreads condition > variable. > > The purge thread then walks through the list (without modifying the > list itself), sends a delete request to the kernel and atomically sets > the state field to PURGE_DONE when successful. > > multipathd's checker thread walks the list next time and removes > entries that have state != PURGE_REQUESTED, closing the fd and unref- > ing the struct udev. IOW the resource allocation/freeing for this list > and its elements would be handled by the checker alone. The checker > would not immediately drop internal references to the path; this would > be handled by the remove uevent handler as usual. > > The purge thread would never write to the list, except to the state > attribute of list members (which we could protect from being freed via > RCU). > > This would make the purge thread almost an independent process. I'm a little confused about this design. Lets say a path gets removed while on this list. If the purgeloop doesn't modify the list, then how does the remove function know that it's safe to remove the item. The purgeloop could be in the process notifying sysfs right now, making use of the udev device. Similarly multipathd will change a path's udev device when it gets a new uevent. How does it know if it's safe to update the pointer on this list. Someone has to unref the purgeloops extra ref, and multipathd can't, once it's updated pp->udev and forgotten the old value. You might be able to solve this with more purge_states so that the purgeloop could signal that it's actually working on the device, but I'm pretty sure you'd need actual locking to avoid races here. Aside from the timing issue, we're adding multiple places where multipathd needs to go checking and modifying this list as part of other actions, which seems like it gets pretty fiddly. If we instead say that purgeloop is in charge of cleaning up the list, there would still need to be locking, because multipathd might want to take items off the list, if they become reconnected or removed before the thread gets around to purging. In theory, you could just not modify the list in these cases, and just let the purgeloop run on those paths, but then you would be removing paths that have been reconnected, and possibly keeping a number of references to udev devices and fds for paths that are gone (assuming that deleting devices can actually block for a chunk of time). Also, multipathd would need to check the list so paths didn't get added multiple times, if the checker re-ran before the path was purged. Again, unless you were fine with keeping unnecessary extra references to the device. My original idea was to try to keep purgeloop from needing to grab the vecs lock, but I just kept coming up with extra complications that this would cause. So I settled on something where the normal multipath code just needs to set a path variable to communicate with the purgeloop, and the purgeloop just needs to hold the vecs lock while looking through the path list until it finds a device in need of purging. My other option make the purgeloop do much more work. The idea was that multipathd just put the pathname on a list when it found a disconnected path, and the purge loop would pull items off that list. This would still need locking, but you could probably do it with rcu locking. Alternatively, you could use sockets to send the pathnames to the purgeloop thread, to avoid locking and possibly make the purgeloop it's own process completely. For each path, the purgeloop would 1. open the path device (to make sure it didn't get deleted and reused) 2. check the path holders to make sure it was part of a multipath device 3. run a TUR check on the path to make sure that it was still disconnected 4. delete the path, if so. 5. close the path device. This is completely disconnected from multipath, minus the pathname communication, and should be able to make the right decisions about paths that were removed or reconnected, without sharing any resources that need cleaing up. It just makes the purgeloop duplicate a bunch of work that multipathd already did. Thoughts? Am I misunderstanding your design? -Ben > Opinions? > > Martin ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] mulitpath-tools Add purge capability to multipath-tools 2025-12-08 19:30 ` Benjamin Marzinski @ 2025-12-09 11:01 ` Martin Wilck 2025-12-10 19:53 ` Benjamin Marzinski 0 siblings, 1 reply; 7+ messages in thread From: Martin Wilck @ 2025-12-09 11:01 UTC (permalink / raw) To: Benjamin Marzinski, Brian Bunker; +Cc: dm-devel, Krishna Kant On Mon, 2025-12-08 at 14:30 -0500, Benjamin Marzinski wrote: > On Mon, Dec 08, 2025 at 01:27:32PM +0100, Martin Wilck wrote: > > On Thu, 2025-12-04 at 14:33 -0500, Benjamin Marzinski wrote: > > > On Wed, Dec 03, 2025 at 08:19:33AM -0800, Brian Bunker wrote: > > > > > How about a simplifying the purge thread slightly. When the > > > checker > > > gets > > > a path it is supposed to purge, it sets purge_path, just like > > > now. I > > > don't think we need to track purge_retries at all. When the purge > > > loop > > > is triggered, it locks the vecs lock, and finds a single path to > > > purge, > > > like now. The only info it needs to save is the udev device (with > > > a > > > new > > > refcount taken like now) and a dup of pp->fd (You need this to > > > guarantee > > > that the devnode won't get reused if the device is deleted and a > > > new > > > one > > > is re-added). You could also save the name, but > > > udev_device_get_devnode() will get it for you. Like I mentioned > > > above, > > > you need to have a pthread cleanup handler that gets run to clean > > > these > > > up if they are set and the thread is cancelled, just like how the > > > vecs > > > lock is handled. Then purgeloop() clears pp->purge_path, drops > > > the > > > vecs > > > lock, does the sysfs delete, and cleans up the purge_info by > > > calling > > > pthread_cleanup_pop(). Finally it loops back to reacquiring the > > > vecs > > > lock and finding the next path to delete. I don't really see a > > > value > > > in > > > all the stats tracking. If it goes through all the paths, and > > > doesn't > > > find one in need of purging, it waits, just like now. > > > > I'd go one step further and suggest an approach similar to what we > > do > > in the uevent handler, for which we have a separate list of uevents > > that shares no data with the main multipath threads. > > > > Roughly like this: > > > > enum purge_state { > > PURGE_REQUESTED = 1, > > PURGE_DONE, > > PURGE_ERROR > > }; > > > > struct path_to_purge { > > struct list_head queue; > > struct udev_device *udev; /* one ref taken while on the > > list > > */ > > int fd; /* dup of pp->fd */ > > enum purge_state state; > > }; > > > > // global list_head for the queue of paths to be purged. > > static LIST_HEAD(purge_queue); > > > > If multipathd detects a path to purge in the checker thread, it > > allocates and adds a path_to_purge entry with state PURGE_REQUESTED > > to > > the queue and triggers the purge thread e.g. with a pthreads > > condition > > variable. > > > > The purge thread then walks through the list (without modifying the > > list itself), sends a delete request to the kernel and atomically > > sets > > the state field to PURGE_DONE when successful. > > > > multipathd's checker thread walks the list next time and removes > > entries that have state != PURGE_REQUESTED, closing the fd and > > unref- > > ing the struct udev. IOW the resource allocation/freeing for this > > list > > and its elements would be handled by the checker alone. The checker > > would not immediately drop internal references to the path; this > > would > > be handled by the remove uevent handler as usual. > > > > The purge thread would never write to the list, except to the state > > attribute of list members (which we could protect from being freed > > via > > RCU). > > > > This would make the purge thread almost an independent process. > > I'm a little confused about this design. Lets say a path gets removed > while on this list. If the purgeloop doesn't modify the list, then > how > does the remove function know that it's safe to remove the item. > The > purgeloop could be in the process notifying sysfs right now, making > use > of the udev device. My thinking was that we'd not free the items as long as the state is PURGE_REQUESTED, and the purge thread wouldn't set that state until it's done with the item. That should be safe during normal operations. We could of course add PURGE_INPROGRESS, mainly to protect against races during shutdown. But AFAICS, for that it should be sufficient to join the purge thread before freeing the the queue, which we need to do anyway. > Similarly multipathd will change a path's udev > device when it gets a new uevent. How does it know if it's safe to > update the pointer on this list. Good point (not sure how wise it is that multipathd does this, but that's a different discussion). I believe it's not that bad, though. The purge thread would have its own reference (the _ref() being done by the checker before adding the element to the queue), so that the udev_device won't be freed as long as it's working on it, even if multipathd unref()s and forgets it. When the purge thread is done with a path, it drops the ref before changing the state. The checker should then _not_ drop the ref any more when it frees the queue item. If multipathd has allocated a different struct udev_device in the meantime, it will just continue using that. Similar for the dup'd fd. > Someone has to unref the purgeloops > extra ref, and multipathd can't, once it's updated pp->udev and > forgotten the old value. You might be able to solve this with more > purge_states so that the purgeloop could signal that it's actually > working on the device, but I'm pretty sure you'd need actual locking > to > avoid races here. Aside from the timing issue, we're adding multiple > places where multipathd needs to go checking and modifying this list > as > part of other actions, which seems like it gets pretty fiddly. I don't see why this would be necessary. In my mind, we need one extra function in checker_finished() that takes care of the purge queue. checker_finished() will be aware of both unavailable and reinstated paths. This will require some locking around the queue, I guess. > If we instead say that purgeloop is in charge of cleaning up the > list, > there would still need to be locking, because multipathd might want > to > take items off the list, if they become reconnected or removed before > the thread gets around to purging. Right. That's why I proposed that the purge thread would never modify the queue. > In theory, you could just not modify > the list in these cases, and just let the purgeloop run on those > paths, > but then you would be removing paths that have been reconnected, This can't be avoided anyway. It's a generic risk of the purging approach. With any design, there's a time window between the purge thread being notified about a path and that path actually being purged, and the path can change state any time during this time window, meaning that we may be purging a healthy device. @Brian: AFAICS this problem affects your original design as well. I'm wondering — if this happens, would we ever get this path back? The safest method to protect against that is the purge thread double- checking the state before deleting, using TUR (or RTPG?). But even that isn't safe, the path could change state in the tiny time window between check and deletion. The purge thread would need to check the state _after_ deleting the Linux SCSI device to avoid a race, and re-probe the device in that case. But it isn't clear which device that TUR could be sent to, with the Linux device being deleted. We could delete the disk first and double-check via the sg device, maybe. When multipathd reinstates an item, it could walk the list and atomically change the state to e.g. PURGE_REINSTATED. That would add complexity; I'm not sure if it would buy us much in terms of safety against paths being wrongly deleted. It would cover only the time window between the checker noticing that the path is reinstated and the deletion, which is smaller than the time window between path checks. The fact that multipathd will carry around references to a path that it's going to delete in the purge thread is another issue. But I think we agree that this is non-fatal, and that future "remove" uevents will fix this situation. > and > possibly keeping a number of references to udev devices and fds for > paths that are gone (assuming that deleting devices can actually > block > for a chunk of time). That can happen, yes, but do you see a solution that avoids this? We do need to keep references to the device in some way until we actually delete it. (Edit: reading further, I saw you other proposal... see below). > Also, multipathd would need to check the list so > paths didn't get added multiple times, if the checker re-ran before > the > path was purged. Right. Not a big issue IMO. The same applies to passing path names or devt to the purge thread; it's even more difficult in that case, no? > Again, unless you were fine with keeping unnecessary > extra references to the device. > > > My original idea was to try to keep purgeloop from needing to grab > the > vecs lock, but I just kept coming up with extra complications that > this > would cause. So I settled on something where the normal multipath > code > just needs to set a path variable to communicate with the purgeloop, > and > the purgeloop just needs to hold the vecs lock while looking through > the > path list until it finds a device in need of purging. Unless I am mistaken, this has similar problems as my approach, just in different ways. After scanning the pathvec, the purge thread will need to hold some extra references to the paths it's going to work on, or copy data structures; otherwise they can go away under it. Just like above, the paths could be reinstated just after the purge loop is done reading the list. > My other option make the purgeloop do much more work. The idea was > that > multipathd just put the pathname on a list when it found a > disconnected > path, and the purge loop would pull items off that list. This would > still need locking, but you could probably do it with rcu locking. > Alternatively, you could use sockets to send the pathnames to the > purgeloop thread, to avoid locking and possibly make the purgeloop > it's > own process completely. If we do this, I'd vote for dev_t rather that path names. I thought about it (remember, using a pipe was my first idea) and came to the conclusion that the socket/pipe approach has no real benefit over the linked list in shared memory. As you pointed out yourself, we need to keep the fd open to avoid the device being replaced by something else in the kernel. If we just path names or dev_t, that isn't guaranteed any more. Similar for the udev_device, to me it seems to be a good thing to be sure not to free it before we know that the path is gone. That's why I came to think that passing udev_device and fd is safer than just a name. > For each path, the purgeloop would > > 1. open the path device (to make sure it didn't get deleted and > reused) What if the device is removed in multipathd before this happens? > 2. check the path holders to make sure it was part of a multipath > device So orphaned devices won't be deleted? AFAICS that's not in Brian's proposal. But that's orthogonal to the rest of the discussion here. > 3. run a TUR check on the path to make sure that it was still > disconnected > 4. delete the path, if so. > 5. close the path device. > > This is completely disconnected from multipath, minus the pathname > communication, and should be able to make the right decisions about > paths that were removed or reconnected, without sharing any resources > that need cleaing up. It just makes the purgeloop duplicate a bunch > of > work that multipathd already did. > > Thoughts? Am I misunderstanding your design? I don't think we're that far away from each other. As written above, I think that handling the resources (udev_device and fd) in multipathd actually has a few advantages, and that I (perhaps naïvely) think it can be cleanly handled. My proposal seems to be somewhat between the two you have put forward. It's not a religious issue to me in general, and I'm certain that my idea is far from perfect. Just that, as I wrote before, the less the purge thread accesses multipathd's internal data structures, the better. Regards, Martin ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] mulitpath-tools Add purge capability to multipath-tools 2025-12-09 11:01 ` Martin Wilck @ 2025-12-10 19:53 ` Benjamin Marzinski 0 siblings, 0 replies; 7+ messages in thread From: Benjamin Marzinski @ 2025-12-10 19:53 UTC (permalink / raw) To: Martin Wilck; +Cc: Brian Bunker, dm-devel, Krishna Kant On Tue, Dec 09, 2025 at 12:01:17PM +0100, Martin Wilck wrote: > On Mon, 2025-12-08 at 14:30 -0500, Benjamin Marzinski wrote: > > On Mon, Dec 08, 2025 at 01:27:32PM +0100, Martin Wilck wrote: > > > On Thu, 2025-12-04 at 14:33 -0500, Benjamin Marzinski wrote: > > > > On Wed, Dec 03, 2025 at 08:19:33AM -0800, Brian Bunker wrote: > > > > > > > How about a simplifying the purge thread slightly. When the > > > > checker > > > > gets > > > > a path it is supposed to purge, it sets purge_path, just like > > > > now. I > > > > don't think we need to track purge_retries at all. When the purge > > > > loop > > > > is triggered, it locks the vecs lock, and finds a single path to > > > > purge, > > > > like now. The only info it needs to save is the udev device (with > > > > a > > > > new > > > > refcount taken like now) and a dup of pp->fd (You need this to > > > > guarantee > > > > that the devnode won't get reused if the device is deleted and a > > > > new > > > > one > > > > is re-added). You could also save the name, but > > > > udev_device_get_devnode() will get it for you. Like I mentioned > > > > above, > > > > you need to have a pthread cleanup handler that gets run to clean > > > > these > > > > up if they are set and the thread is cancelled, just like how the > > > > vecs > > > > lock is handled. Then purgeloop() clears pp->purge_path, drops > > > > the > > > > vecs > > > > lock, does the sysfs delete, and cleans up the purge_info by > > > > calling > > > > pthread_cleanup_pop(). Finally it loops back to reacquiring the > > > > vecs > > > > lock and finding the next path to delete. I don't really see a > > > > value > > > > in > > > > all the stats tracking. If it goes through all the paths, and > > > > doesn't > > > > find one in need of purging, it waits, just like now. > > > > > > I'd go one step further and suggest an approach similar to what we > > > do > > > in the uevent handler, for which we have a separate list of uevents > > > that shares no data with the main multipath threads. > > > > > > Roughly like this: > > > > > > enum purge_state { > > > PURGE_REQUESTED = 1, > > > PURGE_DONE, > > > PURGE_ERROR > > > }; > > > > > > struct path_to_purge { > > > struct list_head queue; > > > struct udev_device *udev; /* one ref taken while on the > > > list > > > */ > > > int fd; /* dup of pp->fd */ > > > enum purge_state state; > > > }; > > > > > > // global list_head for the queue of paths to be purged. > > > static LIST_HEAD(purge_queue); > > > > > > If multipathd detects a path to purge in the checker thread, it > > > allocates and adds a path_to_purge entry with state PURGE_REQUESTED > > > to > > > the queue and triggers the purge thread e.g. with a pthreads > > > condition > > > variable. > > > > > > The purge thread then walks through the list (without modifying the > > > list itself), sends a delete request to the kernel and atomically > > > sets > > > the state field to PURGE_DONE when successful. > > > > > > multipathd's checker thread walks the list next time and removes > > > entries that have state != PURGE_REQUESTED, closing the fd and > > > unref- > > > ing the struct udev. IOW the resource allocation/freeing for this > > > list > > > and its elements would be handled by the checker alone. The checker > > > would not immediately drop internal references to the path; this > > > would > > > be handled by the remove uevent handler as usual. > > > > > > The purge thread would never write to the list, except to the state > > > attribute of list members (which we could protect from being freed > > > via > > > RCU). > > > > > > This would make the purge thread almost an independent process. > > > > I'm a little confused about this design. Lets say a path gets removed > > while on this list. If the purgeloop doesn't modify the list, then > > how > > does the remove function know that it's safe to remove the item. > > The > > purgeloop could be in the process notifying sysfs right now, making > > use > > of the udev device. > > My thinking was that we'd not free the items as long as the state is > PURGE_REQUESTED, and the purge thread wouldn't set that state until > it's done with the item. That should be safe during normal operations. > We could of course add PURGE_INPROGRESS, mainly to protect against > races during shutdown. But AFAICS, for that it should be sufficient to > join the purge thread before freeing the the queue, which we need to do > anyway. > > > Similarly multipathd will change a path's udev > > device when it gets a new uevent. How does it know if it's safe to > > update the pointer on this list. > > Good point (not sure how wise it is that multipathd does this, but > that's a different discussion). I believe it's not that bad, though. > > The purge thread would have its own reference (the _ref() being done by > the checker before adding the element to the queue), so that the > udev_device won't be freed as long as it's working on it, even if > multipathd unref()s and forgets it. When the purge thread is done with > a path, it drops the ref before changing the state. The checker should > then _not_ drop the ref any more when it frees the queue item. If > multipathd has allocated a different struct udev_device in the > meantime, it will just continue using that. Similar for the dup'd fd. > > > Someone has to unref the purgeloops > > extra ref, and multipathd can't, once it's updated pp->udev and > > forgotten the old value. You might be able to solve this with more > > purge_states so that the purgeloop could signal that it's actually > > working on the device, but I'm pretty sure you'd need actual locking > > to > > avoid races here. Aside from the timing issue, we're adding multiple > > places where multipathd needs to go checking and modifying this list > > as > > part of other actions, which seems like it gets pretty fiddly. > > I don't see why this would be necessary. In my mind, we need one extra > function in checker_finished() that takes care of the purge queue. > checker_finished() will be aware of both unavailable and reinstated > paths. This will require some locking around the queue, I guess. > > > If we instead say that purgeloop is in charge of cleaning up the > > list, > > there would still need to be locking, because multipathd might want > > to > > take items off the list, if they become reconnected or removed before > > the thread gets around to purging. > > Right. That's why I proposed that the purge thread would never modify > the queue. > > > In theory, you could just not modify > > the list in these cases, and just let the purgeloop run on those > > paths, > > but then you would be removing paths that have been reconnected, > > This can't be avoided anyway. It's a generic risk of the purging > approach. With any design, there's a time window between the purge > thread being notified about a path and that path actually being purged, > and the path can change state any time during this time window, meaning > that we may be purging a healthy device. > > @Brian: AFAICS this problem affects your original design as well. I'm > wondering — if this happens, would we ever get this path back? > > The safest method to protect against that is the purge thread double- > checking the state before deleting, using TUR (or RTPG?). But even that > isn't safe, the path could change state in the tiny time window between > check and deletion. The purge thread would need to check the state > _after_ deleting the Linux SCSI device to avoid a race, and re-probe > the device in that case. But it isn't clear which device that TUR could > be sent to, with the Linux device being deleted. We could delete the > disk first and double-check via the sg device, maybe. > > When multipathd reinstates an item, it could walk the list and > atomically change the state to e.g. PURGE_REINSTATED. That would add > complexity; I'm not sure if it would buy us much in terms of safety > against paths being wrongly deleted. It would cover only the time > window between the checker noticing that the path is reinstated and the > deletion, which is smaller than the time window between path checks. > > The fact that multipathd will carry around references to a path that > it's going to delete in the purge thread is another issue. But I think > we agree that this is non-fatal, and that future "remove" uevents will > fix this situation. > > > > and > > possibly keeping a number of references to udev devices and fds for > > paths that are gone (assuming that deleting devices can actually > > block > > for a chunk of time). > > That can happen, yes, but do you see a solution that avoids this? > We do need to keep references to the device in some way until we > actually delete it. (Edit: reading further, I saw you other proposal... > see below). > > > Also, multipathd would need to check the list so > > paths didn't get added multiple times, if the checker re-ran before > > the > > path was purged. > > Right. Not a big issue IMO. The same applies to passing path names or > devt to the purge thread; it's even more difficult in that case, no? > > > Again, unless you were fine with keeping unnecessary > > extra references to the device. > > > > > > My original idea was to try to keep purgeloop from needing to grab > > the > > vecs lock, but I just kept coming up with extra complications that > > this > > would cause. So I settled on something where the normal multipath > > code > > just needs to set a path variable to communicate with the purgeloop, > > and > > the purgeloop just needs to hold the vecs lock while looking through > > the > > path list until it finds a device in need of purging. > > Unless I am mistaken, this has similar problems as my approach, just in > different ways. After scanning the pathvec, the purge thread will need > to hold some extra references to the paths it's going to work on, or > copy data structures; otherwise they can go away under it. Just like > above, the paths could be reinstated just after the purge loop is done > reading the list. Yes. It would still need to hold a udev ref and a fd dup, that it would clean up after doing the purge. I think I was assuming that your fix was more complicated than you intended, with multipathd removing paths from the list if the path got removed. But if you don't care about paths being on the list after multipathd removed them, your can ignore my objections. The biggest difference between your method and mine is that in exchange for needing to grab the vecs lock in purge thread, if devices get freed or restored by multipathd after they've been marked for purging, you won't bother trying to purge them. Also, you don't have to worry about putting a device on the list multiple times. While both methods are race-y, if you find the path to purge while holding the vecs lock, and then unlock and immediately purge, it's very unlikely that multipathd freed or restored the path in that gap. If you put the path on a list and it sits there while the purge loop does blocking operations until it gets to the path, there's a much bigger gap where the path state can change. Also, you do need locking between the checker modifying the list, and the purge thread reading it, but if you're willing to duplicate the list whever you add or remove from it, you could use rcu locking, like you metioned. But none of these issue is that important. Either method should work fine. > > > My other option make the purgeloop do much more work. The idea was > > that > > multipathd just put the pathname on a list when it found a > > disconnected > > path, and the purge loop would pull items off that list. This would > > still need locking, but you could probably do it with rcu locking. > > Alternatively, you could use sockets to send the pathnames to the > > purgeloop thread, to avoid locking and possibly make the purgeloop > > it's > > own process completely. > > If we do this, I'd vote for dev_t rather that path names. I thought > about it (remember, using a pipe was my first idea) and came to the > conclusion that the socket/pipe approach has no real benefit over the > linked list in shared memory. > > As you pointed out yourself, we need to keep the fd open to avoid the > device being replaced by something else in the kernel. If we just path > names or dev_t, that isn't guaranteed any more. Similar for the > udev_device, to me it seems to be a good thing to be sure not to free > it before we know that the path is gone. > > That's why I came to think that passing udev_device and fd is safer > than just a name. > > > For each path, the purgeloop would > > > > 1. open the path device (to make sure it didn't get deleted and > > reused) > > What if the device is removed in multipathd before this happens? > > > 2. check the path holders to make sure it was part of a multipath > > device > > So orphaned devices won't be deleted? AFAICS that's not in Brian's > proposal. But that's orthogonal to the rest of the discussion here. > > > 3. run a TUR check on the path to make sure that it was still > > disconnected > > 4. delete the path, if so. > > 5. close the path device. > > > > This is completely disconnected from multipath, minus the pathname > > communication, and should be able to make the right decisions about > > paths that were removed or reconnected, without sharing any resources > > that need cleaing up. It just makes the purgeloop duplicate a bunch > > of > > work that multipathd already did. > > > > Thoughts? Am I misunderstanding your design? > > I don't think we're that far away from each other. As written above, I > think that handling the resources (udev_device and fd) in multipathd > actually has a few advantages, and that I (perhaps naïvely) think it > can be cleanly handled. > > My proposal seems to be somewhat between the two you have put forward. > It's not a religious issue to me in general, and I'm certain that my > idea is far from perfect. Just that, as I wrote before, the less the > purge thread accesses multipathd's internal data structures, the > better. Yep. Now that I understand your proposal, I don't think it really matters how we solve this. -Ben > Regards, > Martin ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-12-10 19:53 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-03 16:19 [PATCH v3] mulitpath-tools Add purge capability to multipath-tools Brian Bunker 2025-12-04 19:33 ` Benjamin Marzinski 2025-12-04 20:08 ` Benjamin Marzinski 2025-12-08 12:27 ` Martin Wilck 2025-12-08 19:30 ` Benjamin Marzinski 2025-12-09 11:01 ` Martin Wilck 2025-12-10 19:53 ` Benjamin Marzinski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox