Linux LVM users
* Re: Reg thin pool chunk size and its impact on discard granularity
From: Lakshmi Narasimhan Sundararajan @ 2026-06-08 18:21 UTC
  To: Zdenek Kabelac; +Cc: linux-lvm
In-Reply-To: <81b8a5ae-050f-4e8d-bf22-551d959174c3@redhat.com>

On Mon, Jun 8, 2026 at 8:07 PM Zdenek Kabelac <zkabelac@redhat.com> wrote:
>
> Dne 04. 06. 26 v 7:00 Lakshmi Narasimhan Sundararajan napsal(a):
> > Hi LVM Team! A very good day to you.
> >
> > I have a question about the chunk size configuration supported on a
> > thin pool and its relation to the exported discard_granularity on the
> > many thin volumes from that pool.
> >
> > Per the docs, chunk sizes range from 64K to 2M and must be a multiple of 64K.
> > So I assume that any integral multiple of 64K, for example 192K (which is
> > three times 64K), is also supported, even though it is not a power of
> > 2.
> >
> > Now, I need confirmation on the correct behavior: I've observed that
> > thin volumes from a thin pool use the chunk size as the discard
> > granularity. To my understanding, the block layer code heavily assumes
> > that the discard granularity is a power of 2.
> >
> > In this case, this is not honoured, and I do not fully understand the
> > reason for or the impact of this behavior. Can you please help me
> > understand what this setting means?
>
> Hi
>
> You need to distinguish between releasing space in the thin-pool itself
> and sending a TRIM request to the underlying storage that holds the data chunks.
>
> Those are 2 'different' operations.
>
> Also, TRIM is typically block based, i.e. 4K, so the kernel can send TRIM for
> individual blocks. It then depends on how the storage itself handles this: it
> may need a whole internal stripe, e.g. 128K or 256K, to be trimmed before the
> TRIM has any real use.
>
> As you can see, several variables need to line up for a TRIM to pass all the
> way down through the whole stack to the core drive logic and release an 'area'
> within the storage.


Hi Zdenek, Thank you for your response.

Either I have not followed your response, or my question was not conveyed well.

Assume a thin pool (pwx2) exists with a chunk size of 384K.
Any thin volume created from that thin pool is exported as shown below.
```
root@ip-10-13-163-134:~# ls /dev/pwx2/200599089991732291
/dev/pwx2/200599089991732291
root@ip-10-13-163-134:~# lsblk -D  /dev/pwx2/200599089991732291
NAME                    DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
pwx2-200599089991732291        0      384K     384M         0
root@ip-10-13-163-134:~# readlink /dev/pwx2/200599089991732291
../dm-22
root@ip-10-13-163-134:~# cat /sys/block/dm-22/queue/discard_granularity
393216
```
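
As a quick arithmetic check on the value above (a sketch only; the helper names are made up for illustration and do not come from any LVM tool), 393216 bytes is 384 KiB, which factors as 3 x 128 KiB and is therefore not a power of 2:

```python
def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def split_odd_factor(n):
    """Return (odd, shift) such that n == odd << shift, for n > 0."""
    if n <= 0:
        raise ValueError("n must be positive")
    shift = 0
    while n % 2 == 0:
        n //= 2
        shift += 1
    return n, shift


# Value read from /sys/block/dm-22/queue/discard_granularity above.
granularity = 393216
print(granularity // 1024, "KiB")       # 384 KiB
print(is_power_of_two(granularity))     # False
print(split_odd_factor(granularity))    # (3, 17), i.e. 3 * 2**17
```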

Now, when blkdiscard (or a TRIM/UNMAP variant) is used, the request comes
directly from the application, which may work on the raw block device (the
thin volume in this case) or through a filesystem mounted on it.
Either way, the caller issues blkdiscard or the equivalent ioctl (TRIM, UNMAP, etc.).
The discard targets an extent that the end user has mapped on the thin volume.
The discard granularity exported for this block device is not a power of 2.
I believe this is a bug: the thin pool exports a non-power-of-2 value for
discard granularity.
That would affect every discard issued by the app or the fs.

I did not mention thin_trim in my email, but even that is
impacted, as it would issue blkdiscard internally.

And the behavior is undefined in this case!

I cannot see this as anything other than a bug in the thin pool stack, and
there is no workaround either.
Please correct me if I am mistaken.

>
> > a) thin pools can safely support any integral multiple of 64K chunk size
> > b) thin volumes always export chunk size as discard granularity
> > c) thin volumes exporting a non power of 2 discard granularity implies what?
> > d) what else is impacted here outside discard - and its behavior in this case.
> > Regular IO activity is fine, as discard granularity does not impact it.
> > e) anything else surrounding this context that is relevant to support
> > this configuration.
> >
> > As always, thank you so much for your inputs.
>
> You are possibly overestimating the need for 'trim'. Modern drives usually do
> not need it, so you may only need it when the thin-pool itself runs on top of
> another 'provisioned' space, where TRIM might matter.
>
> There is 'thin_trim', currently 'offline' only, which can walk an inactive
> thin-pool and send large 'trims', essentially the equivalent of the 'fstrim'
> logic. There is some work being done to make this tool work online, as we
> need this for proper THIN+VDO integration.
>
> On fast modern NVMe you might even get better performance with trim
> pass-through disabled...
>
> Regards
>
> Zdenek
>


* Re: Reg thin pool chunk size and its impact on discard granularity
From: Zdenek Kabelac @ 2026-06-08 14:37 UTC
  To: Lakshmi Narasimhan Sundararajan, linux-lvm
In-Reply-To: <CAFe+wq0JxH3_GZUdoKhxLM7RxbWphNs6fj0aetQ-VY4H3MXe_w@mail.gmail.com>

Dne 04. 06. 26 v 7:00 Lakshmi Narasimhan Sundararajan napsal(a):
> Hi LVM Team! A very good day to you.
> 
> I have a question about the chunk size configuration supported on a
> thin pool and its relation to the exported discard_granularity on the
> many thin volumes from that pool.
> 
> Per the docs, chunk sizes range from 64K to 2M and must be a multiple of 64K.
> So I assume that any integral multiple of 64K, for example 192K (which is
> three times 64K), is also supported, even though it is not a power of
> 2.
> 
> Now, I need confirmation on the correct behavior: I've observed that
> thin volumes from a thin pool use the chunk size as the discard
> granularity. To my understanding, the block layer code heavily assumes
> that the discard granularity is a power of 2.
> 
> In this case, this is not honoured, and I do not fully understand the
> reason for or the impact of this behavior. Can you please help me
> understand what this setting means?

Hi

You need to distinguish between releasing space in the thin-pool itself
and sending a TRIM request to the underlying storage that holds the data chunks.

Those are 2 'different' operations.

Also, TRIM is typically block based, i.e. 4K, so the kernel can send TRIM for
individual blocks. It then depends on how the storage itself handles this: it
may need a whole internal stripe, e.g. 128K or 256K, to be trimmed before the
TRIM has any real use.

As you can see, several variables need to line up for a TRIM to pass all the
way down through the whole stack to the core drive logic and release an 'area'
within the storage.
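
As a rough illustration of the first of those two operations (releasing space in the thin-pool), the sketch below assumes that only chunks lying entirely inside a discarded byte range can be released, so a small or misaligned discard may free nothing at the pool level. The function name and the example ranges are hypothetical, and this is plain arithmetic rather than the actual dm-thin code:

```python
def fully_covered_chunks(start, length, chunk):
    """Indexes of chunks that lie entirely inside [start, start + length)."""
    first = -(-start // chunk)            # round the start up to a chunk boundary
    last = (start + length) // chunk      # round the end down to a chunk boundary
    return range(first, max(first, last))


chunk = 384 * 1024  # bytes, the chunk size used as an example in this thread

# A single 4K discard covers no whole chunk.
print(list(fully_covered_chunks(0, 4096, chunk)))           # []
# Two chunks' worth of bytes, shifted by 4K, covers only one whole chunk.
print(list(fully_covered_chunks(4096, 2 * chunk, chunk)))   # [1]
# The same length, chunk aligned, covers two whole chunks.
print(list(fully_covered_chunks(chunk, 2 * chunk, chunk)))  # [1, 2]
```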

> a) thin pools can safely support any integral multiple of 64K chunk size
> b) thin volumes always export chunk size as discard granularity
> c) thin volumes exporting a non power of 2 discard granularity implies what?
> d) what else is impacted here outside discard - and its behavior in this case.
> Regular IO activity is fine, as discard granularity does not impact it.
> e) anything else surrounding this context that is relevant to support
> this configuration.
> 
> As always, thank you so much for your inputs.

You are possibly overestimating the need for 'trim'. Modern drives usually do
not need it, so you may only need it when the thin-pool itself runs on top of
another 'provisioned' space, where TRIM might matter.

There is 'thin_trim', currently 'offline' only, which can walk an inactive
thin-pool and send large 'trims', essentially the equivalent of the 'fstrim'
logic. There is some work being done to make this tool work online, as we
need this for proper THIN+VDO integration.

On fast modern NVMe you might even get better performance with trim
pass-through disabled...

Regards

Zdenek



* Reg thin pool chunk size and its impact on discard granularity
From: Lakshmi Narasimhan Sundararajan @ 2026-06-04  5:00 UTC
  To: linux-lvm

Hi LVM Team! A very good day to you.

I have a question about the chunk size configuration supported on a
thin pool and its relation to the exported discard_granularity on the
many thin volumes from that pool.

Per the docs, chunk sizes range from 64K to 2M and must be a multiple of 64K.
So I assume that any integral multiple of 64K, for example 192K (which is
three times 64K), is also supported, even though it is not a power of
2.
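
To make that assumption concrete, here is a small sketch that lists the candidate chunk sizes for the range stated above (64K to 2M in 64K steps, which is my reading of the docs and is not verified here) and separates the powers of 2 from the rest. The function names are illustrative only:

```python
KIB = 1024


def chunk_sizes(lo_kib=64, hi_kib=2048, step_kib=64):
    """Yield every candidate chunk size in bytes, from lo to hi inclusive."""
    for kib in range(lo_kib, hi_kib + 1, step_kib):
        yield kib * KIB


def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


sizes = list(chunk_sizes())
pow2 = [s // KIB for s in sizes if is_power_of_two(s)]
other = [s // KIB for s in sizes if not is_power_of_two(s)]

print(len(sizes), "candidate sizes")
print("power of 2 (KiB):", pow2)
print("not a power of 2:", len(other), "sizes, starting with", other[:3])
```

Under that range, most of the multiples of 64K are not powers of 2, so the question is not limited to one unusual setting.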

Now, I need confirmation on the correct behavior: I've observed that
thin volumes from a thin pool use the chunk size as the discard
granularity. To my understanding, the block layer code heavily assumes
that the discard granularity is a power of 2.
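
The concern can be illustrated with plain arithmetic. The sketch below is not kernel code; it only shows that the common bit-mask idiom for rounding an offset down to a granularity is valid when the granularity is a power of 2 and gives a misaligned result for a value such as 384K. Whether any particular block-layer path uses the mask form is exactly what I am asking about.

```python
def round_down_mask(offset, gran):
    """Round down with a bit mask; only correct when gran is a power of 2."""
    return offset & ~(gran - 1)


def round_down_mod(offset, gran):
    """Round down with a remainder; correct for any positive gran."""
    return offset - (offset % gran)


offset = 1_000_000
for gran in (256 * 1024, 384 * 1024):
    masked = round_down_mask(offset, gran)
    exact = round_down_mod(offset, gran)
    # A correct result must be a multiple of the granularity.
    print(gran // 1024, "KiB:",
          "mask ->", masked, "aligned" if masked % gran == 0 else "MISALIGNED",
          "| mod ->", exact)
```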

In this case, this is not honoured, and I do not fully understand the
reason for or the impact of this behavior. Can you please help me
understand what this setting means?

a) thin pools can safely support any integral multiple of 64K chunk size
b) thin volumes always export chunk size as discard granularity
c) thin volumes exporting a non power of 2 discard granularity implies what?
d) what else is impacted here outside discard - and its behavior in this case.
Regular IO activity is fine, as discard granularity does not impact it.
e) anything else surrounding this context that is relevant to support
this configuration.

As always, thank you so much for your inputs.
Best regards
LN


* Re: Reg dm thin pool metadata inconsistency
From: Lakshmi Narasimhan Sundararajan @ 2026-04-14 15:50 UTC
  To: Ming Hung Tsai; +Cc: linux-lvm
In-Reply-To: <CALjSBEuyC3U2RYj+tUVE+su6GaXi6bNp9fh3sMCOCsi0Cd-E0Q@mail.gmail.com>

On Tue, Apr 14, 2026 at 8:38 PM Ming Hung Tsai <mtsai@redhat.com> wrote:
>
> Hi Lakshmi,
>
> On Tue, Apr 14, 2026 at 8:03 PM Lakshmi Narasimhan Sundararajan
> <lsundararajan@purestorage.com> wrote:
> >
> > Hi Ming,
> >
> > Unfortunately I do not have the raw metadata; I did not collect it
> > during the customer recovery.
> > We do have discard_passdown enabled on those setups where the issue occurred.
>
> Do you mean there's a "discard_passdown" flag in the `dmsetup status` output?
> The feature counter for the thin-pool table is set to 2, with
> skip_block_zeroing as the first flag, and the second flag appears to
> be truncated.
> Could you provide the name of the second flag, if present?
>
> "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing ..."


Yes, these are SAN volumes that are provisioned for use, hence
discard_passdown is turned ON. I can confirm that.

Regards

>
>
>
> > Have you seen this problem before?
> >
> > On Tue, Apr 14, 2026 at 4:58 PM Ming Hung Tsai <mtsai@redhat.com> wrote:
> > >
> > > Hi,
> > >
> > > I have two questions posted inline.
> > >
> > > On Tue, Apr 14, 2026 at 1:59 AM Lakshmi Narasimhan Sundararajan
> > > <lsundararajan@purestorage.com> wrote:
> > > >
> > > > Good day!
> > > >
> > > > A gentle ping to hear an update from the authors here. More updates inline.
> > > >
> > > > On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan
> > > > <lsundararajan@purestorage.com> wrote:
> > > > >
> > > > > Hi LVM Team! A very good day to you all.
> > > > > [ I hope this email is the right one now]
> > > > >
> > > > > I recently experienced an outage where thin pool activation failed,
> > > > > details are as follows.
> > > > > Good news is, I was able to recover the pool through thin_repair.
> > > > > Thank goodness!
> > > > >
> > > > > There was no infra induced failure i.e. no network, disk, usage over
> > > > > limit, memory or compute being faulty or overused in any way.
> > > > > Node was running healthy for 13 days and suddenly hit this issue.
> > > > > Pool would handle I/O load (including discards), new volume
> > > > > creation/deletion, and other regular activities.
> > > > >
> > > > > I tried to identify if there is a direct known issue, but I was unable to.
> > > > > This generally seems to be some known issue, but I am unable to find a
> > > > > direct link with the same signature.
> > > > >
> > > > > a) how to induce thin pool failures at will, so thin pool does not
> > > > > activate, but repair succeeds, so  I can test this recovery in some
> > > > > controlled form.
> > > >
> > > > I have found a way, I think I can pull out the metadata xml and modify
> > > > highest transaction and rewrite the metadata and swap the pool to
> > > > recreate this condition.
> > > > Pool activation will fail and thin_repair can correct it. If there is
> > > > an easier way than my suggestion, please feel free to suggest it.
> > >
> > > Were you able to reproduce the issue on your end? I'm concerned that
> > > using metadata rebuilt from XML might not trigger the bug because the
> > > rebuilt layout differs from the original.
> > >
> > > Could you please provide the raw metadata image prior to any repairs?
> > > This will allow us to investigate the issue further.
> > >
> > >
> > > > > b) To your best knowledge this seems a known issue and fixed in a later release?
> > > > > I did my search at both kernel bugzilla and RHEL - and I am hoping you
> > > > > can help me find it. Internet searches point to errata pages, but I am
> > > > > unable to find the
> > > > > exact ticket, commit that address this. The OCP platform was running a
> > > > > recent release from RHEL.
> > > > > Linux kernel: 5.14.0-427.109.1.el9_4 RHEL 9.4 This is likely 2 years old though.
> > > >
> > > > So far I and my team have not been able to reproduce this issue, and
> > > > look to your help confirming whether
> > > > a) is this known already and is fixed!
> > > > b) whats the safest kernel to upgrade to?
> > > > c) still an open issue!
> > > >
> > > > >
> > > > > c) After spending some time reviewing thin code and the commits since
> > > > > the mentioned
> > > > > kernel from kernel.org linux.. I suspect it could be a race with
> > > > > discard and either IO or device creation/deletion on the same pool
> > > > > could cause this?
> > > > > Could the authors here, please confirm my code reading below.
> > > > > ```
> > > >
> > > > As mentioned we tried a focussed reproducer around this, but unable to
> > > > trigger the issue.
> > > > There are volume creation/deletions, snapshot creations and deletions,
> > > > discards and regular IO at any point on the thin pool.
> > > > And in addition, there would be calls to reserve/release the thin
> > > > metadata to capture diff for backup between volumes.
> > > > These are serialized at our app layer and I also see these are
> > > > serialized within lvm layer too.
> > > >
> > > > Our volume deletions are 2 phased, based on our earlier discussion in
> > > > this thread, we had observed high IO latency when volumes are deleted
> > > > and the suggestion from this team was to discard and then delete
> > > > volumes to keep the deletion time short.
> > > >
> > > > I hope this is giving enough context to understand this better.
> > > > Unfortunately, since I am unable to reproduce this and have this
> > > > sighted now twice at customer, I have no more datapoints to add.
> > > > Would be willing to hear out if you have any suggestions I can pursue in house.
> > > >
> > > > Best regards
> > > >
> > > >
> > > > > *** phase 1 - userspace issues blkdiscard on thin volumes ***
> > > > >   dm-thin.c : thin_bio_map()
> > > > >     → detects REQ_OP_DISCARD
> > > > >     → thin_defer_bio_with_throttle(tc, bio)
> > > > >       → adds bio to tc->deferred_bio_list        // QUEUED, not processed
> > > > >       → wakes pool worker thread
> > > > >
> > > > >   dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
> > > > >     → process_deferred_bios()
> > > > >       → process_thin_deferred_bios()
> > > > >         → process_discard_bio()
> > > > >           → creates mapping, adds to pool->prepared_discards
> > > > >     → process_prepared(pool->prepared_discards)
> > > > >       → process_prepared_discard_no_passdown(m):
> > > > >         → dm_thin_remove_range(tc->td, begin, end)
> > > > >             [dm-thin-metadata.c]
> > > > >           → dm_btree_remove_leaves()
> > > > >               [dm-btree-remove.c]
> > > > >             → data_block_dec()                    // for each data block
> > > > >                 [dm-thin-metadata.c]
> > > > >               → dm_sm_dec_blocks()                // DECREMENTS refcount
> > > > >                   [dm-space-map-common.c]
> > > > >
> > > > > *** phase 2: these steps may still be IN PROGRESS or QUEUED when
> > > > > userspace deletes the thin volume ***
> > > > >
> > > > >   dm-thin.c : thin_dtr()                          // dmsetup remove
> > > > >     → list_del_rcu(&tc->list)                     // removes from
> > > > >                                                   //   pool->active_thins
> > > > >     → synchronize_rcu()
> > > > >     → dm_pool_close_thin_device(tc->td)           // open_count--
> > > > >     → kfree(tc)                                   // tc FREED
> > > > >
> > > > >     *** does NOT flush pool workqueue ***          ← GAP 1
> > > > >     *** does NOT drain prepared_discards ***       ← GAP 2
> > > > >
> > > > >   dm-thin.c : process_delete_mesg()               // dmsetup message
> > > > >     → dm_pool_delete_thin_device(pool->pmd, dev_id)
> > > > >         [dm-thin-metadata.c : __delete_device()]
> > > > >       → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
> > > > >           [dm-btree-remove.c]                      //   btree
> > > > >         → subtree_dec()                            // cascades into:
> > > > >             [dm-thin-metadata.c]
> > > > >           → dm_btree_del()                         // walks ALL leaves
> > > > >               [dm-btree.c]
> > > > >             → data_block_dec() for EVERY remaining block
> > > > >                 [dm-thin-metadata.c]
> > > > >               → dm_sm_dec_blocks()                 // DECREMENTS refcount
> > > > >                   [dm-space-map-common.c]          //   for ALL blocks
> > > > >
> > > > > *** phase 3: KERNEL (worker thread, still running from phase 1) ***
> > > > >   dm-thin.c : do_worker()                         // ASYNC, still running
> > > > >     → process_prepared(pool->prepared_discards)
> > > > >       → process_prepared_discard_no_passdown(m):
> > > > >         → m->tc points to FREED tc                // ← use-after-free risk
> > > > >         → dm_thin_remove_range(tc->td, begin, end)
> > > > >             [dm-thin-metadata.c]
> > > > >           → dm_btree_remove_leaves()
> > > > >               [dm-btree-remove.c]
> > > > >             → data_block_dec()                    // SAME blocks already
> > > > >                 [dm-thin-metadata.c]              //   decremented in
> > > > >               → dm_sm_dec_blocks()                //   Phase 2!
> > > > >                   [dm-space-map-common.c]
> > > > >
> > > > >                 ┌──────────────────────────────────────────────────┐
> > > > >                   sm_ll_dec_bitmap():
> > > > >                     old = sm_lookup_bitmap(ic->bitmap, bit);
> > > > >                     switch (old) {
> > > > >                     case 0:  // ← refcount ALREADY 0
> > > > >                       DMERR("unable to decrement block");
> > > > >                       return -EINVAL;  // -22
> > > > >                     }
> > > > >                                  [dm-space-map-common.c]
> > > > >                 └──────────────────────────────────────────────────┘
> > > > >
> > > > >                           ▼
> > > > >                 dm_tm_shadow_block() fails (corrupted space map)
> > > > >                     [dm-transaction-manager.c]
> > > > >
> > > > >                           ▼
> > > > >                 dm_pool_inc_data_range() fails with -EINVAL (-22)
> > > > >                     [dm-thin-metadata.c]
> > > > >
> > > > >                           ▼
> > > > >                 metadata_operation_failed(pool, "dm_pool_inc_data_range")
> > > > >                     [dm-thin.c]
> > > > >
> > > > >                           ▼
> > > > >                 set_pool_mode(pool, PM_READ_ONLY)
> > > > >                     [dm-thin.c]
> > > > >
> > > > >                 *** POOL IS NOW DEAD ***
> > > > > ```
> > > > >
> > > > >
> > > > >
> > > > > As always, many thanks for your help.
> > > > >
> > > > >
> > > > > # issue unable to activate thin pool
> > > > > ```
> > > > > [Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to
> > > > > decrement block
> > > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > > > > decrement block
> > > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > > > > dm_tm_shadow_block() failed
> > > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > > > > decrement block
> > > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > > > > dm_tm_shadow_block() failed
> > > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > > > > decrement block
> > > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > > > > dm_tm_shadow_block() failed
> > > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > > > > decrement block
> > > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > > > > dm_tm_shadow_block() failed
> > > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > > > > decrement block
> > > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > > > > dm_tm_shadow_block() failed
> > > > > ```
> > > > >
> > > > > # host and lvm tools version
> > > > > ```
> > > > > uname -a
> > > > > Linux kernel: 5.14.0-427.109.1.el9_4
> > > > > RHEL 9.4
> > > > >
> > > > > lvm version
> > > > > 2.03.23(2) (2023-11-21)
> > > > > library: 1.02.197 (2023-11-21)
> > > > > driver: 4.48.1
> > > > > ```
> > > > >
> > > > > Below are references to the node block layer.
> > > > > There was IO, thin volume creations and deletions, IO includes discards too.
> > > > > ```
> > > > > [root@root core]# lvs -a pwx1
> > > > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > > > >   LV                  VG   Attr       LSize   Pool   Origin
> > > > >   Data%  Meta%  Move Log Cpy%Sync Convert
> > > > >   1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > > > >   103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > > > >   1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > > > >   1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
> > > > >   1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > > > >   136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > > > >   218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
> > > > >   404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
> > > > >   440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
> > > > >   462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > > > >   519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > > > >   527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > > > >   537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
> > > > >   569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > > > >   594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > > > >   660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > > >   702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > > > >   73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > > >   793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > > > >   79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
> > > > >   865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
> > > > >   866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > > > >   941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > > >   960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > > > >   [lvol0_pmspare]     pwx1 ewi-------   2.00g
> > > > >   pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
> > > > >   pxpool              pwx1 twi-aot---   1.54t
> > > > >   43.59  5.06 <<< very low tmeta util.
> > > > >   [pxpool_tdata]      pwx1 Twi-ao----   1.54t
> > > > >   [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
> > > > >   pxreserve           pwx1 -wi------k  15.00g
> > > > > [root@root core]#
> > > > > [root@root core]# vgs pwx1
> > > > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > > > >   VG   #PV #LV #SN Attr   VSize VFree
> > > > >   pwx1   1  27   0 wz--n- 1.56t    0
> > > > > [root@root core]# lsblk -s /dev/pwx1/1004123733318649769
> > > > > NAME                                         MAJ:MIN RM  SIZE RO TYPE
> > > > > MOUNTPOINTS
> > > > > pwx1-1004123733318649769                     253:107  0   50G  0 lvm
> > > > > └─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
> > > > >   ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
> > > > >   │ └─md126                                    9:126  0  1.6T  0 raid0
> > > > >   │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > > > >   │     ├─nvme4n2                            259:5    0  1.6T  0 disk
> > > > >   │     ├─nvme5n2                            259:8    0  1.6T  0 disk
> > > > >   │     └─nvme6n2                            259:11   0  1.6T  0 disk
> > > > >   └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
> > > > >     └─md126                                    9:126  0  1.6T  0 raid0
> > > > >       └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > > > >         ├─nvme4n2                            259:5    0  1.6T  0 disk
> > > > >         ├─nvme5n2                            259:8    0  1.6T  0 disk
> > > > >         └─nvme6n2                            259:11   0  1.6T  0 disk
> > > > > [root@root core]# ls -al /dev/md/pwx1
> > > > > lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
> > > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
> > > > > 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing
> > >
> > > The second feature flag was truncated. Does the thin-pool have
> > > no_discard_passdown enabled?
> > >
> > >
> > > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
> > > > > 0 1008459776 linear 9:126 35653632
> > > > > 1008459776 629145600 linear 9:126 1048307712
> > > > > 1637605376 1673527296 linear 9:126 1681647616
> > > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
> > > > > 0 4194304 linear 9:126 1044113408
> > > > > 4194304 4194304 linear 9:126 1677453312
> > > > > [root@root core]#
> > > > > [root@root core]# dmsetup table --target multipath
> > > > > 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
> > > > > 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
> > > > > 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
> > > > > 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
> > > > > 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
> > > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1
> > > > > service-time 0 1 2 259:2 1 1
> > > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1
> > > > > service-time 0 1 2 259:0 1 1
> > > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1
> > > > > service-time 0 1 2 259:3 1 1
> > > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1
> > > > > service-time 0 1 2 259:1 1 1
> > > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3
> > > > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > > > > 259:4 1 259:7 1 259:10 1
> > > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3
> > > > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > > > > 259:5 1 259:8 1 259:11 1
> > > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3
> > > > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > > > > 259:6 1 259:9 1 259:12 1
> > > > > [root@root core]# dmsetup status --target multipath
> > > > > 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
> > > > > 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
> > > > > 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
> > > > > 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
> > > > > 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
> > > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1
> > > > > 1 A 0 1 2 259:2 A 0 0 1
> > > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1
> > > > > 1 A 0 1 2 259:0 A 0 0 1
> > > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1
> > > > > 1 A 0 1 2 259:3 A 0 0 1
> > > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1
> > > > > 1 A 0 1 2 259:1 A 0 0 1
> > > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0
> > > > > 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
> > > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1
> > > > > 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
> > > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1
> > > > > 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
> > > > > [root@root core]#
> > > > > [root@root core]# mdadm -D /dev/md/pwx1
> > > > > /dev/md/pwx1:
> > > > >            Version : 1.2
> > > > >      Creation Time : Tue Mar 17 15:29:31 2026
> > > > >         Raid Level : raid0
> > > > >         Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
> > > > >       Raid Devices : 1
> > > > >      Total Devices : 1
> > > > >        Persistence : Superblock is persistent
> > > > >
> > > > >        Update Time : Mon Mar 23 20:52:51 2026
> > > > >              State : clean
> > > > >     Active Devices : 1
> > > > >    Working Devices : 1
> > > > >     Failed Devices : 0
> > > > >      Spare Devices : 0
> > > > >
> > > > >         Chunk Size : 1024K
> > > > >
> > > > > Consistency Policy : none
> > > > >
> > > > >               Name : any:pwx1
> > > > >               UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
> > > > >             Events : 16
> > > > >
> > > > >     Number   Major   Minor   RaidDevice State
> > > > >        0     253       10        0      active sync   /dev/dm-10
> > > > > [root@root core]#
> > > > > ```
> > > >
> > >
> > >
> >
>
>


* Re: Reg dm thin pool metadata inconsistency
From: Ming Hung Tsai @ 2026-04-14 15:07 UTC
  To: linux-lvm; +Cc: Lakshmi Narasimhan Sundararajan
In-Reply-To: <CAFe+wq25Few1Vx-TzKgCuFZ6tnVw=nuOFdhsR99ARM5PADuSDQ@mail.gmail.com>

Hi Lakshmi,

On Tue, Apr 14, 2026 at 8:03 PM Lakshmi Narasimhan Sundararajan
<lsundararajan@purestorage.com> wrote:
>
> Hi Ming,
>
> Unfortunately I do not have the raw metadata; I did not collect it
> during the customer recovery.
> We do have discard_passdown enabled on those setups where the issue occurred.

Do you mean there's a "discard_passdown" flag in the `dmsetup status` output?
The feature counter for the thin-pool table is set to 2, with
skip_block_zeroing as the first flag, and the second flag appears to
be truncated.
Could you provide the name of the second flag, if present?
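
For reference, a sketch of how the thin-pool table line can be split into fields to compare the feature count with the flags actually present. The field order assumed here (metadata dev, data dev, data block size in 512-byte sectors, low water mark, feature count, feature args) follows my understanding of the thin-pool target table format; the function name and the returned dictionary keys are hypothetical.

```python
SECTOR_BYTES = 512


def parse_thin_pool_table(line):
    """Split one thin-pool table line into named fields."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    n_features = int(tokens[7]) if len(tokens) > 7 else 0
    features = tokens[8:8 + n_features]
    return {
        "start": int(tokens[0]),
        "length_sectors": int(tokens[1]),
        "metadata_dev": tokens[3],
        "data_dev": tokens[4],
        "chunk_bytes": int(tokens[5]) * SECTOR_BYTES,
        "low_water_mark": int(tokens[6]),
        "features": features,
        # Non-zero when the line announces more flags than it carries.
        "missing_features": n_features - len(features),
    }


# The line as it appeared in the earlier dmsetup output, cut after the first flag.
line = "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing"
info = parse_thin_pool_table(line)
print(info["chunk_bytes"] // 1024, "KiB chunk")             # 64 KiB chunk
print(info["features"], "missing:", info["missing_features"])
```

A non-zero `missing_features` is what suggests the second flag was lost when the output was pasted.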

"0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing ..."

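For reference, the table line above can be split mechanically. This is a minimal sketch, assuming the field order given in the kernel's thin-provisioning documentation (start, length, target name, metadata dev, data dev, data block size in sectors, low water mark, feature count, feature flags). The class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List

SECTOR_BYTES = 512  # device-mapper table fields are in 512-byte sectors


@dataclass
class ThinPoolTable:
    start: int
    length: int
    metadata_dev: str
    data_dev: str
    data_block_sectors: int
    low_water_mark: int
    declared_features: int
    features: List[str]

    @property
    def chunk_bytes(self) -> int:
        return self.data_block_sectors * SECTOR_BYTES

    @property
    def truncated(self) -> bool:
        # True when fewer flags are present than the counter announces.
        return len(self.features) < self.declared_features


def parse_thin_pool_table(line: str) -> ThinPoolTable:
    tok = line.split()
    if len(tok) < 8 or tok[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    declared = int(tok[7])
    # Ignore a literal "..." left behind where the line was cut off.
    flags = [t for t in tok[8:8 + declared] if t != "..."]
    return ThinPoolTable(int(tok[0]), int(tok[1]), tok[3], tok[4],
                         int(tok[5]), int(tok[6]), declared, flags)


table = parse_thin_pool_table(
    "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing ...")
print(table.features, table.truncated, table.chunk_bytes)
```

For the line quoted above this reports one flag where two are declared, and a chunk size of 128 sectors, i.e. 64 KiB.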


> Have you seen this problem before?
>
> On Tue, Apr 14, 2026 at 4:58 PM Ming Hung Tsai <mtsai@redhat.com> wrote:
> >
> > Hi,
> >
> > I have two questions posted inline.
> >
> > On Tue, Apr 14, 2026 at 1:59 AM Lakshmi Narasimhan Sundararajan
> > <lsundararajan@purestorage.com> wrote:
> > >
> > > Good day!
> > >
> > > A gentle ping to hear an update from the authors here. More updates inline.
> > >
> > > On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan
> > > <lsundararajan@purestorage.com> wrote:
> > > >
> > > > Hi LVM Team! A very good day to you all.
> > > > [ I hope this email is the right one now]
> > > >
> > > > I recently experienced an outage where thin pool activation failed,
> > > > details are as follows.
> > > > Good news is, I was able to recover the pool through thin_repair.
> > > > Thank goodness!
> > > >
> > > > There was no infrastructure-induced failure, i.e. no network or disk fault,
> > > > no usage over limit, and no faulty or overused memory or compute.
> > > > Node was running healthy for 13 days and suddenly hit this issue.
> > > > Pool would handle I/O load (including discards), new volume
> > > > creation/deletion, and other regular activities.
> > > >
> > > > I tried to identify whether this is a known issue. It looks like one,
> > > > but I am unable to find a report with the same signature.
> > > >
> > > > a) How can I induce thin pool failures at will, such that the pool does not
> > > > activate but repair succeeds, so that I can test this recovery in a
> > > > controlled way?
> > >
> > > I think I have found a way: dump the metadata to XML, modify the
> > > highest transaction ID, rewrite the metadata, and swap it into the pool to
> > > recreate this condition.
> > > Pool activation will then fail and thin_repair can correct it. If there is
> > > an easier way than this, please feel free to suggest it.
> >
> > Were you able to reproduce the issue on your end? I'm concerned that
> > using metadata rebuilt from XML might not trigger the bug because the
> > rebuilt layout differs from the original.
> >
> > Could you please provide the raw metadata image prior to any repairs?
> > This will allow us to investigate the issue further.
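The XML edit proposed above could look roughly like the sketch below. It assumes the thin_dump XML layout, where the root `<superblock>` element carries a `transaction` attribute, and it only rewrites that one attribute. The helper name `bump_transaction` and the sample document are hypothetical. The concern stands that a pool restored from XML has a different on-disk layout from the original, so this exercises the activation-failure and repair path rather than the suspected bug.

```python
import xml.etree.ElementTree as ET


def bump_transaction(xml_text: str, delta: int = 1) -> str:
    """Return the XML with the superblock transaction id increased by delta."""
    root = ET.fromstring(xml_text)
    if root.tag != "superblock" or root.get("transaction") is None:
        raise ValueError("expected a thin_dump <superblock> root element")
    root.set("transaction", str(int(root.get("transaction")) + delta))
    return ET.tostring(root, encoding="unicode")


# Hypothetical, tiny dump; a real one comes from thin_dump on inactive metadata.
sample = (
    '<superblock uuid="" time="0" transaction="7" '
    'data_block_size="128" nr_data_blocks="1024">'
    '<device dev_id="1" mapped_blocks="1" transaction="0" '
    'creation_time="0" snap_time="0">'
    '<single_mapping origin_block="0" data_block="5" time="0"/>'
    '</device></superblock>'
)
edited = bump_transaction(sample)
print(ET.fromstring(edited).get("transaction"))
```

The edited XML would then be written back with thin_restore (`-i` input, `-o` output device) onto a scratch metadata device, never onto a pool that holds real data.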
> >
> >
> > > > b) To the best of your knowledge, is this a known issue that has been
> > > > fixed in a later release?
> > > > I searched both the kernel bugzilla and RHEL, and I am hoping you
> > > > can help me find it. Internet searches point to errata pages, but I am
> > > > unable to find the exact ticket or commit that addresses this. The OCP
> > > > platform was running a recent release from RHEL.
> > > > Linux kernel: 5.14.0-427.109.1.el9_4 RHEL 9.4 This is likely 2 years old though.
> > >
> > > So far my team and I have not been able to reproduce this issue, and
> > > we look to your help in confirming:
> > > a) Is this already known and fixed?
> > > b) What is the safest kernel to upgrade to?
> > > c) Or is it still an open issue?
> > >
> > > >
> > > > c) After reviewing the thin code and the commits in kernel.org Linux
> > > > since the kernel mentioned above, I suspect that a race between
> > > > discard and either IO or device creation/deletion on the same pool
> > > > could cause this.
> > > > Could the authors here please confirm my code reading below?
> > > > ```
> > >
> > > As mentioned, we tried a focused reproducer around this but were unable to
> > > trigger the issue.
> > > At any point there are volume creations and deletions, snapshot creations
> > > and deletions, discards, and regular IO on the thin pool.
> > > In addition, there are calls to reserve/release the thin
> > > metadata to capture the diff between volumes for backup.
> > > These are serialized at our app layer, and I see that they are also
> > > serialized within the lvm layer.
> > >
> > > Our volume deletions are two-phased. In our earlier discussion in
> > > this thread we had observed high IO latency when volumes are deleted,
> > > and the suggestion from this team was to discard and then delete
> > > volumes to keep the deletion time short.
> > >
> > > I hope this gives enough context to understand this better.
> > > Unfortunately, since I am unable to reproduce this and it has now been
> > > sighted twice at customer sites, I have no more data points to add.
> > > I would be glad to hear any suggestions I can pursue in house.
> > >
> > > Best regards
> > >
> > >
> > > > *** phase 1 - userspace issues blkdiscard on thin volumes ***
> > > >   dm-thin.c : thin_bio_map()
> > > >     → detects REQ_OP_DISCARD
> > > >     → thin_defer_bio_with_throttle(tc, bio)
> > > >       → adds bio to tc->deferred_bio_list        // QUEUED, not processed
> > > >       → wakes pool worker thread
> > > >
> > > >   dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
> > > >     → process_deferred_bios()
> > > >       → process_thin_deferred_bios()
> > > >         → process_discard_bio()
> > > >           → creates mapping, adds to pool->prepared_discards
> > > >     → process_prepared(pool->prepared_discards)
> > > >       → process_prepared_discard_no_passdown(m):
> > > >         → dm_thin_remove_range(tc->td, begin, end)
> > > >             [dm-thin-metadata.c]
> > > >           → dm_btree_remove_leaves()
> > > >               [dm-btree-remove.c]
> > > >             → data_block_dec()                    // for each data block
> > > >                 [dm-thin-metadata.c]
> > > >               → dm_sm_dec_blocks()                // DECREMENTS refcount
> > > >                   [dm-space-map-common.c]
> > > >
> > > > *** phase 2: the steps above may still be IN PROGRESS or QUEUED when
> > > > userspace deletes the thin volume ***
> > > >
> > > >   dm-thin.c : thin_dtr()                          // dmsetup remove
> > > >     → list_del_rcu(&tc->list)                     // removes from
> > > >                                                   //   pool->active_thins
> > > >     → synchronize_rcu()
> > > >     → dm_pool_close_thin_device(tc->td)           // open_count--
> > > >     → kfree(tc)                                   // tc FREED
> > > >
> > > >     *** does NOT flush pool workqueue ***          ← GAP 1
> > > >     *** does NOT drain prepared_discards ***       ← GAP 2
> > > >
> > > >   dm-thin.c : process_delete_mesg()               // dmsetup message
> > > >     → dm_pool_delete_thin_device(pool->pmd, dev_id)
> > > >         [dm-thin-metadata.c : __delete_device()]
> > > >       → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
> > > >           [dm-btree-remove.c]                      //   btree
> > > >         → subtree_dec()                            // cascades into:
> > > >             [dm-thin-metadata.c]
> > > >           → dm_btree_del()                         // walks ALL leaves
> > > >               [dm-btree.c]
> > > >             → data_block_dec() for EVERY remaining block
> > > >                 [dm-thin-metadata.c]
> > > >               → dm_sm_dec_blocks()                 // DECREMENTS refcount
> > > >                   [dm-space-map-common.c]          //   for ALL blocks
> > > >
> > > > *** phase 3: KERNEL (worker thread, still running from phase 1) ***
> > > >   dm-thin.c : do_worker()                         // ASYNC, still running
> > > >     → process_prepared(pool->prepared_discards)
> > > >       → process_prepared_discard_no_passdown(m):
> > > >         → m->tc points to FREED tc                // ← use-after-free risk
> > > >         → dm_thin_remove_range(tc->td, begin, end)
> > > >             [dm-thin-metadata.c]
> > > >           → dm_btree_remove_leaves()
> > > >               [dm-btree-remove.c]
> > > >             → data_block_dec()                    // SAME blocks already
> > > >                 [dm-thin-metadata.c]              //   decremented in
> > > >               → dm_sm_dec_blocks()                //   Phase 2!
> > > >                   [dm-space-map-common.c]
> > > >
> > > >                 ┌──────────────────────────────────────────────────┐
> > > >                   sm_ll_dec_bitmap():
> > > >                     old = sm_lookup_bitmap(ic->bitmap, bit);
> > > >                     switch (old) {
> > > >                     case 0:  // ← refcount ALREADY 0
> > > >                       DMERR("unable to decrement block");
> > > >                       return -EINVAL;  // -22
> > > >                     }
> > > >                                  [dm-space-map-common.c]
> > > >                 └──────────────────────────────────────────────────┘
> > > >
> > > >                           ▼
> > > >                 dm_tm_shadow_block() fails (corrupted space map)
> > > >                     [dm-transaction-manager.c]
> > > >
> > > >                           ▼
> > > >                 dm_pool_inc_data_range() fails with -EINVAL (-22)
> > > >                     [dm-thin-metadata.c]
> > > >
> > > >                           ▼
> > > >                 metadata_operation_failed(pool, "dm_pool_inc_data_range")
> > > >                     [dm-thin.c]
> > > >
> > > >                           ▼
> > > >                 set_pool_mode(pool, PM_READ_ONLY)
> > > >                     [dm-thin.c]
> > > >
> > > >                 *** POOL IS NOW DEAD ***
> > > > ```
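To make the suspected ordering in the trace above concrete, here is a toy model. It is not kernel code: `ToySpaceMap` and `ToyThin` are invented stand-ins for the space map and a thin device's mapping tree. It only shows that a range removal which consults the current mappings after a delete is harmless, while replaying a stale block list decrements the same blocks twice and trips the check. Whether the kernel can actually reach the stale case is exactly the open question.

```python
class ToySpaceMap:
    """Reference counts for data blocks; a toy stand-in, not kernel code."""

    def __init__(self):
        self.refcount = {}

    def inc(self, block):
        self.refcount[block] = self.refcount.get(block, 0) + 1

    def dec(self, block):
        if self.refcount.get(block, 0) == 0:
            # Mirrors the "unable to decrement block" condition in the trace.
            raise ValueError("unable to decrement block %d" % block)
        self.refcount[block] -= 1


class ToyThin:
    def __init__(self, sm, blocks):
        self.sm = sm
        self.mapped = set(blocks)
        for b in blocks:
            sm.inc(b)

    def remove_range(self, blocks, trust_stale_list=False):
        # A sane removal only touches blocks that are still mapped.
        for b in blocks:
            if trust_stale_list or b in self.mapped:
                self.mapped.discard(b)
                self.sm.dec(b)

    def delete(self):
        # Device deletion drops the reference of every remaining block.
        for b in sorted(self.mapped):
            self.sm.dec(b)
        self.mapped.clear()


def run(stale):
    sm = ToySpaceMap()
    thin = ToyThin(sm, [10, 11, 12])
    prepared = [10, 11]          # discard prepared before the delete
    thin.delete()                # delete wins the race
    try:
        thin.remove_range(prepared, trust_stale_list=stale)
    except ValueError as err:
        return str(err)
    return "ok"


print(run(stale=False))
print(run(stale=True))
```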
> > > >
> > > >
> > > >
> > > > As always, many thanks for your help.
> > > >
> > > >
> > > > # issue unable to activate thin pool
> > > > ```
> > > > [Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to decrement block
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to decrement block
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > > ```
> > > >
> > > > # host and lvm tools version
> > > > ```
> > > > uname -a
> > > > Linux kernel: 5.14.0-427.109.1.el9_4
> > > > RHEL 9.4
> > > >
> > > > lvm version
> > > > 2.03.23(2) (2023-11-21)
> > > > library: 1.02.197 (2023-11-21)
> > > > driver: 4.48.1
> > > > ```
> > > >
> > > > Below are details of the node's block layer.
> > > > The workload included IO (with discards) and thin volume creations and deletions.
> > > > ```
> > > > [root@root core]# lvs -a pwx1
> > > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > > >   LV                  VG   Attr       LSize   Pool   Origin              Data%  Meta%  Move Log Cpy%Sync Convert
> > > >   1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > > >   103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > > >   1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > > >   1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
> > > >   1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > > >   136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > > >   218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
> > > >   404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
> > > >   440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
> > > >   462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > > >   519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > > >   527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > > >   537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
> > > >   569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > > >   594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > > >   660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > >   702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > > >   73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > >   793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > > >   79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
> > > >   865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
> > > >   866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > > >   941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > >   960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > > >   [lvol0_pmspare]     pwx1 ewi-------   2.00g
> > > >   pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
> > > >   pxpool              pwx1 twi-aot---   1.54t                            43.59  5.06 <<< very low tmeta util.
> > > >   [pxpool_tdata]      pwx1 Twi-ao----   1.54t
> > > >   [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
> > > >   pxreserve           pwx1 -wi------k  15.00g
> > > > [root@root core]#
> > > > [root@root core]# vgs pwx1
> > > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > > >   VG   #PV #LV #SN Attr   VSize VFree
> > > >   pwx1   1  27   0 wz--n- 1.56t    0
> > > > [root@root core]# lsblk -s /dev/pwx1/1004123733318649769
> > > > NAME                                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
> > > > pwx1-1004123733318649769                     253:107  0   50G  0 lvm
> > > > └─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
> > > >   ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
> > > >   │ └─md126                                    9:126  0  1.6T  0 raid0
> > > >   │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > > >   │     ├─nvme4n2                            259:5    0  1.6T  0 disk
> > > >   │     ├─nvme5n2                            259:8    0  1.6T  0 disk
> > > >   │     └─nvme6n2                            259:11   0  1.6T  0 disk
> > > >   └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
> > > >     └─md126                                    9:126  0  1.6T  0 raid0
> > > >       └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > > >         ├─nvme4n2                            259:5    0  1.6T  0 disk
> > > >         ├─nvme5n2                            259:8    0  1.6T  0 disk
> > > >         └─nvme6n2                            259:11   0  1.6T  0 disk
> > > > [root@root core]# ls -al /dev/md/pwx1
> > > > lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
> > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
> > > > 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing
> >
> > The second feature flag was truncated. Does the thin pool have
> > no_discard_passdown enabled?
> >
> >
> > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
> > > > 0 1008459776 linear 9:126 35653632
> > > > 1008459776 629145600 linear 9:126 1048307712
> > > > 1637605376 1673527296 linear 9:126 1681647616
> > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
> > > > 0 4194304 linear 9:126 1044113408
> > > > 4194304 4194304 linear 9:126 1677453312
> > > > [root@root core]#
> > > > [root@root core]# dmsetup table --target multipath
> > > > 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
> > > > 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
> > > > 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
> > > > 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
> > > > 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
> > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:2 1 1
> > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:0 1 1
> > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:3 1 1
> > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:1 1 1
> > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:4 1 259:7 1 259:10 1
> > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:5 1 259:8 1 259:11 1
> > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:6 1 259:9 1 259:12 1
> > > > [root@root core]# dmsetup status --target multipath
> > > > 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
> > > > 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
> > > > 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
> > > > 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
> > > > 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
> > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:2 A 0 0 1
> > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:0 A 0 0 1
> > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:3 A 0 0 1
> > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:1 A 0 0 1
> > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
> > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
> > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
> > > > [root@root core]#
> > > > [root@root core]# mdadm -D /dev/md/pwx1
> > > > /dev/md/pwx1:
> > > >            Version : 1.2
> > > >      Creation Time : Tue Mar 17 15:29:31 2026
> > > >         Raid Level : raid0
> > > >         Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
> > > >       Raid Devices : 1
> > > >      Total Devices : 1
> > > >        Persistence : Superblock is persistent
> > > >
> > > >        Update Time : Mon Mar 23 20:52:51 2026
> > > >              State : clean
> > > >     Active Devices : 1
> > > >    Working Devices : 1
> > > >     Failed Devices : 0
> > > >      Spare Devices : 0
> > > >
> > > >         Chunk Size : 1024K
> > > >
> > > > Consistency Policy : none
> > > >
> > > >               Name : any:pwx1
> > > >               UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
> > > >             Events : 16
> > > >
> > > >     Number   Major   Minor   RaidDevice State
> > > >        0     253       10        0      active sync   /dev/dm-10
> > > > [root@root core]#
> > > > ```
> > >
> >
> >
>



* Re: Reg dm thin pool metadata inconsistency
From: Lakshmi Narasimhan Sundararajan @ 2026-04-14 12:03 UTC (permalink / raw)
  To: Ming Hung Tsai; +Cc: linux-lvm
In-Reply-To: <CALjSBEurs6mNvR1YdoF-UVEjJUpycahcOxU2Ok6YYh-jBJNwKw@mail.gmail.com>

Hi Ming,

Unfortunately, I do not have the raw metadata; it was not collected during
the customer recovery.
We do have discard_passdown enabled on the setups where the issue occurred.

Have you seen this problem before?
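
On where to look for that setting: as far as I can tell from the kernel's thin-provisioning documentation, the thin-pool table only names the non-default `no_discard_passdown`, while the `dmsetup status` line reports one of `discard_passdown`, `no_discard_passdown` or `ignore_discard`. Below is a minimal sketch for pulling that token out of a status line. The function name is made up and the sample line is hypothetical (its usage counters are invented), not output from the affected node.

```python
from typing import Optional

DISCARD_STATES = ("discard_passdown", "no_discard_passdown", "ignore_discard")


def discard_state(status_line: str) -> Optional[str]:
    """Return the discard token from a thin-pool status line, if any."""
    tok = status_line.split()
    if len(tok) < 3 or tok[2] != "thin-pool":
        raise ValueError("not a thin-pool status line")
    for t in tok[3:]:
        if t in DISCARD_STATES:
            return t
    return None  # e.g. a pool in failed mode reports only "Fail"


# Hypothetical status line; the used-block counters are made up.
sample = ("0 3311132672 thin-pool 42 1000/1048576 5000/25868224 - rw "
          "discard_passdown queue_if_no_space - 1024")
print(discard_state(sample))
```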

On Tue, Apr 14, 2026 at 4:58 PM Ming Hung Tsai <mtsai@redhat.com> wrote:
>
> Hi,
>
> I have two questions posted inline.
>
> On Tue, Apr 14, 2026 at 1:59 AM Lakshmi Narasimhan Sundararajan
> <lsundararajan@purestorage.com> wrote:
> >
> > Good day!
> >
> > A gentle ping to hear an update from the authors here. More updates inline.
> >
> > On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan
> > <lsundararajan@purestorage.com> wrote:
> > >
> > > Hi LVM Team! A very good day to you all.
> > > [ I hope this email is the right one now]
> > >
> > > I recently experienced an outage where thin pool activation failed,
> > > details are as follows.
> > > Good news is, I was able to recover the pool through thin_repair.
> > > Thank goodness!
> > >
> > > There was no infrastructure-induced failure, i.e. no network or disk fault,
> > > no usage over limit, and no faulty or overused memory or compute.
> > > Node was running healthy for 13 days and suddenly hit this issue.
> > > Pool would handle I/O load (including discards), new volume
> > > creation/deletion, and other regular activities.
> > >
> > > I tried to identify whether this is a known issue. It looks like one,
> > > but I am unable to find a report with the same signature.
> > >
> > > a) How can I induce thin pool failures at will, such that the pool does not
> > > activate but repair succeeds, so that I can test this recovery in a
> > > controlled way?
> >
> > I think I have found a way: dump the metadata to XML, modify the
> > highest transaction ID, rewrite the metadata, and swap it into the pool to
> > recreate this condition.
> > Pool activation will then fail and thin_repair can correct it. If there is
> > an easier way than this, please feel free to suggest it.
>
> Were you able to reproduce the issue on your end? I'm concerned that
> using metadata rebuilt from XML might not trigger the bug because the
> rebuilt layout differs from the original.
>
> Could you please provide the raw metadata image prior to any repairs?
> This will allow us to investigate the issue further.
>
>
> > > b) To the best of your knowledge, is this a known issue that has been
> > > fixed in a later release?
> > > I searched both the kernel bugzilla and RHEL, and I am hoping you
> > > can help me find it. Internet searches point to errata pages, but I am
> > > unable to find the exact ticket or commit that addresses this. The OCP
> > > platform was running a recent release from RHEL.
> > > Linux kernel: 5.14.0-427.109.1.el9_4 RHEL 9.4 This is likely 2 years old though.
> >
> > So far my team and I have not been able to reproduce this issue, and
> > we look to your help in confirming:
> > a) Is this already known and fixed?
> > b) What is the safest kernel to upgrade to?
> > c) Or is it still an open issue?
> >
> > >
> > > c) After reviewing the thin code and the commits in kernel.org Linux
> > > since the kernel mentioned above, I suspect that a race between
> > > discard and either IO or device creation/deletion on the same pool
> > > could cause this.
> > > Could the authors here please confirm my code reading below?
> > > ```
> >
> > As mentioned, we tried a focused reproducer around this but were unable to
> > trigger the issue.
> > At any point there are volume creations and deletions, snapshot creations
> > and deletions, discards, and regular IO on the thin pool.
> > In addition, there are calls to reserve/release the thin
> > metadata to capture the diff between volumes for backup.
> > These are serialized at our app layer, and I see that they are also
> > serialized within the lvm layer.
> >
> > Our volume deletions are two-phased. In our earlier discussion in
> > this thread we had observed high IO latency when volumes are deleted,
> > and the suggestion from this team was to discard and then delete
> > volumes to keep the deletion time short.
> >
> > I hope this gives enough context to understand this better.
> > Unfortunately, since I am unable to reproduce this and it has now been
> > sighted twice at customer sites, I have no more data points to add.
> > I would be glad to hear any suggestions I can pursue in house.
> >
> > Best regards
> >
> >
> > > *** phase 1 - userspace issues blkdiscard on thin volumes ***
> > >   dm-thin.c : thin_bio_map()
> > >     → detects REQ_OP_DISCARD
> > >     → thin_defer_bio_with_throttle(tc, bio)
> > >       → adds bio to tc->deferred_bio_list        // QUEUED, not processed
> > >       → wakes pool worker thread
> > >
> > >   dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
> > >     → process_deferred_bios()
> > >       → process_thin_deferred_bios()
> > >         → process_discard_bio()
> > >           → creates mapping, adds to pool->prepared_discards
> > >     → process_prepared(pool->prepared_discards)
> > >       → process_prepared_discard_no_passdown(m):
> > >         → dm_thin_remove_range(tc->td, begin, end)
> > >             [dm-thin-metadata.c]
> > >           → dm_btree_remove_leaves()
> > >               [dm-btree-remove.c]
> > >             → data_block_dec()                    // for each data block
> > >                 [dm-thin-metadata.c]
> > >               → dm_sm_dec_blocks()                // DECREMENTS refcount
> > >                   [dm-space-map-common.c]
> > >
> > > *** phase 2: the steps above may still be IN PROGRESS or QUEUED when
> > > userspace deletes the thin volume ***
> > >
> > >   dm-thin.c : thin_dtr()                          // dmsetup remove
> > >     → list_del_rcu(&tc->list)                     // removes from
> > >                                                   //   pool->active_thins
> > >     → synchronize_rcu()
> > >     → dm_pool_close_thin_device(tc->td)           // open_count--
> > >     → kfree(tc)                                   // tc FREED
> > >
> > >     *** does NOT flush pool workqueue ***          ← GAP 1
> > >     *** does NOT drain prepared_discards ***       ← GAP 2
> > >
> > >   dm-thin.c : process_delete_mesg()               // dmsetup message
> > >     → dm_pool_delete_thin_device(pool->pmd, dev_id)
> > >         [dm-thin-metadata.c : __delete_device()]
> > >       → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
> > >           [dm-btree-remove.c]                      //   btree
> > >         → subtree_dec()                            // cascades into:
> > >             [dm-thin-metadata.c]
> > >           → dm_btree_del()                         // walks ALL leaves
> > >               [dm-btree.c]
> > >             → data_block_dec() for EVERY remaining block
> > >                 [dm-thin-metadata.c]
> > >               → dm_sm_dec_blocks()                 // DECREMENTS refcount
> > >                   [dm-space-map-common.c]          //   for ALL blocks
> > >
> > > *** phase 3: KERNEL (worker thread, still running from phase 1) ***
> > >   dm-thin.c : do_worker()                         // ASYNC, still running
> > >     → process_prepared(pool->prepared_discards)
> > >       → process_prepared_discard_no_passdown(m):
> > >         → m->tc points to FREED tc                // ← use-after-free risk
> > >         → dm_thin_remove_range(tc->td, begin, end)
> > >             [dm-thin-metadata.c]
> > >           → dm_btree_remove_leaves()
> > >               [dm-btree-remove.c]
> > >             → data_block_dec()                    // SAME blocks already
> > >                 [dm-thin-metadata.c]              //   decremented in
> > >               → dm_sm_dec_blocks()                //   Phase 2!
> > >                   [dm-space-map-common.c]
> > >
> > >                 ┌──────────────────────────────────────────────────┐
> > >                   sm_ll_dec_bitmap():
> > >                     old = sm_lookup_bitmap(ic->bitmap, bit);
> > >                     switch (old) {
> > >                     case 0:  // ← refcount ALREADY 0
> > >                       DMERR("unable to decrement block");
> > >                       return -EINVAL;  // -22
> > >                     }
> > >                                  [dm-space-map-common.c]
> > >                 └──────────────────────────────────────────────────┘
> > >
> > >                           ▼
> > >                 dm_tm_shadow_block() fails (corrupted space map)
> > >                     [dm-transaction-manager.c]
> > >
> > >                           ▼
> > >                 dm_pool_inc_data_range() fails with -EINVAL (-22)
> > >                     [dm-thin-metadata.c]
> > >
> > >                           ▼
> > >                 metadata_operation_failed(pool, "dm_pool_inc_data_range")
> > >                     [dm-thin.c]
> > >
> > >                           ▼
> > >                 set_pool_mode(pool, PM_READ_ONLY)
> > >                     [dm-thin.c]
> > >
> > >                 *** POOL IS NOW DEAD ***
> > > ```
> > >
> > >
> > >
> > > As always, many thanks for your help.
> > >
> > >
> > > # issue unable to activate thin pool
> > > ```
> > > [Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to decrement block
> > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
> > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
> > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
> > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to decrement block
> > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to decrement block
> > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
> > > ```
> > >
> > > # host and lvm tools version
> > > ```
> > > uname -a
> > > Linux kernel: 5.14.0-427.109.1.el9_4
> > > RHEL 9.4
> > >
> > > lvm version
> > > 2.03.23(2) (2023-11-21)
> > > library: 1.02.197 (2023-11-21)
> > > driver: 4.48.1
> > > ```
> > >
> > > Below are details of the node's block layer.
> > > The workload included IO (with discards) and thin volume creations and deletions.
> > > ```
> > > [root@root core]# lvs -a pwx1
> > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > >   LV                  VG   Attr       LSize   Pool   Origin              Data%  Meta%  Move Log Cpy%Sync Convert
> > >   1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > >   103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > >   1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > >   1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
> > >   1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > >   136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > >   218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
> > >   404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
> > >   440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
> > >   462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > >   519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > >   527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > >   537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
> > >   569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > >   594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > >   660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > >   702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > >   73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > >   793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > >   79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
> > >   865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
> > >   866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > >   941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > >   960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > >   [lvol0_pmspare]     pwx1 ewi-------   2.00g
> > >   pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
> > >   pxpool              pwx1 twi-aot---   1.54t
> > >   43.59  5.06 <<< very low tmeta util.
> > >   [pxpool_tdata]      pwx1 Twi-ao----   1.54t
> > >   [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
> > >   pxreserve           pwx1 -wi------k  15.00g
> > > [root@root core]#
> > > [root@root core]# vgs pwx1
> > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > >   VG   #PV #LV #SN Attr   VSize VFree
> > >   pwx1   1  27   0 wz--n- 1.56t    0
> > > [root@root core]# lsblk -s /dev/pwx1/1004123733318649769
> > > NAME                                         MAJ:MIN RM  SIZE RO TYPE
> > > MOUNTPOINTS
> > > pwx1-1004123733318649769                     253:107  0   50G  0 lvm
> > > └─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
> > >   ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
> > >   │ └─md126                                    9:126  0  1.6T  0 raid0
> > >   │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > >   │     ├─nvme4n2                            259:5    0  1.6T  0 disk
> > >   │     ├─nvme5n2                            259:8    0  1.6T  0 disk
> > >   │     └─nvme6n2                            259:11   0  1.6T  0 disk
> > >   └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
> > >     └─md126                                    9:126  0  1.6T  0 raid0
> > >       └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > >         ├─nvme4n2                            259:5    0  1.6T  0 disk
> > >         ├─nvme5n2                            259:8    0  1.6T  0 disk
> > >         └─nvme6n2                            259:11   0  1.6T  0 disk
> > > [root@root core]# ls -al /dev/md/pwx1
> > > lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
> > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
> > > 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing
>
> The second feature flag was truncated. Does the thin-pool have
> no_discard_passdown enabled?
>
>
> > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
> > > 0 1008459776 linear 9:126 35653632
> > > 1008459776 629145600 linear 9:126 1048307712
> > > 1637605376 1673527296 linear 9:126 1681647616
> > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
> > > 0 4194304 linear 9:126 1044113408
> > > 4194304 4194304 linear 9:126 1677453312
> > > [root@root core]#
> > > [root@root core]# dmsetup table --target multipath
> > > 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
> > > 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
> > > 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
> > > 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
> > > 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
> > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1
> > > service-time 0 1 2 259:2 1 1
> > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1
> > > service-time 0 1 2 259:0 1 1
> > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1
> > > service-time 0 1 2 259:3 1 1
> > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1
> > > service-time 0 1 2 259:1 1 1
> > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3
> > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > > 259:4 1 259:7 1 259:10 1
> > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3
> > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > > 259:5 1 259:8 1 259:11 1
> > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3
> > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > > 259:6 1 259:9 1 259:12 1
> > > [root@root core]# dmsetup status --target multipath
> > > 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
> > > 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
> > > 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
> > > 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
> > > 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
> > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1
> > > 1 A 0 1 2 259:2 A 0 0 1
> > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1
> > > 1 A 0 1 2 259:0 A 0 0 1
> > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1
> > > 1 A 0 1 2 259:3 A 0 0 1
> > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1
> > > 1 A 0 1 2 259:1 A 0 0 1
> > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0
> > > 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
> > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1
> > > 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
> > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1
> > > 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
> > > [root@root core]#
> > > [root@root core]# mdadm -D /dev/md/pwx1
> > > /dev/md/pwx1:
> > >            Version : 1.2
> > >      Creation Time : Tue Mar 17 15:29:31 2026
> > >         Raid Level : raid0
> > >         Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
> > >       Raid Devices : 1
> > >      Total Devices : 1
> > >        Persistence : Superblock is persistent
> > >
> > >        Update Time : Mon Mar 23 20:52:51 2026
> > >              State : clean
> > >     Active Devices : 1
> > >    Working Devices : 1
> > >     Failed Devices : 0
> > >      Spare Devices : 0
> > >
> > >         Chunk Size : 1024K
> > >
> > > Consistency Policy : none
> > >
> > >               Name : any:pwx1
> > >               UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
> > >             Events : 16
> > >
> > >     Number   Major   Minor   RaidDevice State
> > >        0     253       10        0      active sync   /dev/dm-10
> > > [root@root core]#
> > > ```
> >
>
>

^ permalink raw reply

* Re: Reg dm thin pool metadata inconsistency
From: Ming Hung Tsai @ 2026-04-14 11:26 UTC (permalink / raw)
  To: linux-lvm
In-Reply-To: <CAFe+wq1RNJiiaak5dJycDsRBZ08ga1YQGYh4QTS9mV353YPLmw@mail.gmail.com>

Hi,

I have two questions posted inline.

On Tue, Apr 14, 2026 at 1:59 AM Lakshmi Narasimhan Sundararajan
<lsundararajan@purestorage.com> wrote:
>
> Good day!
>
> A gentle ping to hear an update from the authors here. More updates inline.
>
> On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan
> <lsundararajan@purestorage.com> wrote:
> >
> > Hi LVM Team! A very good day to you all.
> > [ I hope this email is the right one now]
> >
> > I recently experienced an outage where thin pool activation failed,
> > details are as follows.
> > Good news is, I was able to recover the pool through thin_repair.
> > Thank goodness!
> >
> > There was no infra-induced failure, i.e. no network, disk, usage over
> > limit, memory or compute being faulty or overused in any way.
> > Node was running healthy for 13 days and suddenly hit this issue.
> > Pool would handle I/O load (including discards), new volume
> > creation/deletion, and other regular activities.
> >
> > I tried to identify if there is a direct known issue, but I was unable to.
> > This generally seems to be some known issue, but I am unable to find a
> > direct link with the same signature.
> >
> > a) how to induce thin pool failures at will, so thin pool does not
> > activate, but repair succeeds, so  I can test this recovery in some
> > controlled form.
>
> I have found a way: I think I can dump the metadata to XML, modify the
> highest transaction, rewrite the metadata, and swap it into the pool to
> recreate this condition.
> Pool activation will then fail and thin_repair can correct it. If there is
> an easier way than this, please feel free to suggest it.

Were you able to reproduce the issue on your end? I'm concerned that
using metadata rebuilt from XML might not trigger the bug because the
rebuilt layout differs from the original.

Could you please provide the raw metadata image prior to any repairs?
This will allow us to investigate the issue further.


> > b) To the best of your knowledge, is this a known issue that is fixed in a later release?
> > I searched both the kernel bugzilla and RHEL, and I am hoping you
> > can help me find it. Internet searches point to errata pages, but I am
> > unable to find the exact ticket or commit that addresses this.
> > The OCP platform was running a recent release from RHEL.
> > Linux kernel: 5.14.0-427.109.1.el9_4 RHEL 9.4 This is likely 2 years old though.
>
> So far my team and I have not been able to reproduce this issue, and we
> look to your help in confirming:
> a) is this already known and fixed?
> b) what is the safest kernel to upgrade to?
> c) or is this still an open issue?
>
> >
> > c) After spending some time reviewing thin code and the commits since
> > the mentioned
> > kernel from kernel.org linux.. I suspect it could be a race with
> > discard and either IO or device creation/deletion on the same pool
> > could cause this?
> > Could the authors here, please confirm my code reading below.
> > ```
>
> As mentioned, we tried a focussed reproducer around this, but were unable to
> trigger the issue.
> There are volume creation/deletions, snapshot creations and deletions,
> discards and regular IO at any point on the thin pool.
> And in addition, there would be calls to reserve/release the thin
> metadata to capture diff for backup between volumes.
> These are serialized at our app layer and I also see these are
> serialized within lvm layer too.
>
> Our volume deletions are two-phased. Based on our earlier discussion in
> this thread, we had observed high IO latency when volumes are deleted,
> and the suggestion from this team was to discard and then delete
> volumes to keep the deletion time short.
>
> I hope this gives enough context to understand this better.
> Unfortunately, since I am unable to reproduce this and it has now been
> sighted twice at customers, I have no more datapoints to add.
> I would be glad to hear any suggestions I can pursue in house.
>
> Best regards
>
>
> > *** phase 1 - userspace issues blkdiscard on thin volumes ***
> >   dm-thin.c : thin_bio_map()
> >     → detects REQ_OP_DISCARD
> >     → thin_defer_bio_with_throttle(tc, bio)
> >       → adds bio to tc->deferred_bio_list        // QUEUED, not processed
> >       → wakes pool worker thread
> >
> >   dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
> >     → process_deferred_bios()
> >       → process_thin_deferred_bios()
> >         → process_discard_bio()
> >           → creates mapping, adds to pool->prepared_discards
> >     → process_prepared(pool->prepared_discards)
> >       → process_prepared_discard_no_passdown(m):
> >         → dm_thin_remove_range(tc->td, begin, end)
> >             [dm-thin-metadata.c]
> >           → dm_btree_remove_leaves()
> >               [dm-btree-remove.c]
> >             → data_block_dec()                    // for each data block
> >                 [dm-thin-metadata.c]
> >               → dm_sm_dec_blocks()                // DECREMENTS refcount
> >                   [dm-space-map-common.c]
> >
> > *** phase 2: the phase 1 steps may still be IN PROGRESS or QUEUED when
> > userspace deletes the thin volume ***
> >
> >   dm-thin.c : thin_dtr()                          // dmsetup remove
> >     → list_del_rcu(&tc->list)                     // removes from
> >                                                   //   pool->active_thins
> >     → synchronize_rcu()
> >     → dm_pool_close_thin_device(tc->td)           // open_count--
> >     → kfree(tc)                                   // tc FREED
> >
> >     *** does NOT flush pool workqueue ***          ← GAP 1
> >     *** does NOT drain prepared_discards ***       ← GAP 2
> >
> >   dm-thin.c : process_delete_mesg()               // dmsetup message
> >     → dm_pool_delete_thin_device(pool->pmd, dev_id)
> >         [dm-thin-metadata.c : __delete_device()]
> >       → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
> >           [dm-btree-remove.c]                      //   btree
> >         → subtree_dec()                            // cascades into:
> >             [dm-thin-metadata.c]
> >           → dm_btree_del()                         // walks ALL leaves
> >               [dm-btree.c]
> >             → data_block_dec() for EVERY remaining block
> >                 [dm-thin-metadata.c]
> >               → dm_sm_dec_blocks()                 // DECREMENTS refcount
> >                   [dm-space-map-common.c]          //   for ALL blocks
> >
> > ** phase 3: KERNEL (worker thread — still running from Phase 1) ***
> >   dm-thin.c : do_worker()                         // ASYNC, still running
> >     → process_prepared(pool->prepared_discards)
> >       → process_prepared_discard_no_passdown(m):
> >         → m->tc points to FREED tc                // ← use-after-free risk
> >         → dm_thin_remove_range(tc->td, begin, end)
> >             [dm-thin-metadata.c]
> >           → dm_btree_remove_leaves()
> >               [dm-btree-remove.c]
> >             → data_block_dec()                    // SAME blocks already
> >                 [dm-thin-metadata.c]              //   decremented in
> >               → dm_sm_dec_blocks()                //   Phase 2!
> >                   [dm-space-map-common.c]
> >
> >                 ┌──────────────────────────────────────────────────┐
> >                   sm_ll_dec_bitmap():
> >                     old = sm_lookup_bitmap(ic->bitmap, bit);
> >                     switch (old) {
> >                     case 0:  // ← refcount ALREADY 0
> >                       DMERR("unable to decrement block");
> >                       return -EINVAL;  // -22
> >                     }
> >                                  [dm-space-map-common.c]
> >                 └──────────────────────────────────────────────────┘
> >
> >                           ▼
> >                 dm_tm_shadow_block() fails (corrupted space map)
> >                     [dm-transaction-manager.c]
> >
> >                           ▼
> >                 dm_pool_inc_data_range() fails with -EINVAL (-22)
> >                     [dm-thin-metadata.c]
> >
> >                           ▼
> >                 metadata_operation_failed(pool, "dm_pool_inc_data_range")
> >                     [dm-thin.c]
> >
> >                           ▼
> >                 set_pool_mode(pool, PM_READ_ONLY)
> >                     [dm-thin.c]
> >
> >                 *** POOL IS NOW DEAD ***
> > ```
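
For readers following the hypothesis in the quoted trace, here is a toy model of the suspected ordering: a device deletion decrements every block, then a stale discard tries to decrement the same blocks again. It only illustrates why a second decrement of a zero refcount would return -EINVAL. The class and variable names are made up for illustration, this is not the kernel's dm-space-map-common.c, and it does not show that dm-thin actually allows this ordering.

```python
import errno


class ToySpaceMap:
    """Toy per-block reference counter (hypothetical, not kernel code)."""

    def __init__(self):
        self.refcount = {}

    def inc(self, block):
        self.refcount[block] = self.refcount.get(block, 0) + 1

    def dec(self, block):
        # Mirrors the "case 0" branch in the quoted trace: a block whose
        # count is already zero cannot be decremented again.
        if self.refcount.get(block, 0) == 0:
            return -errno.EINVAL
        self.refcount[block] -= 1
        return 0


sm = ToySpaceMap()
blocks = [10, 11, 12]
for b in blocks:
    sm.inc(b)  # blocks mapped once by a thin device

# Phase 2 in the trace: deleting the device decrements every mapped block.
delete_results = [sm.dec(b) for b in blocks]

# Phase 3 in the trace: a discard queued earlier now touches the same blocks.
late_discard_results = [sm.dec(b) for b in blocks]

print(delete_results, late_discard_results)
```

In the toy model the first pass succeeds and the second pass fails for every block, which matches the "unable to decrement block" signature only if the real code path can reach the same state.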
> >
> >
> >
> > As always, many thanks for your help.
> >
> >
> > # issue unable to activate thin pool
> > ```
> > [Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > ```
> >
> > # host and lvm tools version
> > ```
> > uname -a
> > Linux kernel: 5.14.0-427.109.1.el9_4
> > RHEL 9.4
> >
> > lvm version
> > 2.03.23(2) (2023-11-21)
> > library: 1.02.197 (2023-11-21)
> > driver: 4.48.1
> > ```
> >
> > Below are references to the node block layer.
> > There was IO, thin volume creations and deletions, IO includes discards too.
> > ```
> > [root@root core]# lvs -a pwx1
> >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> >   LV                  VG   Attr       LSize   Pool   Origin
> >   Data%  Meta%  Move Log Cpy%Sync Convert
> >   1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> >   103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> >   1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> >   1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
> >   1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> >   136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> >   218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
> >   404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
> >   440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
> >   462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> >   519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> >   527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> >   537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
> >   569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> >   594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> >   660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> >   702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> >   73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
> >   793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> >   79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
> >   865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
> >   866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> >   941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> >   960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> >   [lvol0_pmspare]     pwx1 ewi-------   2.00g
> >   pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
> >   pxpool              pwx1 twi-aot---   1.54t
> >   43.59  5.06 <<< very low tmeta util.
> >   [pxpool_tdata]      pwx1 Twi-ao----   1.54t
> >   [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
> >   pxreserve           pwx1 -wi------k  15.00g
> > [root@root core]#
> > [root@root core]# vgs pwx1
> >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> >   VG   #PV #LV #SN Attr   VSize VFree
> >   pwx1   1  27   0 wz--n- 1.56t    0
> > [root@root core]# lsblk -s /dev/pwx1/1004123733318649769
> > NAME                                         MAJ:MIN RM  SIZE RO TYPE
> > MOUNTPOINTS
> > pwx1-1004123733318649769                     253:107  0   50G  0 lvm
> > └─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
> >   ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
> >   │ └─md126                                    9:126  0  1.6T  0 raid0
> >   │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> >   │     ├─nvme4n2                            259:5    0  1.6T  0 disk
> >   │     ├─nvme5n2                            259:8    0  1.6T  0 disk
> >   │     └─nvme6n2                            259:11   0  1.6T  0 disk
> >   └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
> >     └─md126                                    9:126  0  1.6T  0 raid0
> >       └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> >         ├─nvme4n2                            259:5    0  1.6T  0 disk
> >         ├─nvme5n2                            259:8    0  1.6T  0 disk
> >         └─nvme6n2                            259:11   0  1.6T  0 disk
> > [root@root core]# ls -al /dev/md/pwx1
> > lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
> > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
> > 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing

The second feature flag was truncated. Does the thin-pool have
no_discard_passdown enabled?
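
A minimal sketch of how the table line above can be checked for this, assuming the usual thin-pool table layout (start, length, target, metadata dev, data dev, data block size in 512-byte sectors, low water mark, number of feature args, then the args). The function name and the returned keys are made up for illustration.

```python
def parse_thin_pool_table(line):
    """Split a 'dmsetup table' thin-pool line into named fields (sketch)."""
    fields = line.split()
    if len(fields) < 7 or fields[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    # Field 7 is the declared number of feature arguments, if present.
    declared = int(fields[7]) if len(fields) > 7 else 0
    features = fields[8:8 + declared]
    return {
        "metadata_dev": fields[3],
        "data_dev": fields[4],
        "chunk_sectors": int(fields[5]),
        "chunk_kib": int(fields[5]) * 512 // 1024,
        "low_water_mark": int(fields[6]),
        "declared_features": declared,
        "features": features,
        # True when fewer args are present than the line declares.
        "truncated": len(features) < declared,
    }


info = parse_thin_pool_table(
    "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing"
)
print(info["chunk_kib"], info["features"], info["truncated"])
print("no_discard_passdown" in info["features"])
```

For the pasted line this reports a 64 KiB chunk size and two declared feature args with only one present, so whether no_discard_passdown is set cannot be told from the paste alone.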


> > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
> > 0 1008459776 linear 9:126 35653632
> > 1008459776 629145600 linear 9:126 1048307712
> > 1637605376 1673527296 linear 9:126 1681647616
> > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
> > 0 4194304 linear 9:126 1044113408
> > 4194304 4194304 linear 9:126 1677453312
> > [root@root core]#
> > [root@root core]# dmsetup table --target multipath
> > 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
> > 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
> > 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
> > 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
> > 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
> > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1
> > service-time 0 1 2 259:2 1 1
> > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1
> > service-time 0 1 2 259:0 1 1
> > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1
> > service-time 0 1 2 259:3 1 1
> > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1
> > service-time 0 1 2 259:1 1 1
> > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3
> > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > 259:4 1 259:7 1 259:10 1
> > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3
> > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > 259:5 1 259:8 1 259:11 1
> > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3
> > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> > 259:6 1 259:9 1 259:12 1
> > [root@root core]# dmsetup status --target multipath
> > 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
> > 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
> > 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
> > 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
> > 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
> > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1
> > 1 A 0 1 2 259:2 A 0 0 1
> > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1
> > 1 A 0 1 2 259:0 A 0 0 1
> > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1
> > 1 A 0 1 2 259:3 A 0 0 1
> > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1
> > 1 A 0 1 2 259:1 A 0 0 1
> > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0
> > 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
> > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1
> > 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
> > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1
> > 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
> > [root@root core]#
> > [root@root core]# mdadm -D /dev/md/pwx1
> > /dev/md/pwx1:
> >            Version : 1.2
> >      Creation Time : Tue Mar 17 15:29:31 2026
> >         Raid Level : raid0
> >         Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
> >       Raid Devices : 1
> >      Total Devices : 1
> >        Persistence : Superblock is persistent
> >
> >        Update Time : Mon Mar 23 20:52:51 2026
> >              State : clean
> >     Active Devices : 1
> >    Working Devices : 1
> >     Failed Devices : 0
> >      Spare Devices : 0
> >
> >         Chunk Size : 1024K
> >
> > Consistency Policy : none
> >
> >               Name : any:pwx1
> >               UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
> >             Events : 16
> >
> >     Number   Major   Minor   RaidDevice State
> >        0     253       10        0      active sync   /dev/dm-10
> > [root@root core]#
> > ```
>


^ permalink raw reply

* Re: LVM-thin metadata corruption on Proxmox after RAID-5 issues – thin_repair/thin_dump fail
From: Fabian Grünbichler @ 2026-04-14 10:17 UTC (permalink / raw)
  To: Ming-Hung Tsai, Ray Davis, Zdenek Kabelac; +Cc: linux-lvm
In-Reply-To: <9d9b7ab8-4c42-4359-8f68-c7076b0c59a5@gmail.com>

On April 10, 2026 10:31 am, Zdenek Kabelac wrote:
> Dne 10. 04. 26 v 5:43 Ming-Hung Tsai napsal(a):
>> On Fri, Apr 10, 2026 at 12:08 AM Ray Davis <ray@carpe.net> wrote:
>>>
>>> Hello,
>>>
> 
>> 
>> The specific error you mentioned, "value size mismatch: expected 8,
>> but got 24 (block 13182)", suggests it's an older version of
>> thin-provisioning-tools. Hopefully, a newer version will help resolve
>> this.
>> 
>> Would you mind providing the raw metadata image for me to look into? Thank you.
>> 
>> 
> 
> 
> Hi
> 
> There might be an interesting problem - whether distros like  Debian/Ubuntu 
> noticed the switch to:
> 
> https://github.com/device-mapper-utils/thin-provisioning-tools
> 
> 
> I think you may need to gently ping the maintainers who still package the 
> original tool from the now outdated:
> 
> https://github.com/jthornber/thin-provisioning-tools
> 
> so that they start packaging a newer version of these tools.
> 
> (Possibly there should be a much much bigger reference to new repo :) if the 
> current one is blindly ignored....)

Debian unstable currently has 1.1.0, but there's already a bug open to
update to 1.2.1 which I've pinged to mention the new repository and
version available there.


^ permalink raw reply

* Re: Reg dm thin pool metadata inconsistency
From: Lakshmi Narasimhan Sundararajan @ 2026-04-13 17:49 UTC (permalink / raw)
  To: linux-lvm
In-Reply-To: <CAFe+wq3EXW0n_bLCri0zC2zVQK987D3x5LY+GiOpaLsgZxk4Rw@mail.gmail.com>

Good day!

A gentle ping to hear an update from the authors here. More updates inline.

On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan
<lsundararajan@purestorage.com> wrote:
>
> Hi LVM Team! A very good day to you all.
> [ I hope this email is the right one now]
>
> I recently experienced an outage where thin pool activation failed,
> details are as follows.
> Good news is, I was able to recover the pool through thin_repair.
> Thank goodness!
>
> There was no infra-induced failure, i.e. no network, disk, usage over
> limit, memory or compute being faulty or overused in any way.
> Node was running healthy for 13 days and suddenly hit this issue.
> Pool would handle I/O load (including discards), new volume
> creation/deletion, and other regular activities.
>
> I tried to identify if there is a direct known issue, but I was unable to.
> This generally seems to be some known issue, but I am unable to find a
> direct link with the same signature.
>
> a) how to induce thin pool failures at will, so thin pool does not
> activate, but repair succeeds, so  I can test this recovery in some
> controlled form.

I have found a way: I think I can dump the metadata to XML, modify the
highest transaction, rewrite the metadata, and swap it into the pool to
recreate this condition.
Pool activation will then fail and thin_repair can correct it. If there is
an easier way than this, please feel free to suggest it.
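
A minimal sketch of the XML edit described above, assuming the dump has a top-level superblock element carrying a transaction attribute, as thin_dump output does. The sample XML and the function name are made up for illustration. Dumping the metadata, restoring it with thin_restore, and swapping it back into the pool are not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a thin_dump XML file.
SAMPLE = """<superblock uuid="" time="0" transaction="5" flags="0" version="2" data_block_size="128" nr_data_blocks="1000">
  <device dev_id="1" mapped_blocks="0" transaction="0" creation_time="0" snap_time="0"/>
</superblock>"""


def bump_transaction(xml_text, delta=1):
    """Return the XML with the superblock transaction value changed by delta."""
    root = ET.fromstring(xml_text)
    if root.tag != "superblock":
        raise ValueError("expected a <superblock> root element")
    current = int(root.get("transaction"))
    # Only the superblock attribute is changed; per-device values stay as-is.
    root.set("transaction", str(current + delta))
    return ET.tostring(root, encoding="unicode")


modified = bump_transaction(SAMPLE)
print(ET.fromstring(modified).get("transaction"))
```

This should only ever be tried on a scratch pool, never on metadata that matters.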

> b) To the best of your knowledge, is this a known issue that is fixed in a later release?
> I searched both the kernel bugzilla and RHEL, and I am hoping you
> can help me find it. Internet searches point to errata pages, but I am
> unable to find the exact ticket or commit that addresses this.
> The OCP platform was running a recent release from RHEL.
> Linux kernel: 5.14.0-427.109.1.el9_4 RHEL 9.4 This is likely 2 years old though.

So far my team and I have not been able to reproduce this issue, and we
look to your help in confirming:
a) is this already known and fixed?
b) what is the safest kernel to upgrade to?
c) or is this still an open issue?

>
> c) After spending some time reviewing thin code and the commits since
> the mentioned
> kernel from kernel.org linux.. I suspect it could be a race with
> discard and either IO or device creation/deletion on the same pool
> could cause this?
> Could the authors here, please confirm my code reading below.
> ```

As mentioned, we tried a focussed reproducer around this, but were unable to
trigger the issue.
There are volume creation/deletions, snapshot creations and deletions,
discards and regular IO at any point on the thin pool.
And in addition, there would be calls to reserve/release the thin
metadata to capture diff for backup between volumes.
These are serialized at our app layer and I also see these are
serialized within lvm layer too.

Our volume deletions are two-phased. Based on our earlier discussion in
this thread, we had observed high IO latency when volumes are deleted,
and the suggestion from this team was to discard and then delete
volumes to keep the deletion time short.

I hope this gives enough context to understand this better.
Unfortunately, since I am unable to reproduce this and it has now been
sighted twice at customers, I have no more datapoints to add.
I would be glad to hear any suggestions I can pursue in house.

Best regards


> *** phase 1 - userspace issues blkdiscard on thin volumes ***
>   dm-thin.c : thin_bio_map()
>     → detects REQ_OP_DISCARD
>     → thin_defer_bio_with_throttle(tc, bio)
>       → adds bio to tc->deferred_bio_list        // QUEUED, not processed
>       → wakes pool worker thread
>
>   dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
>     → process_deferred_bios()
>       → process_thin_deferred_bios()
>         → process_discard_bio()
>           → creates mapping, adds to pool->prepared_discards
>     → process_prepared(pool->prepared_discards)
>       → process_prepared_discard_no_passdown(m):
>         → dm_thin_remove_range(tc->td, begin, end)
>             [dm-thin-metadata.c]
>           → dm_btree_remove_leaves()
>               [dm-btree-remove.c]
>             → data_block_dec()                    // for each data block
>                 [dm-thin-metadata.c]
>               → dm_sm_dec_blocks()                // DECREMENTS refcount
>                   [dm-space-map-common.c]
>
> *** phase 2: the phase 1 steps may still be IN PROGRESS or QUEUED when
> userspace deletes the thin volume ***
>
>   dm-thin.c : thin_dtr()                          // dmsetup remove
>     → list_del_rcu(&tc->list)                     // removes from
>                                                   //   pool->active_thins
>     → synchronize_rcu()
>     → dm_pool_close_thin_device(tc->td)           // open_count--
>     → kfree(tc)                                   // tc FREED
>
>     *** does NOT flush pool workqueue ***          ← GAP 1
>     *** does NOT drain prepared_discards ***       ← GAP 2
>
>   dm-thin.c : process_delete_mesg()               // dmsetup message
>     → dm_pool_delete_thin_device(pool->pmd, dev_id)
>         [dm-thin-metadata.c : __delete_device()]
>       → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
>           [dm-btree-remove.c]                      //   btree
>         → subtree_dec()                            // cascades into:
>             [dm-thin-metadata.c]
>           → dm_btree_del()                         // walks ALL leaves
>               [dm-btree.c]
>             → data_block_dec() for EVERY remaining block
>                 [dm-thin-metadata.c]
>               → dm_sm_dec_blocks()                 // DECREMENTS refcount
>                   [dm-space-map-common.c]          //   for ALL blocks
>
> *** phase 3 - KERNEL (worker thread, still running from Phase 1) ***
>   dm-thin.c : do_worker()                         // ASYNC, still running
>     → process_prepared(pool->prepared_discards)
>       → process_prepared_discard_no_passdown(m):
>         → m->tc points to FREED tc                // ← use-after-free risk
>         → dm_thin_remove_range(tc->td, begin, end)
>             [dm-thin-metadata.c]
>           → dm_btree_remove_leaves()
>               [dm-btree-remove.c]
>             → data_block_dec()                    // SAME blocks already
>                 [dm-thin-metadata.c]              //   decremented in
>               → dm_sm_dec_blocks()                //   Phase 2!
>                   [dm-space-map-common.c]
>
>                 ┌──────────────────────────────────────────────────┐
>                   sm_ll_dec_bitmap():
>                     old = sm_lookup_bitmap(ic->bitmap, bit);
>                     switch (old) {
>                     case 0:  // ← refcount ALREADY 0
>                       DMERR("unable to decrement block");
>                       return -EINVAL;  // -22
>                     }
>                                  [dm-space-map-common.c]
>                 └──────────────────────────────────────────────────┘
>
>                           ▼
>                 dm_tm_shadow_block() fails (corrupted space map)
>                     [dm-transaction-manager.c]
>
>                           ▼
>                 dm_pool_inc_data_range() fails with -EINVAL (-22)
>                     [dm-thin-metadata.c]
>
>                           ▼
>                 metadata_operation_failed(pool, "dm_pool_inc_data_range")
>                     [dm-thin.c]
>
>                           ▼
>                 set_pool_mode(pool, PM_READ_ONLY)
>                     [dm-thin.c]
>
>                 *** POOL IS NOW DEAD ***
> ```
>
>
>
> As always, many thanks for your help.
>
>
> # issue: unable to activate thin pool
> ```
> [Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to
> decrement block
> [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> decrement block
> [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> dm_tm_shadow_block() failed
> [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> decrement block
> [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> dm_tm_shadow_block() failed
> [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> decrement block
> [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> dm_tm_shadow_block() failed
> [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> decrement block
> [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> dm_tm_shadow_block() failed
> [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> decrement block
> [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> dm_tm_shadow_block() failed
> ```
>
> # host and lvm tools version
> ```
> uname -a
> Linux kernel: 5.14.0-427.109.1.el9_4
> RHEL 9.4
>
> lvm version
> 2.03.23(2) (2023-11-21)
> library: 1.02.197 (2023-11-21)
> driver: 4.48.1
> ```
>
> Below are details of the node's block layer.
> The workload included IO (with discards) and thin volume creations and deletions.
> ```
> [root@root core]# lvs -a pwx1
>   Please remove the lvm.conf global_filter, it is ignored with the devices file.
>   LV                  VG   Attr       LSize   Pool   Origin
>   Data%  Meta%  Move Log Cpy%Sync Convert
>   1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
>   103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
>   1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
>   1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
>   1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
>   136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
>   218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
>   404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
>   440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
>   462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
>   519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
>   527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
>   537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
>   569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
>   594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
>   660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
>   702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
>   73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
>   793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
>   79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
>   865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
>   866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
>   941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
>   960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
>   [lvol0_pmspare]     pwx1 ewi-------   2.00g
>   pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
>   pxpool              pwx1 twi-aot---   1.54t
>   43.59  5.06 <<< very low tmeta util.
>   [pxpool_tdata]      pwx1 Twi-ao----   1.54t
>   [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
>   pxreserve           pwx1 -wi------k  15.00g
> [root@root core]#
> [root@root core]# vgs pwx1
>   Please remove the lvm.conf global_filter, it is ignored with the devices file.
>   VG   #PV #LV #SN Attr   VSize VFree
>   pwx1   1  27   0 wz--n- 1.56t    0
> [root@root core]# lsblk -s /dev/pwx1/1004123733318649769
> NAME                                         MAJ:MIN RM  SIZE RO TYPE
> MOUNTPOINTS
> pwx1-1004123733318649769                     253:107  0   50G  0 lvm
> └─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
>   ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
>   │ └─md126                                    9:126  0  1.6T  0 raid0
>   │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
>   │     ├─nvme4n2                            259:5    0  1.6T  0 disk
>   │     ├─nvme5n2                            259:8    0  1.6T  0 disk
>   │     └─nvme6n2                            259:11   0  1.6T  0 disk
>   └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
>     └─md126                                    9:126  0  1.6T  0 raid0
>       └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
>         ├─nvme4n2                            259:5    0  1.6T  0 disk
>         ├─nvme5n2                            259:8    0  1.6T  0 disk
>         └─nvme6n2                            259:11   0  1.6T  0 disk
> [root@root core]# ls -al /dev/md/pwx1
> lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
> [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
> 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing
> [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
> 0 1008459776 linear 9:126 35653632
> 1008459776 629145600 linear 9:126 1048307712
> 1637605376 1673527296 linear 9:126 1681647616
> [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
> 0 4194304 linear 9:126 1044113408
> 4194304 4194304 linear 9:126 1677453312
> [root@root core]#
> [root@root core]# dmsetup table --target multipath
> 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
> 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
> 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
> 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
> 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
> eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1
> service-time 0 1 2 259:2 1 1
> eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1
> service-time 0 1 2 259:0 1 1
> eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1
> service-time 0 1 2 259:3 1 1
> eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1
> service-time 0 1 2 259:1 1 1
> eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3
> retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> 259:4 1 259:7 1 259:10 1
> eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3
> retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> 259:5 1 259:8 1 259:11 1
> eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3
> retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1
> 259:6 1 259:9 1 259:12 1
> [root@root core]# dmsetup status --target multipath
> 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
> 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
> 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
> 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
> 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
> eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1
> 1 A 0 1 2 259:2 A 0 0 1
> eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1
> 1 A 0 1 2 259:0 A 0 0 1
> eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1
> 1 A 0 1 2 259:3 A 0 0 1
> eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1
> 1 A 0 1 2 259:1 A 0 0 1
> eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0
> 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
> eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1
> 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
> eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1
> 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
> [root@root core]#
> [root@root core]# mdadm -D /dev/md/pwx1
> /dev/md/pwx1:
>            Version : 1.2
>      Creation Time : Tue Mar 17 15:29:31 2026
>         Raid Level : raid0
>         Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
>       Raid Devices : 1
>      Total Devices : 1
>        Persistence : Superblock is persistent
>
>        Update Time : Mon Mar 23 20:52:51 2026
>              State : clean
>     Active Devices : 1
>    Working Devices : 1
>     Failed Devices : 0
>      Spare Devices : 0
>
>         Chunk Size : 1024K
>
> Consistency Policy : none
>
>               Name : any:pwx1
>               UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
>             Events : 16
>
>     Number   Major   Minor   RaidDevice State
>        0     253       10        0      active sync   /dev/dm-10
> [root@root core]#
> ```


* Reg dm thin pool metadata inconsistency
From: Lakshmi Narasimhan Sundararajan @ 2026-04-11 12:11 UTC (permalink / raw)
  To: linux-lvm

Hi LVM Team! A very good day to you all.
[I hope this email address is the right one now.]

I recently experienced an outage in which thin pool activation failed;
details follow.
The good news is that I was able to recover the pool with thin_repair.
Thank goodness!

There was no infrastructure-induced failure, i.e. no network or disk
fault, no usage over limits, and no memory or compute that was faulty
or overused in any way.
The node had been running healthy for 13 days and suddenly hit this issue.
The pool was handling I/O load (including discards), volume
creation/deletion, and other regular activities.

I tried to identify whether this is a known issue, but was unable to.
It looks like a known class of problem, but I cannot find a report
with the same signature.

a) Is there a way to induce thin pool failures at will, such that the
thin pool does not activate but repair succeeds, so that I can test this
recovery in a controlled way?
b) To the best of your knowledge, is this a known issue that has been
fixed in a later release?
I searched both the kernel Bugzilla and the RHEL one, and I am hoping you
can help me find it. Internet searches point to errata pages, but I am
unable to find the exact ticket or commit that addresses this. The OCP
platform was running a recent RHEL release:
Linux kernel 5.14.0-427.109.1.el9_4, RHEL 9.4. This is likely 2 years old, though.

c) After spending some time reviewing the dm-thin code and the
kernel.org commits made since the kernel mentioned above, I suspect
that a race between discard and either IO or device creation/deletion
on the same pool could cause this.
Could the authors here please confirm my code reading below?
```
*** phase 1 - userspace issues blkdiscard on thin volumes ***
  dm-thin.c : thin_bio_map()
    → detects REQ_OP_DISCARD
    → thin_defer_bio_with_throttle(tc, bio)
      → adds bio to tc->deferred_bio_list        // QUEUED, not processed
      → wakes pool worker thread

  dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
    → process_deferred_bios()
      → process_thin_deferred_bios()
        → process_discard_bio()
          → creates mapping, adds to pool->prepared_discards
    → process_prepared(pool->prepared_discards)
      → process_prepared_discard_no_passdown(m):
        → dm_thin_remove_range(tc->td, begin, end)
            [dm-thin-metadata.c]
          → dm_btree_remove_leaves()
              [dm-btree-remove.c]
            → data_block_dec()                    // for each data block
                [dm-thin-metadata.c]
              → dm_sm_dec_blocks()                // DECREMENTS refcount
                  [dm-space-map-common.c]

*** phase 2 - the steps above may still be IN PROGRESS or QUEUED when
userspace deletes the thin volume ***

  dm-thin.c : thin_dtr()                          // dmsetup remove
    → list_del_rcu(&tc->list)                     // removes from
                                                  //   pool->active_thins
    → synchronize_rcu()
    → dm_pool_close_thin_device(tc->td)           // open_count--
    → kfree(tc)                                   // tc FREED

    *** does NOT flush pool workqueue ***          ← GAP 1
    *** does NOT drain prepared_discards ***       ← GAP 2

  dm-thin.c : process_delete_mesg()               // dmsetup message
    → dm_pool_delete_thin_device(pool->pmd, dev_id)
        [dm-thin-metadata.c : __delete_device()]
      → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
          [dm-btree-remove.c]                      //   btree
        → subtree_dec()                            // cascades into:
            [dm-thin-metadata.c]
          → dm_btree_del()                         // walks ALL leaves
              [dm-btree.c]
            → data_block_dec() for EVERY remaining block
                [dm-thin-metadata.c]
              → dm_sm_dec_blocks()                 // DECREMENTS refcount
                  [dm-space-map-common.c]          //   for ALL blocks

*** phase 3 - KERNEL (worker thread, still running from Phase 1) ***
  dm-thin.c : do_worker()                         // ASYNC, still running
    → process_prepared(pool->prepared_discards)
      → process_prepared_discard_no_passdown(m):
        → m->tc points to FREED tc                // ← use-after-free risk
        → dm_thin_remove_range(tc->td, begin, end)
            [dm-thin-metadata.c]
          → dm_btree_remove_leaves()
              [dm-btree-remove.c]
            → data_block_dec()                    // SAME blocks already
                [dm-thin-metadata.c]              //   decremented in
              → dm_sm_dec_blocks()                //   Phase 2!
                  [dm-space-map-common.c]

                ┌──────────────────────────────────────────────────┐
                  sm_ll_dec_bitmap():
                    old = sm_lookup_bitmap(ic->bitmap, bit);
                    switch (old) {
                    case 0:  // ← refcount ALREADY 0
                      DMERR("unable to decrement block");
                      return -EINVAL;  // -22
                    }
                                 [dm-space-map-common.c]
                └──────────────────────────────────────────────────┘

                          ▼
                dm_tm_shadow_block() fails (corrupted space map)
                    [dm-transaction-manager.c]

                          ▼
                dm_pool_inc_data_range() fails with -EINVAL (-22)
                    [dm-thin-metadata.c]

                          ▼
                metadata_operation_failed(pool, "dm_pool_inc_data_range")
                    [dm-thin.c]

                          ▼
                set_pool_mode(pool, PM_READ_ONLY)
                    [dm-thin.c]

                *** POOL IS NOW DEAD ***
```
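
To make the suspected ordering concrete, here is a toy Python model of
it. It is only a sketch of the hypothesis above, not kernel code:
`ToyPool`, `prepare_discard`, `delete_device` and `complete_discard` are
invented names, and the `check_mapping` flag is a hypothetical guard added
purely to contrast the two outcomes. It does not show that the kernel
actually behaves this way.

```python
import errno


class ToyPool:
    """Toy per-block reference counts for one thin device; not kernel code."""

    def __init__(self, mapped_blocks):
        self.refcount = {b: 1 for b in mapped_blocks}
        self.mapped = set(mapped_blocks)

    def _dec(self, block):
        # Models the "unable to decrement block" case: count is already 0.
        if self.refcount.get(block, 0) == 0:
            return -errno.EINVAL
        self.refcount[block] -= 1
        return 0

    def prepare_discard(self, blocks):
        # Phase 1: the discard is queued and remembers its data blocks.
        return list(blocks)

    def delete_device(self):
        # Phase 2: deleting the device releases every block still mapped.
        results = [self._dec(b) for b in sorted(self.mapped)]
        self.mapped.clear()
        return results

    def complete_discard(self, prepared, check_mapping):
        # Phase 3: the queued discard finally runs.
        results = []
        for b in prepared:
            if check_mapping and b not in self.mapped:
                continue  # mapping already gone, nothing left to release
            results.append(self._dec(b))
            self.mapped.discard(b)
        return results


pool = ToyPool([100, 101])
stale = pool.prepare_discard([100, 101])
deleted = pool.delete_device()
late = pool.complete_discard(stale, check_mapping=False)
```

In this model `deleted` is `[0, 0]` and `late` is `[-22, -22]`, i.e. the
second decrement of each block fails with -EINVAL. With
`check_mapping=True` the late discard finds nothing to release and returns
an empty list.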



As always, many thanks for your help.


# issue: unable to activate thin pool
```
[Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to decrement block
[Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
[Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
[Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to decrement block
[Wed Apr  8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to decrement block
[Wed Apr  8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to decrement block
[Wed Apr  8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
```

# host and lvm tools version
```
uname -a
Linux kernel: 5.14.0-427.109.1.el9_4
RHEL 9.4

lvm version
2.03.23(2) (2023-11-21)
library: 1.02.197 (2023-11-21)
driver: 4.48.1
```

Below are details of the node's block layer.
The workload included IO (with discards) and thin volume creations and deletions.
```
[root@root core]# lvs -a pwx1
  Please remove the lvm.conf global_filter, it is ignored with the devices file.
  LV                  VG   Attr       LSize   Pool   Origin              Data%  Meta%  Move Log Cpy%Sync Convert
  1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
  103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
  1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
  1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
  1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
  136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
  218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
  404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
  440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
  462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
  519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
  527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
  537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
  569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
  594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
  660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
  702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
  73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
  793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
  79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
  865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
  866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
  941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
  960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
  [lvol0_pmspare]     pwx1 ewi-------   2.00g
  pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
  pxpool              pwx1 twi-aot---   1.54t                            43.59  5.06 <<< very low tmeta util.
  [pxpool_tdata]      pwx1 Twi-ao----   1.54t
  [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
  pxreserve           pwx1 -wi------k  15.00g
[root@root core]#
[root@root core]# vgs pwx1
  Please remove the lvm.conf global_filter, it is ignored with the devices file.
  VG   #PV #LV #SN Attr   VSize VFree
  pwx1   1  27   0 wz--n- 1.56t    0
[root@root core]# lsblk -s /dev/pwx1/1004123733318649769
NAME                                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
pwx1-1004123733318649769                     253:107  0   50G  0 lvm
└─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
  ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
  │ └─md126                                    9:126  0  1.6T  0 raid0
  │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
  │     ├─nvme4n2                            259:5    0  1.6T  0 disk
  │     ├─nvme5n2                            259:8    0  1.6T  0 disk
  │     └─nvme6n2                            259:11   0  1.6T  0 disk
  └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
    └─md126                                    9:126  0  1.6T  0 raid0
      └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
        ├─nvme4n2                            259:5    0  1.6T  0 disk
        ├─nvme5n2                            259:8    0  1.6T  0 disk
        └─nvme6n2                            259:11   0  1.6T  0 disk
[root@root core]# ls -al /dev/md/pwx1
lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
[root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing
[root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
0 1008459776 linear 9:126 35653632
1008459776 629145600 linear 9:126 1048307712
1637605376 1673527296 linear 9:126 1681647616
[root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
0 4194304 linear 9:126 1044113408
4194304 4194304 linear 9:126 1677453312
[root@root core]#
[root@root core]# dmsetup table --target multipath
3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:2 1 1
eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:0 1 1
eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:3 1 1
eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:1 1 1
eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:4 1 259:7 1 259:10 1
eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:5 1 259:8 1 259:11 1
eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:6 1 259:9 1 259:12 1
[root@root core]# dmsetup status --target multipath
3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:2 A 0 0 1
eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:0 A 0 0 1
eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:3 A 0 0 1
eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:1 A 0 0 1
eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
[root@root core]#
[root@root core]# mdadm -D /dev/md/pwx1
/dev/md/pwx1:
           Version : 1.2
     Creation Time : Tue Mar 17 15:29:31 2026
        Raid Level : raid0
        Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
      Raid Devices : 1
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Mon Mar 23 20:52:51 2026
             State : clean
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 1024K

Consistency Policy : none

              Name : any:pwx1
              UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
            Events : 16

    Number   Major   Minor   RaidDevice State
       0     253       10        0      active sync   /dev/dm-10
[root@root core]#
```
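
For readers decoding the `dmsetup table` line of the pool shown above, a
small sketch follows. It assumes the thin-pool table layout from the
kernel's thin-provisioning documentation (metadata dev, data dev, data
block size in 512-byte sectors, low water mark, feature count, feature
names); the class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List

SECTOR_BYTES = 512  # device-mapper tables are expressed in 512-byte sectors


@dataclass
class ThinPoolTable:
    start: int
    length_sectors: int
    metadata_dev: str
    data_dev: str
    block_sectors: int    # thin-pool chunk size, in sectors
    low_water_mark: int   # in chunks
    feature_count: int
    features: List[str]

    @property
    def chunk_bytes(self) -> int:
        return self.block_sectors * SECTOR_BYTES

    @property
    def nr_chunks(self) -> int:
        return self.length_sectors // self.block_sectors


def parse_thin_pool_table(line: str) -> ThinPoolTable:
    """Split one thin-pool table line into its fields (hypothetical helper)."""
    f = line.split()
    if len(f) < 7 or f[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    count = int(f[7]) if len(f) > 7 else 0
    # The pasted line announces 2 feature args but shows only one name,
    # so keep whatever names are present instead of enforcing the count.
    return ThinPoolTable(int(f[0]), int(f[1]), f[3], f[4],
                         int(f[5]), int(f[6]), count, f[8:])


pool = parse_thin_pool_table(
    "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing")
```

For the pool line above this gives `pool.chunk_bytes == 65536` (a 64 KiB
chunk size) and `pool.nr_chunks == 25868224`, which is about 1.54 TiB and
in line with the `lvs` output.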


* Re: LVM-thin metadata corruption on Proxmox after RAID-5 issues – thin_repair/thin_dump fail
From: Zdenek Kabelac @ 2026-04-10  8:31 UTC (permalink / raw)
  To: Ming-Hung Tsai, Ray Davis; +Cc: linux-lvm
In-Reply-To: <CAAYit8T1jQ-35GApTVRiq2wYd5G2R8fXSm1aEjNcgR8Oi=oC7g@mail.gmail.com>

Dne 10. 04. 26 v 5:43 Ming-Hung Tsai napsal(a):
> On Fri, Apr 10, 2026 at 12:08 AM Ray Davis <ray@carpe.net> wrote:
>>
>> Hello,
>>

> 
> The specific error you mentioned, "value size mismatch: expected 8,
> but got 24 (block 13182)", suggests it's an older version of
> thin-provisioning-tools. Hopefully, a newer version will help resolve
> this.
> 
> Would you mind providing the raw metadata image for me to look into? Thank you.
> 
> 


Hi

There might be an interesting problem here: whether distros like Debian/Ubuntu
have noticed the switch to:

https://github.com/device-mapper-utils/thin-provisioning-tools


I think you may need to gently ping the maintainers of the packaged tool,
which is still based on the now outdated:

https://github.com/jthornber/thin-provisioning-tools

so that they start packaging a newer version of these tools.

(Possibly there should be a much, much bigger pointer to the new repo :) if the
current one is blindly ignored....)


Zdenek



* Re[2]: LVM-thin metadata corruption on Proxmox after RAID-5 issues – thin_repair/thin_dump fail
From: Ray Davis @ 2026-04-10  8:29 UTC (permalink / raw)
  To: Ming-Hung Tsai; +Cc: linux-lvm
In-Reply-To: <CAAYit8T1jQ-35GApTVRiq2wYd5G2R8fXSm1aEjNcgR8Oi=oC7g@mail.gmail.com>

Hi Ming-Hung,

Thanks for the reply!  The thin-provisioning-tools are version 0.9.0-2.
Maybe I need a better repository? Here is the sources.list:

deb http://ftp.de.debian.org/debian bookworm main contrib
deb http://ftp.de.debian.org/debian bookworm-updates main contrib
deb http://security.debian.org bookworm-security main contrib
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription

The raw metadata is about 17 GB.  Would it be better for me to make it 
available for download or just let you log into the machine directly?

Thanks,
Ray


------ Original Message ------
From "Ming-Hung Tsai" <mingnus@gmail.com>
To "Ray Davis" <ray@carpe.net>
Cc linux-lvm@lists.linux.dev
Date 10.04.2026 05:43:35
Subject Re: LVM-thin metadata corruption on Proxmox after RAID-5 issues 
– thin_repair/thin_dump fail

>On Fri, Apr 10, 2026 at 12:08 AM Ray Davis <ray@carpe.net> wrote:
>>
>>  Hello,
>>
>>  I’m looking for advice on an LVM-thin metadata corruption case on a
>>  Proxmox host. I have stopped further repair attempts and preserved the
>>  current state for analysis.
>>
>>  Environment
>>
>>  * Proxmox VE host
>>  * Thin pool: `VMDATA0/VMDATA0`
>>  * Backing storage had RAID-5 issues involving two disks
>>  * After reseating the disks, the RAID came back online, but the thin
>>  pool would no longer activate
>>
>>  Original Proxmox/LVM error
>>  `activating LV 'VMDATA0/VMDATA0' failed: Check of pool VMDATA0/VMDATA0
>>  failed (status:1). Manual repair required!`
>>
>>  What I tried
>>
>>  1. `vgcfgbackup VMDATA0`
>>  2. `lvconvert --repair VMDATA0/VMDATA0`
>>
>>  This failed with:
>>  `value size mismatch: expected 8, but got 24 (block 13182)`
>>
>>  At that point I added temporary VG space on a USB disk so I could try
>>  manual metadata recovery.
>>
>>  Manual recovery steps attempted
>>
>>  * Created temporary metadata LVs
>>  * Used `lvconvert --swapmetadata` to extract the pool metadata
>>  * Activated the extracted metadata LV
>>  * Preserved a raw image of the extracted metadata
>>  * Ran `thin_check`
>>  * Ran `thin_repair`
>>  * Ran `thin_dump --repair`
>>
>>  Current results
>>
>>  `thin_check /dev/VMDATA0/meta_extract` reports:
>>
>>  * `missing devices: [0, -]`
>>  * `bad checksum in btree node (block 79511)`
>>  * `missing all mappings for devices: [0, -]`
>>  * `bad checksum in btree node (block 79506)`
>>
>>  `thin_repair -i /dev/VMDATA0/meta_extract -o /dev/VMDATA0/repair_meta`
>>  fails with:
>>  `value size mismatch: expected 8, but got 24 (block 13182)`
>>
>>  `thin_dump --repair -o /root/VMDATA0_repaired.xml
>>  /dev/VMDATA0/meta_extract` fails with the same:
>>  `value size mismatch: expected 8, but got 24 (block 13182)`
>>
>>  Current LV state at the time of extraction looked like this:
>>
>>  * thin pool `VMDATA0`
>>  * extracted metadata LV `meta_extract`
>>  * fresh target metadata LV `repair_meta`
>>
>>  What I have preserved
>>
>>  * Raw metadata image from the extracted metadata LV:
>>     `VMDATA0_meta_extract.raw`
>>  * A second raw copy from the extracted LV device
>>  * `vgcfgbackup` output and LVM archive files
>>  * Diagnostics bundle with:
>>
>>     * `pvs`, `vgs`, `lvs`
>>     * `thin_check` output
>>     * `dmesg`
>>     * kernel journal
>>     * `lvm version`
>>     * checksums
>>
>>  I can provide access to the full case directory or a tarball over HTTP
>>  if someone is willing to look at it. I would prefer to share the link
>>  privately with anyone interested.
>>
>>  My main questions are:
>>
>>  1. Is there any remaining offline recovery path worth trying with the
>>  standard dm-thin/LVM tools?
>>  2. Does this failure pattern usually indicate that the mapping metadata
>>  is beyond recovery by `thin_repair`/`thin_dump`?
>>  3. Would it be useful to inspect the raw metadata image further, and if
>>  so, what specific tools or commands would you recommend next?
>>
>>  I can provide any command output that would be useful.
>>
>>  Thanks very much for any guidance,
>>  Ray
>>
>
>Hi,
>
>The specific error you mentioned, "value size mismatch: expected 8,
>but got 24 (block 13182)", suggests it's an older version of
>thin-provisioning-tools. Hopefully, a newer version will help resolve
>this.
>
>Would you mind providing the raw metadata image for me to look into? Thank you.
>
>
>Hank

^ permalink raw reply

* Re[2]: LVM-thin metadata corruption on Proxmox after RAID-5 issues – thin_repair/thin_dump fail
From: Ray Davis @ 2026-04-10  8:19 UTC (permalink / raw)
  To: linux-lvm
In-Reply-To: <77f3dec3-7050-4d57-8928-7c7b68a2537e@gmail.com>

Hi Zdenek,

Thanks for the reply!  Here are the missing version numbers:

root@proxmox0:~# uname -a
Linux proxmox0 6.8.12-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 
(2024-12-03T10:26Z) x86_64 GNU/Linux

/etc/os-release says "Debian GNU/Linux 12 (bookworm)".

root@proxmox0:~# lvm version
   LVM version:     2.03.16(2) (2022-05-18)
   Library version: 1.02.185 (2022-05-18)
   Driver version:  4.48.0
   Configuration:   ./configure --build=x86_64-linux-gnu --prefix=/usr 
--includedir=$/include --mandir=$/share/man 
--infodir=$/share/info --sysconfdir=/etc --localstatedir=/var 
--disable-option-checking --disable-silent-rules 
--libdir=$/lib/x86_64-linux-gnu --runstatedir=/run 
--disable-maintainer-mode --disable-dependency-tracking 
--libdir=/lib/x86_64-linux-gnu --sbindir=/sbin 
--with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 
--with-cache=internal --with-device-uid=0 --with-device-gid=6 
--with-device-mode=0660 --with-default-pid-dir=/run 
--with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm 
--with-thin=internal --with-thin-check=/usr/sbin/thin_check 
--with-thin-dump=/usr/sbin/thin_dump 
--with-thin-repair=/usr/sbin/thin_repair --with-udev-prefix=/ 
--enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd 
--enable-editline --enable-lvmlockd-dlm --enable-lvmlockd-sanlock 
--enable-lvmpolld --enable-notify-dbus --enable-pkgconfig 
--enable-udev_rules --enable-udev_sync --disable-readline

root@proxmox0:~# thin_repair -V
0.9.0

I did try to "apt install --only-upgrade thin-provisioning-tools", but 
it said "thin-provisioning-tools is already the newest version 
(0.9.0-2)".  Maybe I need a better repository?  Here is the 
sources.list:

deb http://ftp.de.debian.org/debian bookworm main contrib
deb http://ftp.de.debian.org/debian bookworm-updates main contrib
deb http://security.debian.org bookworm-security main contrib
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription

The damage was caused by two disks of a six disk raid-5 array going 
offline for some reason.  They are both back online and the controller 
rebuilt the raid-5 without any obvious errors.

Thanks,
Ray

------ Original Message ------
From "Zdenek Kabelac" <zdenek.kabelac@gmail.com>
To "Ray Davis" <ray@carpe.net>; linux-lvm@lists.linux.dev
Date 09.04.2026 22:32:56
Subject Re: LVM-thin metadata corruption on Proxmox after RAID-5 issues 
– thin_repair/thin_dump fail

>Dne 09. 04. 26 v 17:55 Ray Davis napsal(a):
>>Hello,
>>
>>I’m looking for advice on an LVM-thin metadata corruption case on a Proxmox host. I have stopped further repair attempts and preserved the current state for analysis.
>>
>>Environment
>>
>>* Proxmox VE host
>>* Thin pool: `VMDATA0/VMDATA0`
>>* Backing storage had RAID-5 issues involving two disks
>>* After reseating the disks, the RAID came back online, but the thin pool would no longer activate
>>>
>>* Created temporary metadata LVs
>>* Used `lvconvert --swapmetadata` to extract the pool metadata
>>* Activated the extracted metadata LV
>>* Preserved a raw image of the extracted metadata
>>* Ran `thin_check`
>>* Ran `thin_repair`
>>* Ran `thin_dump --repair`
>>
>>I can provide access to the full case directory or a tarball over HTTP if someone is willing to look at it. I would prefer to share the link privately with anyone interested.
>>
>>My main questions are:
>>
>>1. Is there any remaining offline recovery path worth trying with the standard dm-thin/LVM tools?
>
>Hi
>
>Very extensive report - but it looks like the key element here would be -
>what are the versions in use.
>
>
>Kernel ?
>
>lvm2 ?
>
>thin_repair -V  ?
>
>
>
>>2. Does this failure pattern usually indicate that the mapping metadata is beyond recovery by `thin_repair`/`thin_dump`?
>>3. Would it be useful to inspect the raw metadata image further, and if so, what specific tools or commands would you recommend next?
>>
>>I can provide any command output that would be useful.
>
>If you are using the latest tools, then the authors of these tools will need
>access to the damaged metadata to see whether there is something that can be recovered.
>
>It would also help greatly to know what kind of 'damage' may have caused this.
>
>Regards
>
>Zdenek
>

^ permalink raw reply

* Re: LVM-thin metadata corruption on Proxmox after RAID-5 issues – thin_repair/thin_dump fail
From: Ming-Hung Tsai @ 2026-04-10  3:43 UTC (permalink / raw)
  To: Ray Davis; +Cc: linux-lvm
In-Reply-To: <em2752ec1c-6e02-42bf-80e8-bb92f63e8f73@carpe.net>

On Fri, Apr 10, 2026 at 12:08 AM Ray Davis <ray@carpe.net> wrote:
>
> Hello,
>
> I’m looking for advice on an LVM-thin metadata corruption case on a
> Proxmox host. I have stopped further repair attempts and preserved the
> current state for analysis.
>
> Environment
>
> * Proxmox VE host
> * Thin pool: `VMDATA0/VMDATA0`
> * Backing storage had RAID-5 issues involving two disks
> * After reseating the disks, the RAID came back online, but the thin
> pool would no longer activate
>
> Original Proxmox/LVM error
> `activating LV 'VMDATA0/VMDATA0' failed: Check of pool VMDATA0/VMDATA0
> failed (status:1). Manual repair required!`
>
> What I tried
>
> 1. `vgcfgbackup VMDATA0`
> 2. `lvconvert --repair VMDATA0/VMDATA0`
>
> This failed with:
> `value size mismatch: expected 8, but got 24 (block 13182)`
>
> At that point I added temporary VG space on a USB disk so I could try
> manual metadata recovery.
>
> Manual recovery steps attempted
>
> * Created temporary metadata LVs
> * Used `lvconvert --swapmetadata` to extract the pool metadata
> * Activated the extracted metadata LV
> * Preserved a raw image of the extracted metadata
> * Ran `thin_check`
> * Ran `thin_repair`
> * Ran `thin_dump --repair`
>
> Current results
>
> `thin_check /dev/VMDATA0/meta_extract` reports:
>
> * `missing devices: [0, -]`
> * `bad checksum in btree node (block 79511)`
> * `missing all mappings for devices: [0, -]`
> * `bad checksum in btree node (block 79506)`
>
> `thin_repair -i /dev/VMDATA0/meta_extract -o /dev/VMDATA0/repair_meta`
> fails with:
> `value size mismatch: expected 8, but got 24 (block 13182)`
>
> `thin_dump --repair -o /root/VMDATA0_repaired.xml
> /dev/VMDATA0/meta_extract` fails with the same:
> `value size mismatch: expected 8, but got 24 (block 13182)`
>
> Current LV state at the time of extraction looked like this:
>
> * thin pool `VMDATA0`
> * extracted metadata LV `meta_extract`
> * fresh target metadata LV `repair_meta`
>
> What I have preserved
>
> * Raw metadata image from the extracted metadata LV:
>    `VMDATA0_meta_extract.raw`
> * A second raw copy from the extracted LV device
> * `vgcfgbackup` output and LVM archive files
> * Diagnostics bundle with:
>
>    * `pvs`, `vgs`, `lvs`
>    * `thin_check` output
>    * `dmesg`
>    * kernel journal
>    * `lvm version`
>    * checksums
>
> I can provide access to the full case directory or a tarball over HTTP
> if someone is willing to look at it. I would prefer to share the link
> privately with anyone interested.
>
> My main questions are:
>
> 1. Is there any remaining offline recovery path worth trying with the
> standard dm-thin/LVM tools?
> 2. Does this failure pattern usually indicate that the mapping metadata
> is beyond recovery by `thin_repair`/`thin_dump`?
> 3. Would it be useful to inspect the raw metadata image further, and if
> so, what specific tools or commands would you recommend next?
>
> I can provide any command output that would be useful.
>
> Thanks very much for any guidance,
> Ray
>

Hi,

The specific error you mentioned, "value size mismatch: expected 8,
but got 24 (block 13182)", suggests it's an older version of
thin-provisioning-tools. Hopefully, a newer version will help resolve
this.

Would you mind providing the raw metadata image for me to look into? Thank you.


Hank

^ permalink raw reply

* Re: LVM-thin metadata corruption on Proxmox after RAID-5 issues – thin_repair/thin_dump fail
From: Zdenek Kabelac @ 2026-04-09 20:32 UTC (permalink / raw)
  To: Ray Davis, linux-lvm
In-Reply-To: <em2752ec1c-6e02-42bf-80e8-bb92f63e8f73@carpe.net>

Dne 09. 04. 26 v 17:55 Ray Davis napsal(a):
> Hello,
> 
> I’m looking for advice on an LVM-thin metadata corruption case on a Proxmox 
> host. I have stopped further repair attempts and preserved the current state 
> for analysis.
> 
> Environment
> 
> * Proxmox VE host
> * Thin pool: `VMDATA0/VMDATA0`
> * Backing storage had RAID-5 issues involving two disks
> * After reseating the disks, the RAID came back online, but the thin pool 
> would no longer activate
>> 
> * Created temporary metadata LVs
> * Used `lvconvert --swapmetadata` to extract the pool metadata
> * Activated the extracted metadata LV
> * Preserved a raw image of the extracted metadata
> * Ran `thin_check`
> * Ran `thin_repair`
> * Ran `thin_dump --repair`
> 
> I can provide access to the full case directory or a tarball over HTTP if 
> someone is willing to look at it. I would prefer to share the link privately 
> with anyone interested.
> 
> My main questions are:
> 
> 1. Is there any remaining offline recovery path worth trying with the standard 
> dm-thin/LVM tools?

Hi

Very extensive report - but it looks like the key element here would be -
what are the versions in use.


Kernel ?

lvm2 ?

thin_repair -V  ?



> 2. Does this failure pattern usually indicate that the mapping metadata is 
> beyond recovery by `thin_repair`/`thin_dump`?
> 3. Would it be useful to inspect the raw metadata image further, and if so, 
> what specific tools or commands would you recommend next?
> 
> I can provide any command output that would be useful.

If you are using the latest tools, then the authors of these tools will need
access to the damaged metadata to see whether there is something that can be recovered.

It would also help greatly to know what kind of 'damage' may have caused this.

Regards

Zdenek


^ permalink raw reply

* LVM-thin metadata corruption on Proxmox after RAID-5 issues – thin_repair/thin_dump fail
From: Ray Davis @ 2026-04-09 15:55 UTC (permalink / raw)
  To: linux-lvm

Hello,

I’m looking for advice on an LVM-thin metadata corruption case on a 
Proxmox host. I have stopped further repair attempts and preserved the 
current state for analysis.

Environment

* Proxmox VE host
* Thin pool: `VMDATA0/VMDATA0`
* Backing storage had RAID-5 issues involving two disks
* After reseating the disks, the RAID came back online, but the thin 
pool would no longer activate

Original Proxmox/LVM error
`activating LV 'VMDATA0/VMDATA0' failed: Check of pool VMDATA0/VMDATA0 
failed (status:1). Manual repair required!`

What I tried

1. `vgcfgbackup VMDATA0`
2. `lvconvert --repair VMDATA0/VMDATA0`

This failed with:
`value size mismatch: expected 8, but got 24 (block 13182)`

At that point I added temporary VG space on a USB disk so I could try 
manual metadata recovery.

Manual recovery steps attempted

* Created temporary metadata LVs
* Used `lvconvert --swapmetadata` to extract the pool metadata
* Activated the extracted metadata LV
* Preserved a raw image of the extracted metadata
* Ran `thin_check`
* Ran `thin_repair`
* Ran `thin_dump --repair`

Current results

`thin_check /dev/VMDATA0/meta_extract` reports:

* `missing devices: [0, -]`
* `bad checksum in btree node (block 79511)`
* `missing all mappings for devices: [0, -]`
* `bad checksum in btree node (block 79506)`

`thin_repair -i /dev/VMDATA0/meta_extract -o /dev/VMDATA0/repair_meta` 
fails with:
`value size mismatch: expected 8, but got 24 (block 13182)`

`thin_dump --repair -o /root/VMDATA0_repaired.xml 
/dev/VMDATA0/meta_extract` fails with the same:
`value size mismatch: expected 8, but got 24 (block 13182)`

Current LV state at the time of extraction looked like this:

* thin pool `VMDATA0`
* extracted metadata LV `meta_extract`
* fresh target metadata LV `repair_meta`

What I have preserved

* Raw metadata image from the extracted metadata LV:
   `VMDATA0_meta_extract.raw`
* A second raw copy from the extracted LV device
* `vgcfgbackup` output and LVM archive files
* Diagnostics bundle with:

   * `pvs`, `vgs`, `lvs`
   * `thin_check` output
   * `dmesg`
   * kernel journal
   * `lvm version`
   * checksums
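
For reference, a raw image with a matching checksum can be taken roughly as
follows. This is only a sketch: `preserve_image` is a hypothetical helper
name (it is not part of LVM or thin-provisioning-tools), and the destination
directory in the usage comment is a placeholder.

```shell
# preserve_image SRC DST: copy SRC to DST, verify the copy by checksum,
# and store the checksum next to the image. Hypothetical helper.
preserve_image() {
    src=$1; dst=$2
    # conv=fsync flushes the copy to disk before dd exits (GNU dd).
    dd if="$src" of="$dst" bs=1M conv=fsync status=none || return 1
    src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
    dst_sum=$(sha256sum < "$dst" | cut -d' ' -f1)
    [ "$src_sum" = "$dst_sum" ] || return 1
    echo "$dst_sum  $dst" > "$dst.sha256"
}

# Usage (placeholder destination path):
#   preserve_image /dev/VMDATA0/meta_extract /root/VMDATA0_meta_extract.raw
```

All further experiments can then run against a copy of the image rather
than against the extracted LV itself.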

I can provide access to the full case directory or a tarball over HTTP 
if someone is willing to look at it. I would prefer to share the link 
privately with anyone interested.

My main questions are:

1. Is there any remaining offline recovery path worth trying with the 
standard dm-thin/LVM tools?
2. Does this failure pattern usually indicate that the mapping metadata 
is beyond recovery by `thin_repair`/`thin_dump`?
3. Would it be useful to inspect the raw metadata image further, and if 
so, what specific tools or commands would you recommend next?

I can provide any command output that would be useful.

Thanks very much for any guidance,
Ray

^ permalink raw reply

* Re: is it possible to use LVM AFTER system installation?
From: Christian Recktenwald @ 2026-02-21 20:20 UTC (permalink / raw)
  To: Frédéric Baldit; +Cc: linux-lvm
In-Reply-To: <20260221185816.34fab009@ThinkPadT15g.lan>

On Sat, Feb 21, 2026 at 06:58:16PM +0100, Frédéric Baldit wrote:
> 
> Hi everybody,
> 
> I'm a debian 12 user, running it on a system which was initially 
> installed without LVM (on a thinkpad laptop) with classical netinst. My
> system is totally installed on a unique nvme disk, with distinct
> partitions, among which a /home partition for all my personal data.
> 
> I recently installed a second nvme disk (512G capacity). I would like
> to extend my /home ext4 partition (which is now exclusively on
> /dev/nvme0n1p6, 436G) so that I can keep my old data but have
> the possibility to split /home on the initial disk and the second
> recently added disk.
> 
> This seems to be possible with LVM but only when decided at the
> beginning of the installation. Right or wrong?
> 
> Or would it be possible to:

create 3 partitions:
1MB for boot loader (grub)
512MB for UEFI (ESP)
rest: LVM PV

Something like:
  VG=vg01
  LVN=lvhome
  DRV=/dev/nvme....
  PART2=/dev/nvme....
  PART3=/dev/nvme....
  OLDHOME=/dev/nvme0n1p6 # your current /home partition

  SZ=...G # slightly larger (like 1GB) than your current /home

  parted    $DRV mklabel gpt
  # those values are crucial:
  parted -s $DRV unit s mkpart primary      34     2047 set 1 bios_grub on
  parted -s $DRV unit s mkpart primary    2048  1050623 set 2 esp on
  parted -s $DRV unit s mkpart primary 1050624    100% set 3 lvm on
  partprobe $DRV

  mkfs.fat -F32 $PART2
  pvcreate $PART3
  vgcreate $VG $PART3
  lvcreate $VG -n $LVN -L $SZ

  umount /home
  dd if=$OLDHOME of=/dev/$VG/$LVN bs=10M 
  fsck -f /dev/$VG/$LVN    # resize2fs wants a freshly checked filesystem
  resize2fs /dev/$VG/$LVN
  $EDITOR /etc/fstab
  mount /home

  live happily ever after...

  cherry on top:
  - mount partition 2 somewhere
    copy the contents of your existing efi boot partition there
    umount partition 2
  - run grub to install on the new drive
  Why? If you ever have to replace the smaller disk,
  the preparations to get the new disk bootable are already done
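
The sector numbers in the parted calls translate into sizes as follows. A
small sketch, assuming 512-byte logical sectors (an assumption; check with
`blockdev --getss $DRV`). `sectors_to_bytes` is just an illustrative helper
name.

```shell
# sectors_to_bytes FIRST LAST [SECTOR_SIZE]: size of an inclusive sector range
sectors_to_bytes() {
    first=$1; last=$2; sector_size=${3:-512}
    echo $(( (last - first + 1) * sector_size ))
}

sectors_to_bytes 34 2047        # bios_grub partition: 1031168 (just under 1 MiB)
sectors_to_bytes 2048 1050623   # ESP: 536870912 (exactly 512 MiB)
```

Starting the third partition at sector 1050624 keeps it aligned to 1 MiB.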
  
-- 
Christian Recktenwald      : voice +49 711 601 2091  : Böblinger Strasse 189
chris@citecs.de            : mobil +49 172 711 8104  : D-70199 Stuttgart

^ permalink raw reply

* Re: is it possible to use LVM AFTER system installation?
From: Roger Heflin @ 2026-02-21 18:35 UTC (permalink / raw)
  To: Frédéric Baldit; +Cc: linux-lvm
In-Reply-To: <20260221185816.34fab009@ThinkPadT15g.lan>

That seems pretty close to right.  For step 1, I would label the
partition as LVM (not that it really matters, since the labels do not
seem to be actively used in Linux).  I did not look at every last step
in detail, but that is pretty close to what should work if executed
correctly.

On step 7 you probably need to execute it with the primary user logged
out and from a non-GUI login (or a root login, assuming root's home is
/root and not in /home) to get a clean copy.

You may need to make sure your initramfs has LVM built into it, though
it may not really matter for /home, given that it can be mounted much
later than, say, / (if using LVM for /).
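
For illustration, the quoted steps 2 to 9 could be sketched as a dry run
like the one below. This is a sketch under assumptions, not a tested
procedure: the new partition name `/dev/nvme1n1p1` and the VG/LV names
`vghome`/`lvhome` are hypothetical, and `run` and `plan_home_migration` are
made-up helper names. By default nothing is executed; the commands are only
printed.

```shell
#!/bin/sh
# Dry-run sketch of the quoted steps 2-9. By default it only prints the
# commands; set DRY_RUN=0 to really execute them (as root, after a backup).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

plan_home_migration() {
    new_part=$1   # partition on the new disk (hypothetical name below)
    old_part=$2   # partition currently holding /home
    vg=$3; lv=$4  # hypothetical VG and LV names

    run pvcreate "$new_part"                      # step 2
    run vgcreate "$vg" "$new_part"                # step 3
    run lvcreate -n "$lv" -l 100%FREE "$vg"       # step 4
    run mkfs.ext4 "/dev/$vg/$lv"                  # step 5
    run mkdir -p /home1
    run mount "/dev/$vg/$lv" /home1               # step 6
    run rsync -aHAX --numeric-ids /home/ /home1/  # step 7
    # Verify the copy before going on: pvcreate wipes the old filesystem.
    run umount /home
    run pvcreate "$old_part"                      # step 8
    run vgextend "$vg" "$old_part"
    run lvextend -r -l +100%FREE "/dev/$vg/$lv"   # step 9; -r also grows ext4
}

plan_home_migration /dev/nvme1n1p1 /dev/nvme0n1p6 vghome lvhome
```

Step 10 then comes down to pointing the /home entry in /etc/fstab at the
new LV and remounting it there.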

On Sat, Feb 21, 2026 at 11:58 AM Frédéric Baldit
<frederic.baldit@free.fr> wrote:
>
>
> Hi everybody,
>
> I'm a debian 12 user, running it on a system which was initially
> installed without LVM (on a thinkpad laptop) with classical netinst. My
> system is totally installed on a unique nvme disk, with distinct
> partitions, among which a /home partition for all my personal data.
>
> I recently installed a second nvme disk (512G capacity). I would like
> to extend my /home ext4 partition (which is now exclusively on
> /dev/nvme0n1p6, 436G) so that I can keep my old data but have
> the possibility to split /home on the initial disk and the second
> recently added disk.
>
> This seems to be possible with LVM but only when decided at the
> beginning of the installation. Right or wrong?
>
> Or would it be possible to:
>
> 0) install lvm2 on my system
> 1) create a new ext4 partition (512G size) on the new nvme disk (using
> all available space, 512G, on it)
> 2) create a new PV with this partition
> 3) create a new VG containing this unique PV
> 4) create a new LV on this VG
> 5) format this LV as an ext4 filesystem with mkfs.ext4
> 6) mount this LV on a new system directory, say /home1
> 7) migrate all the data on my old /home to the new /home1
> 8) modify the VG created before in order to add to it the
> /dev/nvme0n1p6 PV
> 9) resize the LV so that it can use the new PV added
> 10) rename /home1 to /home
>
> Is this globally correct??? Step 7 is crucial, as I really don't want
> any loss or corruption of personal data during the transfer.
>
> Thanks in advance for any help with my two questions,
>
> F. Baldit.
>
>
>
> --
>   Frédéric Baldit
>

^ permalink raw reply

* is it possible to use LVM AFTER system installation?
From: Frédéric Baldit @ 2026-02-21 17:58 UTC (permalink / raw)
  To: linux-lvm


Hi everybody,

I'm a Debian 12 user, running it on a system which was initially 
installed without LVM (on a ThinkPad laptop) with the classical netinst. My
system is installed entirely on a single NVMe disk, with distinct
partitions, among which is a /home partition for all my personal data.

I recently installed a second nvme disk (512G capacity). I would like
to extend my /home ext4 partition (which is now exclusively on
/dev/nvme0n1p6, 436G) so that I can keep my old data but have
the possibility to split /home on the initial disk and the second
recently added disk.

This seems to be possible with LVM but only when decided at the
beginning of the installation. Right or wrong?

Or would it be possible to:

0) install lvm2 on my system
1) create a new ext4 partition (512G size) on the new nvme disk (using
all available space, 512G, on it)
2) create a new PV with this partition
3) create a new VG containing this unique PV
4) create a new LV on this VG
5) format this LV as an ext4 filesystem with mkfs.ext4
6) mount this LV on a new system directory, say /home1
7) migrate all the data on my old /home to the new /home1
8) modify the VG created before in order to add to it the
/dev/nvme0n1p6 PV
9) resize the LV so that it can use the new PV added
10) rename /home1 to /home

Is this globally correct??? Step 7 is crucial, as I really don't want
any loss or corruption of personal data during the transfer.

Thanks in advance for any help with my two questions, 

F. Baldit. 



--
  Frédéric Baldit

^ permalink raw reply

* Re: [RFC PATCH 2/2] swsusp: make it possible to hibernate to device mapper devices
From: Askar Safin @ 2026-01-14  7:27 UTC (permalink / raw)
  To: mpatocka
  Cc: Dell.Client.Kernel, agk, brauner, dm-devel, ebiggers, kix,
	linux-block, linux-btrfs, linux-crypto, linux-lvm, linux-mm,
	linux-pm, linux-raid, lvm-devel, milan, msnitzer, mzxreary,
	nphamcs, pavel, rafael, ryncsn, torvalds
In-Reply-To: <b32d0701-4399-9c5d-ecc8-071162df97a7@redhat.com>

Mikulas Patocka <mpatocka@redhat.com>:
> Askar Safin requires swap and hibernation on the dm-integrity device mapper
> target because he needs to protect his data.

Now I see that your approach is valid. (But some small changes are needed.)

[[ TL;DR: your approach is good. I kindly ask you to continue with this patch.
The needed changes are in the section "Needed changes". ]]

Let me explain why I initially rejected your patch and why I now think it is good.


= Why I rejected =

In your patch, the "notify_swap_device" call is located before "pm_restrict_gfp_mask".

But "pm_restrict_gfp_mask" is the call that forbids further swapping. I.e.
we can still swap until the "pm_restrict_gfp_mask" call!

Thus "notify_swap_device" should be moved after the "pm_restrict_gfp_mask" call.

But then I thought about more complex storage hierarchies. For example,
swap on top of some dm device on top of a loop device on top of some filesystem
on top of some other dm device, etc.

If we have such a hierarchy, then hibernating dm devices should be interleaved
with freezing of filesystems, which happens in the "filesystems_freeze" call.

But the "filesystems_freeze" call is located before the "pm_restrict_gfp_mask"
call, so here we get a contradiction.

In other words, we should satisfy these 3 things at the same time:

- Hibernating of dm devices should happen inside the "filesystems_freeze" call,
intermixed with freezing of filesystems
- Hibernating of dm devices should happen after the "pm_restrict_gfp_mask" call
- "pm_restrict_gfp_mask" is located after the "filesystems_freeze" call in the
current kernel

These 3 points obviously contradict each other.

So at this point I gave up.

The only remaining solution (as I thought at that time) was to move
"filesystems_freeze" after the "pm_restrict_gfp_mask" call (or to move
"pm_restrict_gfp_mask" before "filesystems_freeze").

But:
- Freezing a filesystem might require memory. It is a bad idea to call
"filesystems_freeze" after we forbid swapping
- This would be a pretty big change to the kernel. I'm not sure that my
small use case justifies such a change

So at this point I totally gave up.


= Why now I think your patch is good =

But then I found this email of yours:
https://lore.kernel.org/all/3f3d871a-6a86-354f-f83d-a871793a4a47@redhat.com/ .

And now I see that complex hierarchies, such as the one described above, are
not supported anyway!

This fully ruins my argument above.

And this means that your patch in fact works!


= Needed changes =

Please move "notify_swap_device" after "pm_restrict_gfp_mask".

Also: you introduced a new operation in target_type: hibernate.
I'm not sure we need this operation; we already have presuspend
and postsuspend. In my personal hacky patch I simply added
"dm_bufio_client_reset" to the end of "dm_integrity_postsuspend",
and it worked. But I'm not sure about this point, i.e. if
you think that we need "hibernate", then go with it.


-- 
Askar Safin

^ permalink raw reply

* Re: [RFC PATCH 2/2] swsusp: make it possible to hibernate to device mapper devices
From: Askar Safin @ 2025-12-23  6:33 UTC (permalink / raw)
  To: gmazyland
  Cc: Dell.Client.Kernel, dm-devel, linux-block, linux-btrfs,
	linux-crypto, linux-lvm, linux-mm, linux-pm, linux-raid,
	lvm-devel, mpatocka, pavel, rafael, safinaskar
In-Reply-To: <86300955-72e4-42d5-892d-f49bdf14441e@gmail.com>

Milan Broz <gmazyland@gmail.com>:
> Anyway, my understanding is that all device-mapper targets use mempools,
> which should ensure that they can process even under memory pressure.

Okay, I just read some more code and docs.

dm-integrity fortunately uses bufio for checksums only.

And bufio allocates memory without __GFP_IO (thus allocation should not
lead to recursion). And bufio claims that "dm-bufio is resistant to allocation failures":
https://elixir.bootlin.com/linux/v6.19-rc2/source/drivers/md/dm-bufio.c#L1603 .

This still seems to be fragile.

So I will change mode to 'D' and hope for the best. :)

-- 
Askar Safin

^ permalink raw reply

* Re: [RFC PATCH 2/2] swsusp: make it possible to hibernate to device mapper devices
From: Askar Safin @ 2025-12-23  5:29 UTC (permalink / raw)
  To: gmazyland
  Cc: Dell.Client.Kernel, dm-devel, linux-block, linux-btrfs,
	linux-crypto, linux-lvm, linux-mm, linux-pm, linux-raid,
	lvm-devel, mpatocka, pavel, rafael
In-Reply-To: <86300955-72e4-42d5-892d-f49bdf14441e@gmail.com>

Milan Broz <gmazyland@gmail.com>:
> Anyway, my understanding is that all device-mapper targets use mempools,
> which should ensure that they can process even under memory pressure.

I have used journal mode so far, but, as far as I understand, direct mode is
okay for my use case.

Okay, I spent some time carefully reading the dm-integrity source code.

I read v6.12.48, because this is the kernel I use.

And I conclude that dm-integrity code never allocates (not even from a mempool)...
...in main code paths (as opposed to initialization code paths)...
...in direct ('D') mode...
...if I/O doesn't fail and checksums match.

(As I said in my previous letter, mempools are bad too, as far as I understand.)

I found exactly one place where we seem to allocate in the main code path:
https://elixir.bootlin.com/linux/v6.12.48/source/drivers/md/dm-integrity.c#L1789
(i.e. these two kmalloc's).

But I think this is okay, because:
- we pass GFP_NOIO, so, as far as I understand, this should not lead to
recursion
- we pass __GFP_NORETRY, so, as far as I understand, we will not block in
this kmalloc for too long
- we gracefully handle possible failure

Another strange place I found is this:
https://elixir.bootlin.com/linux/v6.12.48/source/drivers/md/dm-integrity.c#L1704 .

But I think this is okay, because:
- integrity_recheck is only ever called from here:
https://elixir.bootlin.com/linux/v6.12.48/source/drivers/md/dm-integrity.c#L1857
- that integrity_recheck call only ever happens if dm_integrity_rw_tag failed
- as far as I understand, dm_integrity_rw_tag can only fail if we got an actual
I/O error or checksum mismatch

So, this mempool_alloc call is okay for my use case.

So: in 'D' mode everything should be okay for my use case.

Another note: I used a very crude way to find functions that allocate:
if a function has "alloc" in its name, then I consider it allocating. :)
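
A name-based search of that kind can be sketched like this. It is only a
heuristic: `list_alloc_calls` is a made-up helper name, the pattern also
matches function definitions, and it misses anything that allocates without
"alloc" in its name.

```shell
# list_alloc_calls FILE: count identifiers containing "alloc" that are
# followed by an opening parenthesis, most frequent first.
list_alloc_calls() {
    grep -oE '[A-Za-z0-9_]*alloc[A-Za-z0-9_]*[[:space:]]*\(' "$1" |
        sed 's/[[:space:]]*($//' |
        sort | uniq -c | sort -rn
}

# Usage, from the top of a kernel tree:
#   list_alloc_calls drivers/md/dm-integrity.c
```

Each hit still has to be checked by hand to see whether it sits in an
initialization path or in the I/O path.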

And a final note: there is an elephant in the room: bufio.

As far as I understand, when pages are swapped out in my use case, they first
go to the dm-integrity bufio cache, and only after that do they actually hit
the disk.

This, of course, defeats the whole purpose of swap.

And it can possibly lead to deadlocks.

Is there a way to disable bufio?

Or maybe bufio is used for checksums and metadata only?

-- 
Askar Safin

^ permalink raw reply

* Re: [RFC PATCH 2/2] swsusp: make it possible to hibernate to device mapper devices
From: Askar Safin @ 2025-12-23  1:41 UTC (permalink / raw)
  To: gmazyland
  Cc: Dell.Client.Kernel, dm-devel, linux-block, linux-btrfs,
	linux-crypto, linux-lvm, linux-mm, linux-pm, linux-raid,
	lvm-devel, mpatocka, pavel, rafael
In-Reply-To: <86300955-72e4-42d5-892d-f49bdf14441e@gmail.com>

Milan Broz <gmazyland@gmail.com>:
> Anyway, my understanding is that all device-mapper targets use mempools,
> which should ensure that they can process even under memory pressure.

Also, I don't understand how mempools help here.

As far as I understand, an allocation from a mempool is still a real allocation
if the mempool's own reserve is exhausted.

-- 
Askar Safin

^ permalink raw reply

* Re: [RFC PATCH 2/2] swsusp: make it possible to hibernate to device mapper devices
From: Askar Safin @ 2025-12-22 22:24 UTC (permalink / raw)
  To: gmazyland
  Cc: Dell.Client.Kernel, dm-devel, linux-block, linux-btrfs,
	linux-crypto, linux-lvm, linux-mm, linux-pm, linux-raid,
	lvm-devel, mpatocka, pavel, rafael
In-Reply-To: <86300955-72e4-42d5-892d-f49bdf14441e@gmail.com>

Milan Broz <gmazyland@gmail.com>:
> Anyway, my understanding is that all device-mapper targets use mempools,
> which should ensure that they can process even under memory pressure.

Let me give you more details.

Here is output of "free -h":

               total        used        free      shared  buff/cache   available
Mem:            62Gi        47Gi       924Mi       2.4Gi        17Gi        14Gi
Swap:          378Gi        95Gi       282Gi

Swap is located on dm-integrity on a real partition.

As you can see, my data does not fit into physical memory, so swap is required
here.

But swap is big, so in theory allocations should always work.

I have a lot of Chromium windows open (nearly 200).

My laptop is a Dell Precision 7780. It is a high-spec, expensive laptop.
I have 64 GiB of ECC physical memory and btrfs raid on two 3.5 TiB partitions.
Everything is located on two 4 TiB NVMe SSD physical disks.

Sometimes the whole system freezes for several minutes when I open new
memory-hungry Chromium tabs. In such cases I see in the logs:

https://zerobin.net/?383b5c32b958aca8#yXmgYidkC8pUFixwQKB+v+O3bkbis4RHduz3gji4DxI=

Notice that all backtraces contain shmem_swapin_folio, so swap is involved here.

Hibernation works thanks to my patch
https://zerobin.net/?ad6142bd67df015a#68Az6yBUxHA3AXB7jY1+clSRnR745olFHAByxwPGM08= .

My kernel is 6.12.48 from Debian with my local patches.

Sometimes I see "page allocation failure" messages in my logs. This is very
strange: as I already explained above, there is plenty of space in swap.

Here is output of "journalctl | grep -B 10 -A 100 'page allocation failure'":

https://zerobin.net/?4170949dd9a8b25c#p5Z73TfGgpem4O4UsiWllrMCLCoHzDEw+KwJ7n8LWPA=

Maybe my swap is fragmented?

In those logs I notice that:
- Allocation failures often happen immediately after waking up from hibernation
or suspend
- We try to allocate a page of order 4 (what does this mean? 2^4 pages?)
- The GFP mask is "GFP_KERNEL|__GFP_COMP" or "GFP_NOIO|__GFP_COMP". Failure to
allocate in the "GFP_NOIO|__GFP_COMP" case is somewhat understandable. But
what about "GFP_KERNEL|__GFP_COMP"? As far as I understand, we are allowed
to do I/O, so we can drop everything to swap. And swap is big. So why do we
fail?
- In all backtraces "dell_smbios_call" is involved
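
On the order question: an order-n allocation is a request for 2^n physically
contiguous pages. A small worked example, assuming 4 KiB pages (the usual
x86_64 page size; `order_bytes` is just an illustrative helper name):

```shell
# order_bytes ORDER [PAGE_SIZE]: bytes requested by an order-ORDER allocation
order_bytes() {
    order=$1; page_size=${2:-4096}
    echo $(( (1 << order) * page_size ))
}

order_bytes 4   # 16 pages of 4 KiB: 65536 bytes (64 KiB)
```

Because the 16 pages have to be contiguous in physical memory, such a
request may fail on fragmented memory even when there is plenty of free
swap.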

Hibernation always works, but takes a lot of time. Usually several minutes.
When hibernating, I see in logs this:

Dec 20 10:02:18 comp kernel: PM: hibernation: Allocated 26015132 kbytes in 193.21 seconds (134.64 MB/s)

I.e. 3 minutes to allocate space in memory for the hibernation image.

And sometimes even this:

Dec 11 08:34:26 comp kernel: PM: hibernation: Allocated 25942484 kbytes in 348.90 seconds (74.35 MB/s)
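
The MB/s figure in these lines is simply kbytes divided by seconds. The
sketch below reproduces both values with truncating integer arithmetic; that
the figure is derived exactly this way is an assumption, and `show_speed` is
a made-up helper name. It needs a shell with 64-bit arithmetic.

```shell
# show_speed KBYTES CENTISECS: print the throughput as in the log lines
show_speed() {
    kbytes=$1; centisecs=$2
    kps=$(( kbytes * 100 / centisecs ))   # kB per second, truncated
    printf '%d.%02d MB/s\n' $(( kps / 1000 )) $(( (kps % 1000) / 10 ))
}

show_speed 26015132 19321   # 193.21 s: prints 134.64 MB/s
show_speed 25942484 34890   # 348.90 s: prints 74.35 MB/s
```

In both cases the image is about 26 million kbytes, i.e. roughly 24.8 GiB;
only the elapsed time differs.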

Also, sometimes I notice that in the browser the background of one site is
replaced with a black rectangle. So I assume that the browser failed to
allocate something too, but I am unable to find this in the logs.

> Anyway, my understanding is that all device-mapper targets use mempools,
> which should ensure that they can process even under memory pressure.

This does not seem to be true. I see many occurrences of "alloc" in the
dm-integrity code:

$ grep alloc drivers/md/dm-integrity.c

And it seems that allocation happens not only during initialization,
but also during normal operation (but I haven't looked at the code carefully).
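
A plain grep for "alloc" also matches comments and unrelated identifiers. As a
rough refinement, the sketch below lists only lines that call a few well-known
kernel allocators and shows any GFP flags on the same line. The list of
allocator names is an assumption about what is worth checking, not a complete
inventory, and the script cannot tell whether a call sits on the I/O path or
only runs at setup time.

```python
import re

# Assumption: a hand-picked set of common kernel allocation functions.
ALLOC_RE = re.compile(
    r"\b(kmalloc_array|kmalloc|kzalloc|kcalloc|kvmalloc|kvzalloc|vmalloc|"
    r"mempool_alloc|kmem_cache_alloc|alloc_pages?|__get_free_pages?)\s*\("
)
# GFP_* and __GFP_* flag names.
GFP_RE = re.compile(r"\b(?:__)?GFP_[A-Z_]+\b")


def find_alloc_sites(source_text):
    """Return (line_number, allocator, [gfp flags]) for each matching line.

    Only the first allocator call on a line is reported, and flags that are
    wrapped onto a continuation line are missed.
    """
    sites = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        m = ALLOC_RE.search(line)
        if m:
            sites.append((lineno, m.group(1), GFP_RE.findall(line)))
    return sites


if __name__ == "__main__":
    # Made-up C fragment for illustration, not copied from dm-integrity.c.
    sample = (
        "buf = kmalloc(len, GFP_NOIO);\n"
        "x = 1;\n"
        "p = mempool_alloc(pool, GFP_NOIO | __GFP_NOWARN);\n"
    )
    for site in find_alloc_sites(sample):
        print(site)
```

To use it on the real file, read drivers/md/dm-integrity.c and pass its
contents to find_alloc_sites(). Calls that go through a mempool, or that use
GFP_NOIO, behave differently under memory pressure than plain GFP_KERNEL
allocations, so the flags column is the part worth reading.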

Also, I see a lot of mentions of bufio in the dm-integrity code.
As far as I understand, this is some kind of cache layer. But, as far as I
understand, in my case there should not be any caches; everything should be
written directly to the partition.

So, how do I debug this further?

Maybe there are some ioctls, etc., to avoid these problems or to enable
more verbose logging?

I am even okay with inserting some printfs into the kernel code, just send me
a patch.

-- 
Askar Safin


* Re: [RFC PATCH 2/2] swsusp: make it possible to hibernate to device mapper devices
From: Milan Broz @ 2025-12-22 15:03 UTC (permalink / raw)
  To: Askar Safin, mpatocka
  Cc: Dell.Client.Kernel, dm-devel, linux-block, linux-btrfs,
	linux-crypto, linux-lvm, linux-mm, linux-pm, linux-raid,
	lvm-devel, pavel, rafael
In-Reply-To: <20251217231837.157443-1-safinaskar@gmail.com>

On 12/18/25 12:18 AM, Askar Safin wrote:
> Mikulas Patocka <mpatocka@redhat.com>:
>> Askar Safin requires swap and hibernation on the dm-integrity device mapper
>> target because he needs to protect his data.
> 
> Hi, Mikulas, Milan and others.
> 
> I've been running swap on dm-integrity for 40 days.
> 
> It runs mostly without problems.
> 
> But yesterday my screen froze for 4 minutes and then continued to work
> normally.
> 
> So, may I ask again a question: is swap on dm-integrity supposed to work
> at all? (I. e. swap partition on top of dm-integrity partition on top of
> actual disk partition.) (I'm talking about swap here, not about hibernation.)

Hi,

I am not sure if Mikulas is available; maybe it's better to try again
in January...

Anyway, my understanding is that all device-mapper targets use mempools,
which should ensure that they can process even under memory pressure.

AFAIK, swap over a device-mapper target (any target!) with a real block device
should be ok. The problematic part is stacking over a filesystem (through a loop)
as Mikulas mentioned.

If I interpret Mikulas' answer correctly, it is the filesystem that could
allocate memory here, and it deadlocks because of it (as it is swap itself).
So I believe it can happen with other DM targets too.
(If I am mistaken, please correct me.)

I wish it could work, but I no longer understand the kernel details here.
It seems we still have the "little walled gardens" communication issues
among various kernel subsystems, as one of the former maintainers said :-)

But you asked about a real block device, so it should work.
I guess it is just another bug you see...

Milan
> 
> Mikulas Patocka said here https://lore.kernel.org/all/3f3d871a-6a86-354f-f83d-a871793a4a47@redhat.com/ :
> 
>> Encrypted swap file is not supposed to work. It uses the loop device that
>> routes the requests to a filesystem and the filesystem needs to allocate
>> memory to process requests.
> 
>> So, this is what happened to you - the machine runs out of memory, it
>> needs to swap out some pages, dm-crypt encrypts the pages and generates
>> write bios, the write bios are directed to the loop device, the loop
>> device directs them to the filesystem, the filesystem attempts to allocate
>> more memory => deadlock.
> 
> Does the same apply to dm-integrity?
> 
> I.e. is it possible that a write to dm-integrity will lead to an allocation?
> 



