* Help diagnosing lvm ioctl log spam
@ 2024-10-09 9:28 Fabricio Winter
2024-10-09 11:45 ` Zdenek Kabelac
2024-10-09 14:52 ` David Teigland
0 siblings, 2 replies; 5+ messages in thread
From: Fabricio Winter @ 2024-10-09 9:28 UTC (permalink / raw)
To: linux-lvm
Hello people, we have been experiencing an issue with lvm2-thin on
_some_ of our production servers where out of nowhere
lvm2/device-mapper starts spamming error logs and I can't really seem
to trace down the root cause.
This is what the logs look like;
Oct 9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
failed: Inappropriate ioctl for device
Oct 9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
Inappropriate ioctl for device
It writes this to rsyslog at speeds so fast not even tail can keep up
over ssh. I'm really lost as to the reason or how to trace this back
to a process. We use lvm-thin to host virtual machines via libvirt, so
several thin disks under a thin pool one for each virtual machine. The
lvm2 pool is created on top of a vg that is on top of a md-raid1
device.
/dev/md127:
Version : 1.2
Creation Time : Wed May 5 16:56:09 2021
Raid Level : raid1
Array Size : 1953382464 (1862.89 GiB 2000.26 GB)
Used Dev Size : 1953382464 (1862.89 GiB 2000.26 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Oct 9 07:31:44 2024
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : U5bW8JT7:1 (local to host U5bW8JT7)
UUID : 5072120c:b114a0aa:51438cf8:8ebe58b8
Events : 70525
Number Major Minor RaidDevice State
0 259 0 0 active sync /dev/nvme0n1
1 259 1 1 active sync /dev/nvme1n1
I do not know what this
`CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool`
disk is on device mapper, I'm unable to find it, however the tpool
suffix suggests it's the thin pool so here is what lvdisplay says
about it
--- Logical volume ---
LV Name lvol1
VG Name lightning-nvme
LV UUID mOowip-TAFu-M7hq-zHz6-pRVv-UaNO-9FGzeq
LV Write Access read/write
LV Creation host, time U5bW8JT7, 2021-05-05 16:57:13 +0000
LV Pool metadata lvol1_tmeta
LV Pool data lvol1_tdata
LV Status available
# open 72
LV Size <1.80 TiB
Allocated pool data 81.30%
Allocated metadata 49.28%
Current LE 471040
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:2
The start of the device name suggests it's part of the vg so here's
also what vgdisplay says:
--- Volume group ---
VG Name lightning-nvme
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 52521
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 72
Open LV 70
Max PV 0
Cur PV 1
Act PV 1
VG Size <1.82 TiB
PE Size 4.00 MiB
Total PE 476899
Alloc PE / Size 471098 / <1.80 TiB
Free PE / Size 5801 / 22.66 GiB
VG UUID CP5Gw8-QrWL-qwhB-cJL8-7R1m-c9Q9-KTBtQQ
Any tips on how to trace this back to a process or a cause or a bug or
anything are highly appreciated. I literally only have these two log
lines to go by, everything else is working as expected until the
server crashes from running out of disk space from these logs.
Restarting the node fixes the issue/log spam until it starts happening
again randomly after several days.
Runnig Ubuntu 18.04.5 LTS - lvm2 2.02.176-4.1ubuntu3.18.04.3
-
--Fabricio Winter
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Help diagnosing lvm ioctl log spam
2024-10-09 9:28 Help diagnosing lvm ioctl log spam Fabricio Winter
@ 2024-10-09 11:45 ` Zdenek Kabelac
2024-10-09 14:52 ` David Teigland
1 sibling, 0 replies; 5+ messages in thread
From: Zdenek Kabelac @ 2024-10-09 11:45 UTC (permalink / raw)
To: Fabricio Winter, linux-lvm
Dne 09. 10. 24 v 11:28 Fabricio Winter napsal(a):
> Hello people, we have been experiencing an issue with lvm2-thin on
> _some_ of our production servers where out of nowhere
> lvm2/device-mapper starts spamming error logs and I can't really seem
> to trace down the root cause.
>
> This is what the logs look like;
> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
> LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
> failed: Inappropriate ioctl for device
> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
> Inappropriate ioctl for device
Hi
Have you considered to try using kernel & lvm2 from the more modern age where
the 'nvme' storage actually become a regularly available commodity hw ?
I'd have guess your issue likely already has been solved.
Otherwise if you do want to do a hunt for a particular IO - use 'blktrace'
tool - likely on tpool_tdata (or tpool_tmeta) volume.
Regards
Zdenek
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Help diagnosing lvm ioctl log spam
2024-10-09 9:28 Help diagnosing lvm ioctl log spam Fabricio Winter
2024-10-09 11:45 ` Zdenek Kabelac
@ 2024-10-09 14:52 ` David Teigland
2024-10-10 0:21 ` Erwin van Londen
1 sibling, 1 reply; 5+ messages in thread
From: David Teigland @ 2024-10-09 14:52 UTC (permalink / raw)
To: Fabricio Winter; +Cc: linux-lvm
On Wed, Oct 09, 2024 at 06:28:26AM -0300, Fabricio Winter wrote:
> Hello people, we have been experiencing an issue with lvm2-thin on
> _some_ of our production servers where out of nowhere
> lvm2/device-mapper starts spamming error logs and I can't really seem
> to trace down the root cause.
>
> This is what the logs look like;
> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
> LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
> failed: Inappropriate ioctl for device
> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
> Inappropriate ioctl for device
It appears related to dmeventd monitoring the thin pools, and the kernel
returning ENOTTY when dmeventd does the DM_DEV_WAIT ioctl. Maybe there's
a fast retry loop in dmeventd on that error case rather than quitting.
I wonder if there's a way you could kill dmeventd when this happens.
Dave
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Help diagnosing lvm ioctl log spam
2024-10-09 14:52 ` David Teigland
@ 2024-10-10 0:21 ` Erwin van Londen
2024-10-10 2:44 ` Fabricio Winter
0 siblings, 1 reply; 5+ messages in thread
From: Erwin van Londen @ 2024-10-10 0:21 UTC (permalink / raw)
To: David Teigland, Fabricio Winter; +Cc: linux-lvm
[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]
Could it not be that, given the fact this is a volume carved out of a
thin pool, the underlying free space threshold is near or at capacity
and the the device mapper is more busy doing garbage collection than
actual application IO's. I've seen many nightmares with thin provisioned
volumes if this is not managed properly.
Have you used the "fstrim" command or is the filesystem mounted with the
"-o discard" option?
Check some setting with, for example, "lvs -o
lv_full_name,lv_health_status,lv_when_full". If it is a space issue make
sure that you have enough free space in your volume groups and configure
the "thin_pool_auto_extend_threshold" and
"thin_pool_auto_extend_percentage". Also turn on "monitoring" if that is
not enabled by default.
Another thing that could be an issue is that the filesystem has the "-o
discard" mount option set, device mapper wants to propagate this to the
underlying hardware (or in your case a hypervisor which may also send
this through to the underlying storage array), and that is not supported
on that hardware. The "Inappropriate ioctl for device" message hints in
that direction. Has there been movements of volumes to other provisioned
disks or changes on the hypervisor?
Cheers
Erwin
On 10/10/24 00:52, David Teigland wrote:
> On Wed, Oct 09, 2024 at 06:28:26AM -0300, Fabricio Winter wrote:
>> Hello people, we have been experiencing an issue with lvm2-thin on
>> _some_ of our production servers where out of nowhere
>> lvm2/device-mapper starts spamming error logs and I can't really seem
>> to trace down the root cause.
>>
>> This is what the logs look like;
>> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
>> LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
>> failed: Inappropriate ioctl for device
>> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
>> Inappropriate ioctl for device
> It appears related to dmeventd monitoring the thin pools, and the kernel
> returning ENOTTY when dmeventd does the DM_DEV_WAIT ioctl. Maybe there's
> a fast retry loop in dmeventd on that error case rather than quitting.
> I wonder if there's a way you could kill dmeventd when this happens.
>
> Dave
>
>
[-- Attachment #2: OpenPGP_signature.asc --]
[-- Type: application/pgp-signature, Size: 677 bytes --]
[-- Attachment #3: erwin.vcf --]
[-- Type: text/vcard, Size: 172 bytes --]
BEGIN:VCARD
VERSION:4.0
N:van Londen;Erwin;;;
FN:Erwin van Londen
EMAIL;PREF=1;TYPE=work:erwin@erwinvanlonden.net
URL;TYPE=work:https://erwinvanlonden.net
END:VCARD
[-- Attachment #4: OpenPGP_0x985B90929D90E282.asc --]
[-- Type: application/pgp-keys, Size: 2511 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Help diagnosing lvm ioctl log spam
2024-10-10 0:21 ` Erwin van Londen
@ 2024-10-10 2:44 ` Fabricio Winter
0 siblings, 0 replies; 5+ messages in thread
From: Fabricio Winter @ 2024-10-10 2:44 UTC (permalink / raw)
To: Erwin van Londen; +Cc: David Teigland, linux-lvm
Interesting suggestions here, thanks everyone!
I inspected the libvirt config and indeed the virtual machines are not
configured properly to propagate trim (unless the libvirt default is
unmap, but I'm pretty sure it's ignore), this however is also the case
on the newer servers which have not seen this issue (yet?), I wonder
if this could be the ultimate underlying cause (eg: after a few years
of not running trim, slowed down by the eventual rotation of the lvm
disks as vms are deleted and created). I'll be addressing this asap of
course, however the underlying devices do support trim (they are nvmes
after all) so I believe if anything ended up calling trim it would not
cause an ioctl error as the operation would be supported by the
device.
As for dmeventd, that seems to have been the cause! I cleaned up
rsyslog, did a tail on the log and gave dmeventd a restart (dmeventd
-R) and the errors immediately stopped. Is dm-event merely a
monitoring/alert daemon or does it have any impact on functionality
(mostly wondering if restarting this service like this is a "safe"
operation), and lastly, what could cause it to get into this state? I
can understand a fail condition with a fast retry loop but what could
cause a disk to get into this failing condition in the first place -
what kind of operation was it doing that suddenly stopped working,
since no IO issues were happening on the pool (at least as far as I
can tell - disk reads/writes are working just fine). Is this condition
something that would warrant being reported as a bug, or at least the
fast-fail-loop?
Once again, thanks everyone!
-Fabricio Winter
On Wed, Oct 9, 2024 at 9:21 PM Erwin van Londen
<erwin@erwinvanlonden.net> wrote:
>
> Could it not be that, given the fact this is a volume carved out of a
> thin pool, the underlying free space threshold is near or at capacity
> and the the device mapper is more busy doing garbage collection than
> actual application IO's. I've seen many nightmares with thin provisioned
> volumes if this is not managed properly.
>
> Have you used the "fstrim" command or is the filesystem mounted with the
> "-o discard" option?
>
> Check some setting with, for example, "lvs -o
> lv_full_name,lv_health_status,lv_when_full". If it is a space issue make
> sure that you have enough free space in your volume groups and configure
> the "thin_pool_auto_extend_threshold" and
> "thin_pool_auto_extend_percentage". Also turn on "monitoring" if that is
> not enabled by default.
>
> Another thing that could be an issue is that the filesystem has the "-o
> discard" mount option set, device mapper wants to propagate this to the
> underlying hardware (or in your case a hypervisor which may also send
> this through to the underlying storage array), and that is not supported
> on that hardware. The "Inappropriate ioctl for device" message hints in
> that direction. Has there been movements of volumes to other provisioned
> disks or changes on the hypervisor?
>
> Cheers
> Erwin
>
> On 10/10/24 00:52, David Teigland wrote:
> > On Wed, Oct 09, 2024 at 06:28:26AM -0300, Fabricio Winter wrote:
> >> Hello people, we have been experiencing an issue with lvm2-thin on
> >> _some_ of our production servers where out of nowhere
> >> lvm2/device-mapper starts spamming error logs and I can't really seem
> >> to trace down the root cause.
> >>
> >> This is what the logs look like;
> >> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
> >> LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
> >> failed: Inappropriate ioctl for device
> >> Oct 9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
> >> Inappropriate ioctl for device
> > It appears related to dmeventd monitoring the thin pools, and the kernel
> > returning ENOTTY when dmeventd does the DM_DEV_WAIT ioctl. Maybe there's
> > a fast retry loop in dmeventd on that error case rather than quitting.
> > I wonder if there's a way you could kill dmeventd when this happens.
> >
> > Dave
> >
> >
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-10-10 2:44 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-09 9:28 Help diagnosing lvm ioctl log spam Fabricio Winter
2024-10-09 11:45 ` Zdenek Kabelac
2024-10-09 14:52 ` David Teigland
2024-10-10 0:21 ` Erwin van Londen
2024-10-10 2:44 ` Fabricio Winter
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.