linux-lvm.redhat.com archive mirror
* [linux-lvm] Why LVM metadata locations are not properly aligned
@ 2016-04-21  4:08 Ming-Hung Tsai
  2016-04-21  9:54 ` Zdenek Kabelac
  2016-04-21 10:11 ` Alasdair G Kergon
  0 siblings, 2 replies; 6+ messages in thread
From: Ming-Hung Tsai @ 2016-04-21  4:08 UTC (permalink / raw)
  To: LVM general discussion and development

Hi,

I'm trying to find ways to accelerate LVM metadata IO, in order to take
lvm-thin snapshots in a very short time. My scenario is connecting lvm-thin
volumes to a Windows host, then taking snapshots of those volumes for
Windows VSS (Volume Shadow Copy Service). Since Windows VSS can only
suspend IO for 10 seconds, LVM has to finish taking snapshots within 10 seconds.

However, it's hard to achieve that if the PV is busy running IO. The major
overhead is LVM metadata IO. There are some issues:

1. The metadata locations (raw_locn::offset) are not properly aligned.
   Function _aligned_io() requires the IO to be logical-block aligned,
   but the metadata locations returned by next_rlocn_offset() are only
   512-byte aligned. If a device's logical block size is greater than 512
   bytes, LVM needs to use a bounce buffer to do the IO.
   How about rounding raw_locn::offset up to the logical-block boundary?
   (or to max(logical_block_size, physical_block_size) for 512-byte logical /
    4KB physical-block drives?)
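
   For illustration, the block sizes involved can be checked from userspace
   (the device name below is just an example):

     blockdev --getss /dev/sdb    # logical block size, e.g. 4096 on a 4Kn disk
     blockdev --getpbsz /dev/sdb  # physical block size
     # An offset that is only 512-byte aligned (say 5632 = 11*512) is not a
     # multiple of 4096, so _aligned_io() has to fall back to a bounce buffer.
     # Rounding up to the logical block size would avoid that:
     #   aligned_offset = ((offset + lbs - 1) / lbs) * lbs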

2. In most cases, the memory buffers passed to dev_read() and dev_write()
   are not aligned either (e.g., raw_read_mda_header(), _find_vg_rlocn()).

3. Why does LVM use such a complex process to update metadata?
   There are three operations to update metadata: write, pre-commit, then
   commit. Each operation requires one header read (raw_read_mda_header()),
   one metadata check (_find_vg_rlocn()), and a metadata update via a bounce
   buffer. So we need at least 9 reads and 3 writes for one PV.
   Could we simplify that?

4. Commits fb003cdf & a3686986 cause an additional metadata read.
   Could we improve that? (We had already checked the metadata in
   _find_vg_rlocn().)

5. Feature request: could we take multiple snapshots in a batch, to reduce
   the number of metadata IO operations?
   e.g., lvcreate vg1/lv1 vg1/lv2 vg1/lv3 --snapshot
   (I know that it would be trouble for the --addtag options...)

   This post mentioned that lvresize will support resizing multiple volumes,
   but I think that taking multiple snapshots is also helpful.
   https://www.redhat.com/archives/linux-lvm/2016-February/msg00023.html
   > There is also some ongoing work on better lvresize support for more then 1
   > single LV. This will also implement better approach to resize of lvmetad
   > which is using different mechanism in kernel.

   Possible IOCTL sequence:
     dm-suspend origin0
     dm-message create_snap 3 0
     dm-message set_transaction_id 3 4
     dm-resume origin0
     dm-suspend origin1
     dm-message create_snap 4 1
     dm-message set_transaction_id 4 5
     dm-resume origin1
     dm-suspend origin2
     dm-message create_snap 5 2
     dm-message set_transaction_id 5 6
     dm-resume origin2
     ...
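
   The same sequence expressed as dmsetup commands would look roughly like
   this (pool/volume names and ids are only placeholders; for lvm2-created
   pools the thin-pool target is the <vg>-<pool>-tpool device):

     dmsetup suspend vg1-origin0
     dmsetup message vg1-pool-tpool 0 "create_snap 3 0"
     dmsetup message vg1-pool-tpool 0 "set_transaction_id 3 4"
     dmsetup resume vg1-origin0
     dmsetup suspend vg1-origin1
     dmsetup message vg1-pool-tpool 0 "create_snap 4 1"
     dmsetup message vg1-pool-tpool 0 "set_transaction_id 4 5"
     dmsetup resume vg1-origin1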

6. Is there any other way to accelerate LVM operations? I have enabled
   lvmetad, and set global_filter and md_component_detection=0 in lvm.conf.


Thanks,
Ming-Hung Tsai


* Re: [linux-lvm] Why LVM metadata locations are not properly aligned
  2016-04-21  4:08 [linux-lvm] Why LVM metadata locations are not properly aligned Ming-Hung Tsai
@ 2016-04-21  9:54 ` Zdenek Kabelac
  2016-04-21 13:22   ` Zdenek Kabelac
  2016-04-22  8:43   ` Ming-Hung Tsai
  2016-04-21 10:11 ` Alasdair G Kergon
  1 sibling, 2 replies; 6+ messages in thread
From: Zdenek Kabelac @ 2016-04-21  9:54 UTC (permalink / raw)
  To: linux-lvm

On 21.4.2016 06:08, Ming-Hung Tsai wrote:
> Hi,
>
> I'm trying to find ways to accelerate LVM metadata IO, in order to take
> lvm-thin snapshots in a very short time. My scenario is connecting lvm-thin
> volumes to a Windows host, then taking snapshots of those volumes for
> Windows VSS (Volume Shadow Copy Service). Since Windows VSS can only
> suspend IO for 10 seconds, LVM has to finish taking snapshots within 10 seconds.
>

Hmm, do you observe that taking a snapshot takes more than a second?
IMHO the largest portion of the time should be the 'disk' synchronization
when suspending (full flush and fs sync).
Unless you have lvm2 metadata in the range of MiB (and lvm2 was not designed
for that) - you should be well below a second...

> However, it's hard to achieve that if the PV is busy running IO. The major

Changing disk scheduler to deadline ?
Lowering percentage of dirty-pages ?
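
For example (the device name and values are only illustrative, not
recommendations):

  echo deadline > /sys/block/sdb/queue/scheduler   # per-device IO scheduler
  sysctl -w vm.dirty_ratio=10                      # cap dirty pages at 10% of RAM
  sysctl -w vm.dirty_background_ratio=5            # start background writeback earlier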


> overhead is LVM metadata IO. There are some issues:

While your questions are valid points for discussion - you will save a couple
of disk reads - but this will not help much with your timing problem if you
have an overloaded disk I/O system.
Note lvm2 is using direct I/O, which I guess is your trouble maker here...

>
> 1. The metadata locations (raw_locn::offset) are not properly aligned.
>     Function _aligned_io() requires the IO to be logical-block aligned,
>     but the metadata locations returned by next_rlocn_offset() are only
>     512-byte aligned. If a device's logical block size is greater than 512
>     bytes, LVM needs to use a bounce buffer to do the IO.
>     How about rounding raw_locn::offset up to the logical-block boundary?
>     (or to max(logical_block_size, physical_block_size) for 512-byte logical /
>      4KB physical-block drives?)

This looks like a bug - lvm2 should always start writing metadata at a
physical-block-aligned position.


> 2. In most cases, the memory buffers passed to dev_read() and dev_write()
>     are not aligned either (e.g., raw_read_mda_header(), _find_vg_rlocn()).
>
> 3. Why does LVM use such a complex process to update metadata?
>     There are three operations to update metadata: write, pre-commit, then
>     commit. Each operation requires one header read (raw_read_mda_header()),
>     one metadata check (_find_vg_rlocn()), and a metadata update via a bounce
>     buffer. So we need at least 9 reads and 3 writes for one PV.
>     Could we simplify that?

It's been simplified once already ;) and we have lost quite an important
property: validation of the written data during pre-commit - which is quite
useful when a user is running on a misconfigured multipath device...

Each state has its logic, and with each state we need to be sure the data are
there.  This doesn't sound like a problem with a single PV - but in a server
world with many different kinds of misconfiguration and failing devices it may
be more important than you might think.

A valid idea might be to support a 'riskier' variant of the metadata update,
where lvm2 might skip some of the disk safety checking, but that may not catch
all the associated trouble - thus you may run for days with a dm table you
will then not find in your lvm2 metadata...


>
> 4. Commits fb003cdf & a3686986 cause an additional metadata read.
>     Could we improve that? (We had already checked the metadata in
>     _find_vg_rlocn().)

The fight against disk corruption and duplicates is a major topic in lvm2...
But ATM we are fishing for bigger fish :)
So yes, these optimizations are in the queue - but not as a top priority.

>
> 5. Feature request: could we take multiple snapshots in a batch, to reduce
>     the number of metadata IO operations?
>     e.g., lvcreate vg1/lv1 vg1/lv2 vg1/lv3 --snapshot
>     (I know that it would be trouble for the --addtag options...)

Yes, another already existing and planned RFE - support for atomic snapshots
of multiple devices at once - is in the queue.

>
>     This post mentioned that lvresize will support resizing multiple volumes,

It's not about resizing multiple volumes with one command,
it's about resizing data & metadata in one command via policy, more correctly.

>     but I think that taking multiple snapshots is also helpful.
>     https://www.redhat.com/archives/linux-lvm/2016-February/msg00023.html
>     > There is also some ongoing work on better lvresize support for more then 1
>     > single LV. This will also implement better approach to resize of lvmetad
>     > which is using different mechanism in kernel.
>
>     Possible IOCTL sequence:
>       dm-suspend origin0
>       dm-message create_snap 3 0
>       dm-message set_transaction_id 3 4

Every transaction update here needs lvm2 metadata confirmation - i.e. a
double commit. lvm2 does not allow jumping by more than 1 transaction here,
and the error path also cleans up 1 transaction.
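
(For reference, the pool's current transaction_id can be read back as the
first status field of the thin-pool target - the device name is just an
example:

  dmsetup status vg1-pool-tpool
  # -> "0 2097152 thin-pool 4 ..."   <- the "4" is the transaction id

lvm2 compares this against the transaction_id stored in its own metadata
before and after each bump.)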


>       dm-resume origin0
>       dm-suspend origin1
>       dm-message create_snap 4 1
>       dm-message set_transaction_id 4 5
>       dm-resume origin1
>       dm-suspend origin2
>       dm-message create_snap 5 2
>       dm-message set_transaction_id 5 6
>       dm-resume origin2
>       ...
>
> 6. Is there any other way to accelerate LVM operations? I have enabled
>     lvmetad, and set global_filter and md_component_detection=0 in lvm.conf.

Reduce the number of PVs holding metadata in case your VG has lots of PVs
(this may reduce metadata resilience in case the PVs holding it are lost...)

Filters are magic - try to accept only devices which are potential PVs and 
reject everything else. (by default every device is accepted and scanned...)

Disabling archiving & backup in filesystem (in lvm.conf) may help a lot if you 
run lots of lvm2 commands and you do not care about archive.

Check that /etc/lvm/archive is not full of thousands of files.

Check with 'strace -tttt' what is delaying your command.
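
For example (volume names are placeholders):

  ls /etc/lvm/archive | wc -l      # should not be in the thousands
  strace -tttt -f -o /tmp/lvm.trace lvcreate -s -n snap0 vg1/lv0
  # then look for large gaps between consecutive timestamps in the trace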

And yes - there are always a couple of ongoing transmutations in lvm2 which
may have introduced some performance regression - so opening a BZ is always
useful if you spot such a thing.

Regards

Zdenek


* Re: [linux-lvm] Why LVM metadata locations are not properly aligned
  2016-04-21  4:08 [linux-lvm] Why LVM metadata locations are not properly aligned Ming-Hung Tsai
  2016-04-21  9:54 ` Zdenek Kabelac
@ 2016-04-21 10:11 ` Alasdair G Kergon
  1 sibling, 0 replies; 6+ messages in thread
From: Alasdair G Kergon @ 2016-04-21 10:11 UTC (permalink / raw)
  To: LVM general discussion and development

On Thu, Apr 21, 2016 at 12:08:55PM +0800, Ming-Hung Tsai wrote:
> However, it's hard to achieve that if the PV is busy running IO. 

So flush your data in advance of running the snapshot commands so there is only
minimal data to sync during the snapshot process itself.

> The major overhead is LVM metadata IO.

Are you sure?  That would be unusual.  How many copies of the metadata have you
chosen to keep?  (metadata/vgmetadatacopies)  How big is this metadata?  (E.g.
size of /etc/lvm/backup/<vgname> file.)
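
(Both are easy to check, e.g.:

  vgs -o vg_name,vg_mda_count,vg_mda_copies,vg_mda_size vg1
  ls -l /etc/lvm/backup/vg1

where "vg1" stands in for your VG name.)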

Alasdair


* Re: [linux-lvm] Why LVM metadata locations are not properly aligned
  2016-04-21  9:54 ` Zdenek Kabelac
@ 2016-04-21 13:22   ` Zdenek Kabelac
  2016-04-22  8:43   ` Ming-Hung Tsai
  1 sibling, 0 replies; 6+ messages in thread
From: Zdenek Kabelac @ 2016-04-21 13:22 UTC (permalink / raw)
  To: LVM general discussion and development

On 21.4.2016 11:54, Zdenek Kabelac wrote:
> On 21.4.2016 06:08, Ming-Hung Tsai wrote:
>> Hi,
>>

>>
>> 1. The metadata locations (raw_locn::offset) are not properly aligned.
>>     Function _aligned_io() requires the IO to be logical-block aligned,
>>     but the metadata locations returned by next_rlocn_offset() are only
>>     512-byte aligned. If a device's logical block size is greater than 512
>>     bytes, LVM needs to use a bounce buffer to do the IO.
>>     How about rounding raw_locn::offset up to the logical-block boundary?
>>     (or to max(logical_block_size, physical_block_size) for 512-byte logical /
>>      4KB physical-block drives?)
>
> This looks like a bug - lvm2 should always start writing metadata at a
> physical-block-aligned position.

Hi

I've opened an RFE BZ for this one - https://bugzilla.redhat.com/1329234
It's not completely trivial to fix this in a backward-compatible way - but I'm
almost 100% sure it's not the cause of your 10s delay, unless...

Regards

Zdenek


* Re: [linux-lvm] Why LVM metadata locations are not properly aligned
  2016-04-21  9:54 ` Zdenek Kabelac
  2016-04-21 13:22   ` Zdenek Kabelac
@ 2016-04-22  8:43   ` Ming-Hung Tsai
  2016-04-22  9:49     ` Zdenek Kabelac
  1 sibling, 1 reply; 6+ messages in thread
From: Ming-Hung Tsai @ 2016-04-22  8:43 UTC (permalink / raw)
  To: LVM general discussion and development

2016-04-21 18:11 GMT+08:00 Alasdair G Kergon <agk@redhat.com>:
> On Thu, Apr 21, 2016 at 12:08:55PM +0800, Ming-Hung Tsai wrote:
>> However, it's hard to achieve that if the PV is busy running IO.
>
> So flush your data in advance of running the snapshot commands so there is only
> minimal data to sync during the snapshot process itself.
>
>> The major overhead is LVM metadata IO.
>
> Are you sure?  That would be unusual.  How many copies of the metadata have you
> chosen to keep?  (metadata/vgmetadatacopies)  How big is this metadata?  (E.g.
> size of /etc/lvm/backup/<vgname> file.)

My configuration:
- Only one PV in the volume group
- A thin pool with several thin volumes
- The size of a metadata record is less than 16KB
- lvm.conf:
    metadata/vgmetadatacopies=1
    devices/md_component_detection=0 because it requires disk IO
                                     (the other filters are relatively fast)
    devices/global_filter=[ "a/md/", "r/.*/" ]
    backup/retain_days=0 and backup/retain_min=30, so there are at most
    30 backups

Although there's no IO on the target volume being snapshotted, the system is
still doing IO on other volumes, which increases the latency of the direct
IOs issued by LVM.

2016-04-21 17:54 GMT+08:00 Zdenek Kabelac <zkabelac@redhat.com>:
> On 21.4.2016 06:08, Ming-Hung Tsai wrote:
>
> Hmm, do you observe that taking a snapshot takes more than a second?
> IMHO the largest portion of the time should be the 'disk' synchronization
> when suspending (full flush and fs sync).
> Unless you have lvm2 metadata in the range of MiB (and lvm2 was not designed
> for that) - you should be well below a second...
>
> you will save a couple
> of disk reads - but this will not help much with your timing problem if you
> have an overloaded disk I/O system.
> Note lvm2 is using direct I/O, which I guess is your trouble maker here...

That's the point. I should not have said "LVM metadata IO is the overhead".
LVM just suffers from the system load, so it cannot finish its metadata
direct IOs within seconds. I can try to manage data flushing and filesystem
sync before taking snapshots, but on the other hand, I wish to reduce the
number of IOs issued by LVM.

>
> Changing disk scheduler to deadline ?
> Lowering percentage of dirty-pages ?
>

In my previous testing on kernel 3.12, CFQ+ionice performed better than
deadline in this case, but now it seems that the schedulers for blk-mq are
not yet ready. I also tried using cgroups to do IO throttling when taking
snapshots. I can do some more testing.
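
Roughly, the cgroup throttling of the competing workload looks like this
(the device numbers, limit and $WORKLOAD_PID are placeholders):

  # cgroup v1 blkio: throttle writes of the background workload on the PV (8:16 = /dev/sdb)
  mkdir -p /sys/fs/cgroup/blkio/background
  echo "8:16 20971520" > /sys/fs/cgroup/blkio/background/blkio.throttle.write_bps_device
  echo $WORKLOAD_PID > /sys/fs/cgroup/blkio/background/cgroup.procs
  # and/or lower its priority under CFQ
  ionice -c2 -n7 -p $WORKLOAD_PID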


>> 3. Why does LVM use such a complex process to update metadata?
>>
> It's been simplified once already ;) and we have lost quite an important
> property: validation of the written data during pre-commit -
> which is quite useful when a user is running on a misconfigured multipath device...
>
> Each state has its logic, and with each state we need to be sure the data
> are there.
>
> A valid idea might be to support a 'riskier' variant of the metadata
> update

I don't fully understand the purpose of pre-commit. Why not write the metadata
and then update the mda header immediately? Could you give me an example?

>> 5. Feature request: could we take multiple snapshots in a batch, to reduce
>>     the number of metadata IO operations?
>
> Every transaction update here needs lvm2 metadata confirmation - i.e. a
> double commit. lvm2 does not allow jumping by more than 1 transaction here,
> and the error path also cleans up 1 transaction.

How about creating the snapshots with the same transaction_id?

IOCTL sequence:
  LVM commit metadata with queued create_snap messages
  dm-suspend origin0
  dm-message create_snap 3 0
  dm-resume origin0
  dm-suspend origin1
  dm-message create_snap 4 1
  dm-resume origin1
  dm-message set_transaction_id 3 4
  LVM commit metadata with updated transaction_id

Related post: https://www.redhat.com/archives/dm-devel/2016-March/msg00071.html
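
In dmsetup terms, the batched variant would look roughly like this (names and
ids are placeholders; the thin-pool target device is the <vg>-<pool>-tpool one):

  dmsetup suspend vg1-origin0
  dmsetup message vg1-pool-tpool 0 "create_snap 3 0"
  dmsetup resume vg1-origin0
  dmsetup suspend vg1-origin1
  dmsetup message vg1-pool-tpool 0 "create_snap 4 1"
  dmsetup resume vg1-origin1
  dmsetup message vg1-pool-tpool 0 "set_transaction_id 3 4"   # single bump at the end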


>> 6. Is there any other way to accelerate LVM operations?
>>
> Reduce the number of PVs holding metadata in case your VG has lots of PVs
> (this may reduce metadata resilience in case the PVs holding it are lost...)

There's only one PV in my case. For multi-PV cases, I think I could
temporarily disable metadata writing on some PVs by setting --metadataignore.
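
E.g. something like (with /dev/sdc just a placeholder):

  pvchange --metadataignore y /dev/sdc   # stop writing VG metadata to this PV
  lvcreate -s -n snap0 vg1/lv0           # ...take the snapshots...
  pvchange --metadataignore n /dev/sdc   # re-enable it afterwards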


> Filters are magic - try to accept only devices which are potential PVs and
> reject everything else. (by default every device is accepted and scanned...)

One more question: why is the filter cache disabled when using lvmetad?
(comment in init_filters(): "... Also avoid it when lvmetad is enabled.")
Thus LVM needs to check all the devices under /dev when it starts.

Alternatively, is there any way to make lvm_cache handle only some specific
devices, instead of checking the entire directory?
(e.g., allow devices/scan=["/dev/md[0-9]*"] to filter devices at an earlier
 stage. The current strategy is to call dev_cache_add_dir("/dev") and then
 check individual devices, which requires a lot of unnecessary stat()
 syscalls.)

There's also an undocumented configuration option, devices/loopfiles. It
seems to be for loop device files.

> Disabling archiving & backup in filesystem (in lvm.conf) may help a lot if
> you run lots of lvm2 commands and you do not care about archive.

I know there's the -An option for lvcreate, but the system load and direct IO
are now the main issue.


Thanks,
Ming-Hung Tsai


* Re: [linux-lvm] Why LVM metadata locations are not properly aligned
  2016-04-22  8:43   ` Ming-Hung Tsai
@ 2016-04-22  9:49     ` Zdenek Kabelac
  0 siblings, 0 replies; 6+ messages in thread
From: Zdenek Kabelac @ 2016-04-22  9:49 UTC (permalink / raw)
  To: linux-lvm

On 22.4.2016 10:43, Ming-Hung Tsai wrote:
> 2016-04-21 18:11 GMT+08:00 Alasdair G Kergon <agk@redhat.com>:
>> On Thu, Apr 21, 2016 at 12:08:55PM +0800, Ming-Hung Tsai wrote:
>>> However, it's hard to achieve that if the PV is busy running IO.
>>
>> So flush your data in advance of running the snapshot commands so there is only
>> minimal data to sync during the snapshot process itself.
>>
>>> The major overhead is LVM metadata IO.
>> Note lvm2 is using direct I/O, which I guess is your trouble maker here...
>
> That's the point. I should not have said "LVM metadata IO is the overhead".
> LVM just suffers from the system load, so it cannot finish its metadata
> direct IOs within seconds. I can try to manage data flushing and filesystem
> sync before taking snapshots, but on the other hand, I wish to reduce the
> number of IOs issued by LVM.
>
>>
>> Changing disk scheduler to deadline ?
>> Lowering percentage of dirty-pages ?
>>
>
> In my previous testing on kernel 3.12, CFQ+ionice performed better than
> deadline in this case, but now it seems that the schedulers for blk-mq are
> not yet ready.
> I also tried using cgroups to do IO throttling when taking snapshots.
> I can do some more testing.
>

Yep - if a simple set of I/Os takes several seconds - it's not really
a problem lvm2 can solve.

You should consider lowering the amount of dirty pages so you are
not using a system with an extreme delay in the write queue.

The defaults are like 60% of RAM can be dirty, and if you have a lot of RAM it
may take quite a while to sync all this to the device - and that's
what will happen with 'suspend'.

You may just try to measure it with a plain 'dmsetup suspend/resume'
on a device you want to snapshot, on your loaded hw.

An interesting thing to play with could be 'dmstats' (a relatively recent
addition) for tracking latencies and I/O load on disk areas...
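
E.g. (device names are placeholders):

  grep Dirty /proc/meminfo           # how much data is waiting for writeback
  time dmsetup suspend vg1-origin0   # this is where the flush cost shows up
  time dmsetup resume vg1-origin0
  dmstats create vg1-origin0         # one region covering the whole device
  dmstats report vg1-origin0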


>>> 3. Why does LVM use such a complex process to update metadata?
>>>
>> It's been simplified once already ;) and we have lost quite an important
>> property: validation of the written data during pre-commit -
>> which is quite useful when a user is running on a misconfigured multipath device...
>>
>> Each state has its logic, and with each state we need to be sure the data
>> are there.
>>
>> A valid idea might be to support a 'riskier' variant of the metadata
>> update
>
> I don't fully understand the purpose of pre-commit. Why not write the metadata
> and then update the mda header immediately? Could you give me an example?


You need to see the 'command' and the 'activation/locking' parts as 2 different
entities/processes - which may not have any common data.

The command knows the data and does some operation on them.

The locking code then only sees the data written on disk (+ a couple of extra
bits of passed info).

So in a cluster, one node runs the command and a different node might be
activating a device purely from the written metadata - having no common
structure with the command code.
Now there are 'some' bypass code paths to avoid rereading the info if it is a
single command that is also doing the locking part...

The 'magic' is the 'suspend' operation - which is the ONLY operation that
sees both 'committed' & 'pre-committed' metadata (lvm2 has 2 slots).
If anything fails in 'pre-commit' - the metadata are dropped
and the state remains 'committed'.
When the pre-commit suspend is successful - then we may commit and resume
with the now-committed metadata.

It's quite a complicated state machine with many constraints and obviously
still with some bugs and tweaks.

Sometimes we do miss some bits of information, and trying to remain
compatible makes it challenging...


>
>>> 5. Feature request: could we take multiple snapshots in a batch, to reduce
>>>      the number of metadata IO operations?
>>
>> Every transaction update here needs lvm2 metadata confirmation - i.e. a
>> double commit. lvm2 does not allow jumping by more than 1 transaction here,
>> and the error path also cleans up 1 transaction.
>
> How about creating the snapshots with the same transaction_id?

Yes - that's how it will work - it's in the plan...
It's the error-path handling that needs some thinking.
First I want to improve the check for free space in the metadata to match
the kernel logic more closely...


>> Filters are magic - try to accept only devices which are potential PVs and
>> reject everything else. (by default every device is accepted and scanned...)
>
> One more question: why is the filter cache disabled when using lvmetad?
> (comment in init_filters(): "... Also avoid it when lvmetad is enabled.")
> Thus LVM needs to check all the devices under /dev when it starts.

lvmetad is only a "cache" - however we do not treat lvmetad as a
trustworthy source of info, for many reasons - primarily, 'udevd' is a
toy-tool process with many unhandled corner cases - particularly whenever
you have duplicate/dead devices it becomes useless...

So the purpose is to avoid looking for metadata - but whenever we write new
metadata - we grab protecting locks and need to be sure there are no racing
commands - this can't be ensured by a udev-controlled lvmetad with completely
unpredictable update timing and synchronization
(udev has a built-in 30sec timeout for rule processing which might be far too
small on a loaded system...)

In other words - 'lvmetad' is somewhat useful for 'lvs', but cannot be trusted
for lvcreate/lvconvert...


> Alternatively, is there any way to make lvm_cache handle only some specific
> devices, instead of checking the entire directory?
> (e.g., allow devices/scan=["/dev/md[0-9]*"] to filter devices at an earlier
>  stage. The current strategy is to call dev_cache_add_dir("/dev") and then
>  check individual devices, which requires a lot of unnecessary stat()
>  syscalls.)
>
> There's also an undocumented configuration option, devices/loopfiles. It
> seems to be for loop device files.

It's always best to open an RHBZ for such items so they are not lost...

>> Disabling archiving & backup in filesystem (in lvm.conf) may help a lot if
>> you run lots of lvm2 commands and you do not care about archive.
>
> I know there's the -An option for lvcreate, but the system load and direct IO
> are now the main issue.

Direct IO is mostly mandatory - since many caching layers these days may ruin
everything - i.e. using qemu over SAN - you may get completely unpredictable
races without direct IO.
But maybe supporting some 'untrusted' cached write might be usable for
some users... not sure - but I'd imagine an lvm.conf option for this.
Such an lvm2 would then not be supportable for customers...
(so we would need to track that the user has been using such an option...)

Regards

Zdenek

