* dm-cache: Can I change policy without suspending the cache?
@ 2015-12-29 23:41 Alex Sudakar
From: Alex Sudakar @ 2015-12-29 23:41 UTC
  To: device-mapper development

Hi.  I've set up my system - using a Linux 4.3.3 kernel - to use a
dm-cache as the 'physical volume' for most of the LVM logical volumes
in the system, including the root filesystem.  This seems to be
working fine in daily operation.

During the night I run a couple of jobs which do reads of many of the
files in the system (for example, I run 'tripwire', which computes
checksums of files to see if any unauthorized changes have been made).
Ideally I don't want these night 'batch jobs' to affect the cache's
'daytime performance profile'.  I'd like the cache to be primed for
typical day use and have the night-time scans run without promoting
blocks into the cache which never see the light of day.  I've got a
couple of questions related to how I might do this which I'd like to
ask.  I've googled but haven't been able to find any answers
elsewhere; I hope it's okay to ask here.

My cache is running in writeback mode with the default smq policy.  To
my delight it seems that the 'cleaner' policy does *exactly* what I
want; not only does it immediately flush dirty blocks, as per the
documentation; it also appears to 'turn off' the promotion/demotion of
blocks in the cache.  In my tests of a stand-alone cache I dumped the
metadevice using 'cache_dump'; created the cache in writeback mode
using the cleaner policy; read and wrote blocks through the cache;
removed the cache and did another dump of the metadevice; finding that
the mapped blocks hadn't changed at all.  Which is brilliant!

So my plan is to have my writeback dm-cache running through the day
with the default 'smq' policy and then switch to the 'cleaner' policy
between midnight and 6am, say, allowing my batch jobs to run without
impacting the daytime cache mappings in the slightest.

My first question is to confirm that the cleaner policy does do what
I've observed it to do - deliberately stop all promotions/demotions,
leaving the block map static, as well as immediately flush dirty
blocks to the origin device.  In all my reading I've seen the latter
characteristic mentioned as the prime purpose of the policy - to flush
all dirty blocks - but nothing about the former.  But it's that
'freezing' of block migration into the cache which is exactly what I
want.

If there's documentation on that aspect of the cleaner policy's
operation I'd very much appreciate a reference, or otherwise any other
information about it.  Is the cleaner policy guaranteed to continue
this behavior?  :)

My second question is how I can do this; switching policies for a
dm-cache on a live system where the cache is the backing device for
the root filesystem.  With my test cache I was easily able to perform
the sequence of steps that all of the documentation says must be
performed to change policies:

  -  'dmsetup suspend' the cache
  -  'dmsetup reload' a new table with a change to the cleaner policy
  -  'dmsetup resume' the cache
  -  'dmsetup wait'

This worked fine for my test cache, because only my test scripts had
the cache open.

But when I had a simple shell script execute the steps above, in
sequence, on my real cache ... the entire system hung after the
'suspend'.  Because my cache is the backing device acting as the LVM
physical device for most of my system's LVM volumes, including the
root filesystem volume.  And I/O to the cache would block while the
cache is suspended, I guess, which hung the script between separate
'dmsetup' commands.  :(

So my second question is - how can I switch policies on a live
dm-cache device when the running script/program doing the switch is
itself using I/O through the device?

It would be great if the dmsetup command could take multiple commands,
so I could execute the suspend/reload/resume all in one invocation.
Or if it could read a series of commands from standard input, say.
Anything to allow the dmsetup to do all three steps in the one
process.  But I can't see anything that allows this.

The kernel cache.txt documentation talks about using 'dmsetup message'
to send messages to the device mapper driver, but only in the context
of altering policy tuning variables; I didn't see anything about how
one could change the policy itself using a message.  Otherwise I could
have a single process fire off a string of policy-switch commands.

The only way I can currently see to switch policies on a live dm-cache
which is a backing store for the root filesystem is to run the
'switch' script from a chrooted filesystem that isn't connected to the
cache device.  I may very well end up using the same initrd image that
I use to shut down the system; adding a 'switch' script and then
unpacking the initrd image into a tmpfs filesystem and chrooting to
that to switch between smq and cleaner policies at midnight and 6am.

But it would be nice if there was an easier, more elegant way to do it.

Any (1) confirmation on the cleaner policy's behavior in 'freezing'
the cache block map and (2) advice on how to change policies on a live
cache which is backing the live root filesystem would be most
gratefully received.

Thanks!


* Re: dm-cache: Can I change policy without suspending the cache?
From: Joe Thornber @ 2016-01-04 15:50 UTC
  To: device-mapper development

On Wed, Dec 30, 2015 at 09:41:10AM +1000, Alex Sudakar wrote:
> Hi.  I've set up my system - using a Linux 4.3.3 kernel - to use a
> dm-cache as the 'physical volume' for most of the LVM logical volumes
> in the system, including the root filesystem.  This seems to be
> working fine in daily operation.
> 
> During the night I run a couple of jobs which do reads of many of the
> files in the system (for example, I run 'tripwire', which computes
> checksums of files to see if any unauthorized changes have been made).
> Ideally I don't want these night 'batch jobs' to affect the cache's
> 'daytime performance profile'.  I'd like the cache to be primed for
> typical day use and have the night-time scans run without promoting
> blocks into the cache which never see the light of day.  I've got a
> couple of questions related to how I might do this which I'd like to
> ask.  I've googled but haven't been able to find any answers
> elsewhere; I hope it's okay to ask here.
> 
> My cache is running in writeback mode with the default smq policy.  To
> my delight it seems that the 'cleaner' policy does *exactly* what I
> want; not only does it immediately flush dirty blocks, as per the
> documentation; it also appears to 'turn off' the promotion/demotion of
> blocks in the cache.

The smq policy is pretty reticent about promoting blocks to the fast
device unless there's evidence that those blocks are being hit more
frequently than those in the cache.  I suggest you do some experiments
to double check your batch jobs really are causing churn in the cache.

> So my plan is to have my writeback dm-cache running through the day
> with the default 'smq' policy and then switch to the 'cleaner' policy
> between midnight and 6am, say, allowing my batch jobs to run without
> impacting the daytime cache mappings in the slightest.

There is another option, which is to just turn the
'migration_threshold' tunable for smq down to zero.  Which will
practically stop any migrations.
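
As a rough sketch of that option: tunables can be changed on a live
cache with 'dmsetup message', as the kernel cache.txt documentation
describes, so no table reload is involved.  The device name 'mycache'
and the helper names 'run' and 'set_migration_threshold' are
assumptions for illustration only, and the value to restore afterwards
should be whatever the cache reported before it was changed.

```shell
#!/bin/sh
# Sketch only: 'mycache' and these helper names are hypothetical.

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = dm device name, $2 = migration threshold in sectors
set_migration_threshold() {
    run dmsetup message "$1" 0 migration_threshold "$2"
}

# Before the batch jobs:  set_migration_threshold mycache 0
# After they finish:      set_migration_threshold mycache "$previous_value"
```

Running with DRY_RUN=1 first shows the exact command without touching
the device.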

> My first question is to confirm that the cleaner policy does do what
> I've observed it to do - deliberately stop all promotions/demotions,
> leaving the block map static, as well as immediately flush dirty
> blocks to the origin device.

Yes.  But it's pretty aggressive about writing the dirty data back,
which may impact performance.

> My second question is how I can do this; switching policies for a
> dm-cache on a live system where the cache is the backing device for
> the root filesystem.  With my test cache I was easily able to perform
> the sequence of steps that all of the documentation says must be
> performed to change policies:
> 
>   -  'dmsetup suspend' the cache
>   -  'dmsetup reload' a new table with a change to the cleaner policy
>   -  'dmsetup resume' the cache
>   -  'dmsetup wait'
> 
> This worked fine for my test cache, because only my test scripts had
> the cache open.
> 
> But when I had a simple shell script execute the steps above, in
> sequence, on my real cache ... the entire system hung after the
> 'suspend'.  Because my cache is the backing device acting as the LVM
> physical device for most of my system's LVM volumes, including the
> root filesystem volume.  And I/O to the cache would block while the
> cache is suspended, I guess, which hung the script between separate
> 'dmsetup' commands.  :(

Yes, this is always going to be a problem.  If dmsetup is paged out,
you better hope it's not on one of the suspended devices.  LVM2
memlocks itself to avoid being paged out.  I think you have a few
options, in order of complexity:

- You don't have to suspend before you load the new table.  I think
  the sequence ...

  dmsetup load
  dmsetup resume  # implicit suspend, swap table, resume

  ... will do what you want, and may well avoid the hang.

- Put dmsetup and associated libraries somewhere where the IO is
  guaranteed to complete even though the root dev etc are
  suspended. (eg, a little ram disk).

- Switch from using dmsetup to use the new zodcache tool that was
  posted here last month.  If zodcache doesn't memlock, we'll patch to
  make sure it does.
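
The first option in the list above (load, then resume) might be
scripted roughly as follows.  This is a sketch under stated
assumptions: 'swap_policy' and 'switch_policy' are hypothetical
helpers, the table is assumed to be a single line in the cache
target's documented layout (... <block size> <#feature args>
[<feature arg>]* <policy> <#policy args> ...), and any existing policy
arguments are dropped so the new policy starts with its defaults.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Rewrite a one-line dm-cache table so it names a different policy.
# $1 = existing table line, $2 = new policy name
swap_policy() {
    printf '%s\n' "$1" | awk -v policy="$2" '{
        nfeat = $8                      # field 8 is the feature-arg count
        out = $1
        for (i = 2; i <= 8 + nfeat; i++)
            out = out " " $i            # keep everything up to the features
        print out " " policy " 0"       # new policy, zero policy args
    }'
}

# Load the rewritten table into the inactive slot, then resume.
# No explicit 'dmsetup suspend' is issued.
# $1 = dm device name, $2 = new policy name
switch_policy() {
    table=$(dmsetup table "$1") || return 1
    dmsetup load "$1" --table "$(swap_policy "$table" "$2")" &&
        dmsetup resume "$1"
}
```

Calling 'swap_policy' on its own with the output of 'dmsetup table'
is a cheap way to check the rewritten table before loading it.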

> It would be great if the dmsetup command could take multiple commands,
> so I could execute the suspend/reload/resume all in one invocation.

See zodcache.

> Or if it could read a series of commands from standard input, say.
> Anything to allow the dmsetup to do all three steps in the one
> process.  But I can't see anything that allows this.

Yes, this has been talked about before.  I spent a bit of time
experimenting with a tool I called dmexec.  This implemented a little
stack based language that you could use to build your own sequence of
device mapper operations.  For example:

https://github.com/jthornber/dmexec/blob/master/language-tests/table-tests.dm

I really think something like this is the way forward, though possibly
with a less opaque language.  Volume managers would then be
implemented as a mix of low level dmexec libraries, and high level
calls into dmexec.

> The kernel cache.txt documentation talks about using 'dmsetup message'
> to send messages to the device mapper driver, but only in the context
> of altering policy tuning variables; I didn't see anything about how
> one could change the policy itself using a message.  Otherwise I could
> have a single process fire off a string of policy-switch commands.

You have to load the new table.

- Joe


* Re: dm-cache: Can I change policy without suspending the cache?
From: Alex Sudakar @ 2016-04-12  3:30 UTC
  To: device-mapper development; +Cc: Joe Thornber

On Tue, Jan 5, 2016 at 1:50 AM, Joe Thornber <thornber@redhat.com> wrote:
>
> On Wed, Dec 30, 2015 at 09:41:10AM +1000, Alex Sudakar wrote:
>>
>> My cache is running in writeback mode with the default smq policy.  To
>> my delight it seems that the 'cleaner' policy does *exactly* what I
>> want; not only does it immediately flush dirty blocks, as per the
>> documentation; it also appears to 'turn off' the promotion/demotion of
>> blocks in the cache.
>
> The smq policy is pretty reticent about promoting blocks to the fast
> device unless there's evidence that those blocks are being hit more
> frequently than those in the cache.  I suggest you do some experiments
> to double check your batch jobs really are causing churn in the cache.

Thank you for that advice.  I've since seen other messages here
mentioning the 'reticence' of the smq policy.  I admit it was just my
assumption that a complete single pass through the entire filesystem,
once a day, would throw the cache statistics out of whack.  Maybe that
assumption was only merited with the old 'mq' policy?

>> So my plan is to have my writeback dm-cache running through the day
>> with the default 'smq' policy and then switch to the 'cleaner' policy
>> between midnight and 6am, say, allowing my batch jobs to run without
>> impacting the daytime cache mappings in the slightest.
>
> There is another option, which is to just turn the
> 'migration_threshold' tunable for smq down to zero.  Which will
> practically stop any migrations.

I didn't think of that option at all, and it would be so easy to do on
the fly!  Thank you!

>> But when I had a simple shell script execute the steps above, in
>> sequence, on my real cache ... the entire system hung after the
>> 'suspend'.  Because my cache is the backing device acting as the LVM
>> physical device for most of my system's LVM volumes, including the
>> root filesystem volume.  And I/O to the cache would block while the
>> cache is suspended, I guess, which hung the script between separate
>> 'dmsetup' commands.  :(
>
> Yes, this is always going to be a problem.  If dmsetup is paged out,
> you better hope it's not on one of the suspended devices.  LVM2
> memlocks itself to avoid being paged out.  I think you have a few
> options, in order of complexity:
>
> - You don't have to suspend before you load the new table.  I think
>   the sequence ...
>
>   dmsetup load
>   dmsetup resume  # implicit suspend, swap table, resume
>
>   ... will do what you want, and may well avoid the hang.

This is brilliant suggestion #2.  :-)

From reading dmsetup(8) I just *assumed* that a 'resume' had to be on
the other side of a 'suspend', given that the first sentence of the
description for the command reads 'un-suspends a device'.  I'm sort of
stunned that a 'suspend' isn't necessary for a 'resume' to do what I
need and load a new table.  By just commenting out the 'suspend' in my
script everything worked exactly as I wanted.  *Thank you* for this
nugget of dmsetup wisdom.

> - Put dmsetup and associated libraries somewhere where the IO is
>   guaranteed to complete even though the root dev etc are
>   suspended. (eg, a little ram disk).

Yes, I was thinking of setting up a ram disk - using the dracut
module/commands which do exactly this for a system shutdown - if I
had to keep going down the path of doing a 'suspend'.

>> Or if it could read a series of commands from standard input, say.
>> Anything to allow the dmsetup to do all three steps in the one
>> process.  But I can't see anything that allows this.
>
> Yes, this has been talked about before.  I spent a bit of time
> experimenting with a tool I called dmexec.  This implemented a little
> stack based language that you could use to build your own sequence of
> device mapper operations.  For example:
>
> https://github.com/jthornber/dmexec/blob/master/language-tests/table-tests.dm
>
> I really think something like this is the way forward, though possibly
> with a less opaque language.  Volume managers would then be
> implemented as a mix of low level dmexec libraries, and high level
> calls into dmexec.

I had a shot at doing a cruder form of this; I hacked a copy of
dmsetup to read multiple commands from *argv[], each prefaced by a
number telling the 'command loop' how many values of *argv[] to use
for the next command; very basic stuff.  After finding one or two
global variables which were expected to be in their initial
program-load state this hacked version of dmsetup worked fine; on a
test standalone dm-cache device it would suspend, load, resume
perfectly.
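
The count-prefixed argument framing described above can be
illustrated with a small shell sketch.  It is not the actual patch,
which changed dmsetup itself so that everything ran in one process;
'run_batch' is a hypothetical name, and the function only prints the
commands it would dispatch.  It assumes every count is a valid
non-negative integer.

```shell
#!/bin/sh
# Sketch only: illustrates the "N followed by N words" framing.
# Usage: run_batch N word... [N word...]...

run_batch() {
    while [ "$#" -gt 0 ]; do
        n=$1
        shift
        # Refuse a count that runs past the end of the argument list.
        if [ "$n" -gt "$#" ]; then
            echo "run_batch: count $n exceeds remaining arguments" >&2
            return 1
        fi
        # Collect the next n words as one command.
        cmd=
        while [ "$n" -gt 0 ]; do
            cmd="$cmd $1"
            shift
            n=$((n - 1))
        done
        echo "dmsetup$cmd"
    done
}

# Example: run_batch 2 suspend mycache 2 resume mycache
```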

But it still hung when I ran it on my live dm-cache, which provides
the LVM PV for the root and other filesystems.

My PC has 16GB of memory, and about 14GB of that was free.  Swap
wasn't being used at all.

My interest is only academic - you've solved my problem entirely with
your brilliant suggestions #1 & #2 above :-) - but I wouldn't mind
knowing why a resume on a dm-cache underpinning the root filesystem
still hung the executing hacked dmsetup program from doing a table
load and resume.  Memory of an executing process won't be swapped out
if there is a lot of RAM free, right?  Maybe dmsetup does something
else as part of a suspend which triggers these hangs.  Or the resume
needs something from the root filesystem.  Or something.  :-)

> - Switch from using dmsetup to use the new zodcache tool that was
>   posted here last month.  If zodcache doesn't memlock, we'll patch to
>   make sure it does.
>
> ...
>
>> It would be great if the dmsetup command could take multiple commands,
>> so I could execute the suspend/reload/resume all in one invocation.
>
> See zodcache.

I've looked at zodcache ... and wished I'd known about it earlier.
Instead of huffing and puffing and doing all my scripting of dracut
modules to pick up customised kernel directives as to the identity of
the devices to use for my dm-cache, and then building same, I see how
zodcache does a much more elegant job by leveraging the functionality
of udev together with using superblocks to identify the component
devices automatically.  Very nice; I think I've learned something just
by perusing its readme.pdf.  :-)  I'll definitely use zodcache next
time.

(The LVM cache seemed a bit cumbersome and overengineered for me,
which is why I decided to build my own more flexible and
direct/simpler dm-cache underpinning my various PVs and LVs.)

> - Joe

Joe, thank you very much for your advice, which saved the day two or
three different ways!  Your detailed response, and the time you spent
writing it, is much appreciated.
