LSF-MM Volatile Ranges Discussion Plans

* LSF-MM Volatile Ranges Discussion Plans
@ 2013-04-17 17:56 John Stultz
  2013-04-17 20:12 ` Paul Turner
  2013-04-23  3:11 ` Summary of LSF-MM Volatile Ranges Discussion John Stultz
  0 siblings, 2 replies; 10+ messages in thread
From: John Stultz @ 2013-04-17 17:56 UTC (permalink / raw)
  To: lsf, linux-mm
  Cc: Minchan Kim, Dmitry Vyukov, Paul Turner, Robert Love, Dave Hansen,
	Taras Glek, Mike Hommey, Kostya Serebryany

LSF-MM Volatile Ranges Discussion Plans
=======================================

Just wanted to send this out to hopefully prime the discussion at
lsf-mm tomorrow (should the schedule hold). Much of it is background
material we won't have time to cover.

First of all, this is my (John's) perspective; Minchan may disagree
with me on specifics, but I think it covers the desired behavior
fairly well, and I've tried to call out the places where we don't
yet agree.

Volatile Ranges:
----------------

Idea is from Android's ashmem feature (originally by Robert Love),
which allows for unpinned ranges.

I've been told other OSes support similar functionality
(VM_FLAGS_PURGABLE on Darwin and MEM_RESET/MEM_RESET_UNDO on Windows).

Been slow going last 6-mo on my part, due to lots of adorable
SIGBABY interruptions & other work.

Concept in general:
-------------------

Applications mark memory as volatile, allowing the kernel to purge
that memory if and when it's needed. Applications can later mark the
memory as non-volatile, and the kernel will return a value to notify
them if the memory was purged while it was volatile.

Use cases:
----------

Allows for eviction of userspace cache by the kernel, which is nice
as applications don't have to tinker with optimizing cache sizes:
the kernel, which has the global view, will optimize it for them.

Marking obscured bitmaps of rendered image data volatile. I.e.: Keep
the compressed jpeg around, but mark off-screen rendered bitmaps volatile.

Marking non-visible web-browser tabs as volatile.

Lazy freeing of heap in malloc/free implementations.

Parallel ways of thinking about it:
-----------------------------------

Similar to MADV_DONTNEED, but eviction is need-based, not
instantaneous. Also, applications can cancel eviction if it hasn't
happened yet (by setting non-volatile).  So sort of a delayed and
cancel-able MADV_DONTNEED.

Can consider it like swapping some pages to /dev/null ?

Rik's MADV_FREE was very similar, but with implicit NON_VOLATILE
marking on page-dirtying.

Two basic usage-modes:
----------------------

1) Application explicitly unmarks memory as volatile whenever it
uses it, never touching memory marked volatile.

     If memory is purged, the application is notified when it marks the
     area as non-volatile.

2) Application may access memory marked volatile, but should it
access memory that was purged, it will receive SIGBUS.

     On SIGBUS, application has to mark needed range as non-volatile,
     regenerate or re-fetch the data, and then can continue.

     This is a little more optimistic, but applications need to be
     able to handle getting a SIGBUS and fixing things up.

     This second optimistic method is desired by Mozilla folks.
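
From the application side, usage-mode 1 might look roughly like the
sketch below. Everything here is a stand-in invented for illustration:
fake_mvrange(), the TOY_* mode values and struct cache are not the
proposed interface (which is still under discussion), the stand-in takes
a struct rather than an address, and the "kernel" purge is simulated
with a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { TOY_VOLATILE, TOY_NON_VOLATILE };  /* hypothetical mode values */

struct cache {
    char data[64];
    bool is_volatile;   /* simulated kernel-side state */
    bool purged;        /* simulated: contents dropped while volatile */
    int regenerations;  /* how often the data had to be rebuilt */
};

/* Hypothetical stand-in for mvrange(start_addr, length, mode, flags,
 * &purged). Returns the number of bytes marked. */
long fake_mvrange(struct cache *c, size_t len, int mode, int *purged)
{
    assert(c != NULL);
    if (mode == TOY_VOLATILE) {
        c->is_volatile = true;
    } else {
        if (purged)
            *purged = c->purged ? 1 : 0;
        c->is_volatile = false;
        c->purged = false;
    }
    return (long)len;
}

/* Simulated memory pressure: only volatile data may be dropped. */
void fake_memory_pressure(struct cache *c)
{
    if (c->is_volatile) {
        memset(c->data, 0, sizeof(c->data));
        c->purged = true;
    }
}

/* Rebuild the cached data (e.g. re-decode the compressed jpeg). */
void regenerate(struct cache *c)
{
    strcpy(c->data, "rendered bitmap");
    c->regenerations++;
}

/* Mode 1: mark non-volatile before every use, regenerate if purged. */
const char *cache_acquire(struct cache *c)
{
    int purged = 0;

    fake_mvrange(c, sizeof(c->data), TOY_NON_VOLATILE, &purged);
    if (purged)
        regenerate(c);
    return c->data;
}

/* Done with the data for now: offer it back to the kernel. */
void cache_release(struct cache *c)
{
    fake_mvrange(c, sizeof(c->data), TOY_VOLATILE, NULL);
}
```

The point of the pattern is that the data is only ever touched between
cache_acquire() and cache_release(), so the application never sees a
purged page directly. Usage-mode 2 would skip the acquire step and do
the same fix-up from a SIGBUS handler instead.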

Important Goals:
----------------

Applications using this are likely to mark and unmark ranges
frequently (ideally only marking the data they immediately need as
non-volatile). This makes it necessary for these operations to be cheap,
since applications won't volunteer their currently unused memory to
the kernel if it adds dramatic overhead.  Although this concern is
lessened with the optimistic/SIGBUS usage-mode.

Overall, we try to push costs from the mark/unmark paths to the page
eviction side.

Two basic types of volatile memory:
-----------------------------------

1) File based memory

2) Anonymous memory

Volatile ranges on file memory:
-------------------------------

This allows for using volatile ranges on shared memory between
processes.

Very similar to ashmem's unpinned pages.

One example: Two processes can create a large circular buffer, where
any unused memory in that buffer is volatile. Producer marks memory
as non-volatile, writes to it. The consumer would read the data,
then mark it volatile.

An important distinction here is that the volatility is shared,
in the same way the file's data is shared. It's a property of the
file's pages, not a property of the process that marked the range as
volatile. Thus one application can mark file data as volatile, and
the pages could be purged from all applications mapping that data.
And a different application could mark it as non-volatile, and that
would keep it from being purged from all applications.

For this reason, the volatility is likely best to be stored on
address_space (or otherwise connected to the address_space/inode).

Another important semantic: Volatility is cleared when all fd's to
a file are closed.

     There's no really good way for volatility to persist when no one
     is using a file.

     It could cause confusion if an application died leaving some
     file data volatile, and then had that data disappear as it was
     starting up again.

     No volatility across reboots!

[TBD]: For the most part, volatile ranges really only make sense to
me on tmpfs files, mostly because purging data on files is
semantically similar to hole punching, and I suspect having the
resulting hole pushed out to disk would cause additional io and load.
Partial range purging could have strange effects on the resulting file.

[TBD]: Minchan disagrees and thinks fadvise(DONTNEED) has problems,
as it causes immediate writeout when there's plenty of free memory
(possibly unnecessary). Although we may defer so long that the hole
is never punched, which may be problematic.

Volatile ranges on anonymous/process memory:
--------------------------------------------

For anonymous memory, it's mostly un-shared between processes (except
copy-on-write pages).

The only way to address anonymous memory is really relative to the
process address space (it's anonymous: there's no named handle to it).

Same semantics as described above. Mark region of process memory
volatile, or non-volatile.

Volatility is a per-process (well, mm_struct) state.

Kernel will only purge a memory page, if *all* the processes that
map that page in consider the page volatile.

Important semantics: Preserve volatility over a fork, but clear child
volatility on exec.

     So if a process marks a range as volatile and then forks, both
     the child and parent should see the same range as volatile.
     On memory pressure, kernel could purge those pages, since all of
     the processes that map that page consider it volatile.

     If the child writes to the pages, the COW links are broken, but
     both ranges are still volatile, and can be purged until they
     are marked non-volatile or cleared.

     Then like mappings and the rest of memory, volatile ranges are
     cleared on exec.
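
The purge rule and the fork/exec semantics above can be modelled in a
few lines. This is a toy userspace model, not kernel code: struct
toy_mm and the toy_* helpers are made up for illustration, it tracks a
single page, and it does not model the COW-break case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an mm_struct: one page, one volatility bit. */
struct toy_mm {
    bool maps_page;    /* this address space maps the page */
    bool is_volatile;  /* this address space marked it volatile */
};

/* The page may be purged only if every mm that maps it considers it
 * volatile (and at least one mm still maps it). */
bool toy_page_purgeable(const struct toy_mm *mms, size_t n)
{
    bool mapped = false;

    for (size_t i = 0; i < n; i++) {
        if (!mms[i].maps_page)
            continue;
        if (!mms[i].is_volatile)
            return false;
        mapped = true;
    }
    return mapped;
}

/* fork(): the child inherits both the mapping and its volatility. */
struct toy_mm toy_fork(const struct toy_mm *parent)
{
    assert(parent != NULL);
    return *parent;
}

/* exec(): the old mappings go away, and the volatility goes with them. */
void toy_exec(struct toy_mm *mm)
{
    assert(mm != NULL);
    mm->maps_page = false;
    mm->is_volatile = false;
}
```

With a volatile parent, a freshly forked child leaves the page
purgeable; if the child then marks its range non-volatile the page is
pinned for both, and once the child execs only the parent's view counts
again.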

Implementation history:
-----------------------

File-focused (John): Interval tree connected to address_space w/ global
LRU of unpurged volatile ranges. Used shrinker to trigger purging
off the lru. NUMA folks complained that shrinker is NUMA-unaware and
would cause purging on nodes not under pressure.

File-focused (John): Checking volatility at page eviction time. Caused
problems on swap-free systems, since tmpfs pages are anonymous and
aren't aged/shrunk off lrus. In order to handle that we moved the
pages to a volatile lru list, but that causes volatile/non-volatile
operations to be very expensive O(n) for number of pages in the range.

Anon-focused (Minchan): Store volatility in VMA. Worked well for
anonymous ranges, but was problematic to extend to file ranges as
we need volatility state to be connected with the file, not the
process. Iterating across and splitting VMAs was somewhat costly.

Anon-focused (Minchan): Store anonymous volatility in interval tree
off of the mm_struct. Use global LRU of volatile ranges to use when
purging ranges via a shrinker. Also hooks into normal eviction to
make sure evicted pages are purged instead of swapped out. Very fast,
due to quick manipulations to a single interval tree.  File pages in
ranges are ignored.

Both (John): Same as above, but mostly extended so interval tree
of ranges can be hung off of the mm_struct OR an address_space.
Currently functionality is partitioned so volatile ranges on files and
on anonymous memory are created via separate syscalls (fvrange(fd,
start, len, ...) vs mvrange(start_addr, len,...)).  Roughly merges
the original first approach with the previous one.

Both (John): Currently working on above, further extending mvrange()
so it can also be used to set volatility on MAP_SHARED file mappings
in an address space. Has the problem that handling both file and
anonymous memory types in a single call requires iterating over vmas,
which makes the operation more expensive.

[TBD]: Cost impact of mvrange() supporting mapped file pages vs dev
confusion of it not supporting file pages

Current interfaces:
-------------------

Two current interfaces:
     fvrange(fd, start_off, length, mode, flags, &purged)

     mvrange(start_addr, length, mode, flags, &purged)

fd/start/length:
     Hopefully obvious :)

mode:
     VOLATILE: Sets range as volatile. Returns number of bytes marked
     volatile.

     NON_VOLATILE: Marks range as non-volatile. Returns number of bytes
     marked non-volatile, sets purged value to 1 if any memory in the
     bytes marked non-volatile were purged.

flags:
     VRANGE_FULL: On eviction, the entire range specified will be purged

     VRANGE_PARTIAL: On eviction, we may purge only part of the
     specified range.

     In earlier discussions, it was deemed that if any page in
     a volatile range was purged, we might as well purge the entire
     range, since if we mark any portion of that range as non-volatile,
     the application would have to regenerate the entire range. Thus
     we might as well reduce memory pressure by purging the entire range.

     However, with the SIGBUS semantics, applications may be able to
     continue accessing pages in a volatile range where one unused
     page is purged, so we may want to avoid purging the entire range
     to allow for optimistic continued use.

     Additionally, partial purging is helpful so that we don't over-react
     when we have slight memory pressure. For example, if we have a
     64M vrange and the kernel only needs 8M, it's much cheaper to
     free 8M now and then later, when the range is marked non-volatile,
     re-allocate only 8M (fault + allocation + zero-clearing) instead
     of the entire 64M.

     [TBD]: May consider merging flags w/ mode, i.e.: VOLATILE_FULL,
     VOLATILE_PARTIAL, NON_VOLATILE

     [TBD]: Might be able to simplify and go with VRANGE_PARTIAL all
     the time?
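
     The 64M/8M example above comes down to the following arithmetic.
     This is only a sketch: toy_bytes_purged() and the TOY_* policy
     names are invented here, and page granularity is ignored.

```c
#include <assert.h>
#include <stddef.h>

enum toy_purge_policy { TOY_VRANGE_FULL, TOY_VRANGE_PARTIAL };

/* Bytes discarded from one volatile range when the kernel needs
 * `needed` bytes. Whatever is discarded here has to be re-faulted,
 * re-allocated and zeroed later if the application reuses the range. */
size_t toy_bytes_purged(size_t range_len, size_t needed,
                        enum toy_purge_policy policy)
{
    assert(range_len > 0);
    if (needed == 0)
        return 0;
    if (policy == TOY_VRANGE_FULL || needed >= range_len)
        return range_len;
    return needed;
}
```

     For a 64M range and 8M of pressure, the partial policy gives up
     8M while the full policy gives up all 64M, which is the
     over-reaction the paragraph above is worried about.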

purged:
     Flag that returns 1 if any pages in the range marked
     NON_VOLATILE were purged. Is set to zero otherwise. Can be null
     if mode==VOLATILE.

     [TBD]: Might consider value passed to it will be |'ed with 1?

     [TBD]: Might consider purged to be more of a status bitflag,
     allowing vrange(VOLATILE) calls to get some meaningful data like
     if memory pressure is currently going on.

Return value:
     Number of bytes marked VOLATILE or NON_VOLATILE. This is necessary
     because if we are to deal with setting ranges that cross anonymous
     and file backed pages, we have to split the operation up into
     multiple operations against the respective mm_struct or
     address_space, and there's a possibility that we could run out of
     memory mid-way through. If we do run out of memory mid-way, we
     simply return the number of bytes successfully marked, and we
     can return an error on the next invocation if we hit the ENOMEM
     right away.

     [TBD]: If mvrange() doesn't affect mapped file pages, then the
     return value can be simpler.
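
     With that return convention, a caller would probably end up with
     a retry loop much like the usual short-write() handling. A rough
     sketch, assuming a made-up fake_mvrange_partial() stand-in and a
     simulated kernel that runs out of memory after a fixed budget
     (none of these names are part of the proposal):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simulated kernel state, purely for illustration. */
struct fake_kernel {
    size_t budget;        /* bytes it can still mark before ENOMEM */
    size_t max_per_call;  /* models a call that stops mid-way */
};

/* Hypothetical stand-in for mvrange(): returns the number of bytes
 * marked, or -ENOMEM if nothing could be marked at all. */
long fake_mvrange_partial(struct fake_kernel *k, size_t start, size_t len)
{
    size_t n = len;

    assert(k != NULL);
    (void)start;  /* the address does not matter in this model */
    if (n > k->max_per_call)
        n = k->max_per_call;
    if (n > k->budget)
        n = k->budget;
    if (n == 0 && len > 0)
        return -ENOMEM;
    k->budget -= n;
    return (long)n;
}

/* Mark [start, start + len), retrying after short returns. Returns 0
 * on success or a negative errno; *done holds the progress either way. */
long mark_all(struct fake_kernel *k, size_t start, size_t len, size_t *done)
{
    *done = 0;
    while (*done < len) {
        long ret = fake_mvrange_partial(k, start + *done, len - *done);

        if (ret < 0)
            return ret;
        *done += (size_t)ret;
    }
    return 0;
}
```

     The caller always learns how far the operation got, so after an
     ENOMEM it knows exactly which part of the range is still in its
     old state.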

Current TODOs:
--------------

Add proper SIGBUS signaling when accessing purged file ranges.

Working on handling mvrange() ranges that cross anonymous and mapped
file regions.

Handle errors mid-way through operations.

Cleanups and better function names.

[TBD] Contentious interface issues:
-----------------------------------

Does handling mvrange() calls that cross anonymous & file pages
increase costs too much for ebizzy workload Minchan likes?

     Have to take mmap_sem and traverse vmas.

     Could mvrange() on file pages not be shared in the same way as
     in fvrange()?

     Sane interface vs Speed?

Minchan's idea of mvrange(VOLATILE_FILE|VOLATILE_ANON|VOLATILE_BOTH):

     Avoid traversing vmas on VOLATILE_ANON flag, regardless of if
     range covers mapped file pages

     Not sure we can throw sane errors without checking vmas?

Do we really need a new syscall interface?

     Can we maybe go back to using madvise?

     Should mvrange be prioritized over fvrange, if mvrange can create
     volatile ranges on files?

Some folks still don't like SIGBUS on accessing a purged volatile page,
instead want standard zero-fill fault.

     Need some way to know page was dropped (zero is a valid data value)

     After marking non-volatile, it can be zero-fill fault.

[TBD] Contentious implementation issues:
----------------------------------------

Still using shrinker for purging, got early complaints from NUMA folks

     Can make sure we check first page in each range and purge only
     ranges where some page is in the zone being shrunk?

     Still use shrinker, but also use normal page shrinking path,
     but check for volatility. (swapless still needs shrinker)

Probably don't want to actually hang vrange interval tree (vrange_root)
off of address_space and mm_struct.

     In earlier attempts I used a hashtable to avoid this
         http://thread.gmane.org/gmane.linux.kernel/1278541/focus=1278542

     I assume this is still a concern?

Older non-contentious points:
-----------------------------

Coalescing of ranges: Don't do it unless the ranges overlap

Range granular vs page granular purging: Resolved with _FULL/_PARTIAL
flags
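
As a small illustration of the coalescing rule, here is a sketch using
half-open [start, end) ranges. struct toy_vrange and the helpers are
invented for this example, and treating merely adjacent ranges as
non-overlapping is my reading of the rule rather than something the
notes spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_vrange {
    size_t start;  /* inclusive */
    size_t end;    /* exclusive */
};

/* True only if the two ranges share at least one byte. */
bool toy_overlaps(struct toy_vrange a, struct toy_vrange b)
{
    return a.start < b.end && b.start < a.end;
}

/* Fold b into *a, but only on overlap. Returns true if *a was
 * widened to cover both ranges, false if they were left separate. */
bool toy_coalesce(struct toy_vrange *a, struct toy_vrange b)
{
    assert(a != NULL);
    if (!toy_overlaps(*a, b))
        return false;
    if (b.start < a->start)
        a->start = b.start;
    if (b.end > a->end)
        a->end = b.end;
    return true;
}
```

Under this reading, [0, 10) and [10, 20) stay two separate ranges,
while [0, 10) and [5, 15) become [0, 15).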

Other ideas/use-cases proposed:
-------------------------------

PTurner: Marking deep user-stack-frames as volatile to return that
memory?

Dmitry Vyukov: 20-80TB allocation, marked volatile right away. Never
marking non-volatile.

     Wants zero-fill and doesn't want SIGBUS

     https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges

Misc:
-----
Previous discussion: https://lwn.net/Articles/518130/


* Re: LSF-MM Volatile Ranges Discussion Plans
  2013-04-17 17:56 LSF-MM Volatile Ranges Discussion Plans John Stultz
@ 2013-04-17 20:12 ` Paul Turner
  2013-04-23  3:11 ` Summary of LSF-MM Volatile Ranges Discussion John Stultz
  1 sibling, 0 replies; 10+ messages in thread
From: Paul Turner @ 2013-04-17 20:12 UTC (permalink / raw)
  To: John Stultz
  Cc: lsf, linux-mm@kvack.org, Minchan Kim, Dmitry Vyukov, Robert Love,
	Dave Hansen, Taras Glek, Mike Hommey, Kostya Serebryany

On Wed, Apr 17, 2013 at 10:56 AM, John Stultz <john.stultz@linaro.org> wrote:

> PTurner: Marking deep user-stack-frames as volatile to return that
> memory?
>
>
Great write-up John.

Since there's a question mark I thought I'd add a qualifier:
I think this would be specifically useful with segmented stacks.  As we
cross region boundaries we could then mark the previous region as volatile
to allow reclaim without a large re-use penalty if the stack quickly grows
again.  This is a trade-off that is typically difficult to manage.


> Dmitry Vyukov: 20-80TB allocation, marked volatile right away. Never
> marking non-volatile.
>
>     Wants zero-fill and doesn't want SIGBUS
>
>     https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges
>
> Misc:
> -----
> Previous discussion: https://lwn.net/Articles/518130/


* Summary of LSF-MM Volatile Ranges Discussion
  2013-04-17 17:56 LSF-MM Volatile Ranges Discussion Plans John Stultz
  2013-04-17 20:12 ` Paul Turner
@ 2013-04-23  3:11 ` John Stultz
  2013-04-23  6:51   ` Dmitry Vyukov
                     ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: John Stultz @ 2013-04-23  3:11 UTC (permalink / raw)
  To: lsf, linux-mm
  Cc: Minchan Kim, Dmitry Vyukov, Paul Turner, Robert Love, Dave Hansen,
	Taras Glek, Mike Hommey, Kostya Serebryany, Hugh Dickins,
	Michel Lespinasse, KOSAKI Motohiro, Johannes Weiner, gthelen,
	Rik van Riel, glommer, mhocko

Just wanted to send out this quick summary of the Volatile Ranges 
discussion at LSF-MM.

Again, this is my recollection and perspective of the discussion, and
while I'm trying to also provide Minchan's perspective on some of the
problems as best I can, there may well be details that were
misunderstood or misremembered. So if I've gotten anything wrong,
please step in and reply to correct me. :)

Prior to the discussion, I sent out some background and discussion plans 
which you can read here:
http://permalink.gmane.org/gmane.linux.kernel.mm/98676

First of all, we quickly reviewed the generalized use cases and proposed 
interfaces:

1) madvise style interface:
	mvrange(start_addr, length, mode, flags, &purged)

2) fadvise/fallocate style interface:
	fvrange(fd, start_off, length, mode, flags, &purged)
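
To make the shape of the first interface concrete, here is a minimal
user-space mock. It is only a sketch under stated assumptions: mvrange()
is a proposal, not an existing system call, the mode names
VRANGE_VOLATILE and VRANGE_NONVOLATILE are invented for illustration
(the summary names a "mode" argument but not its values), and
toy_memory_pressure() stands in for the kernel deciding to purge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mode values, not taken from any patch. */
enum vrange_mode { VRANGE_VOLATILE, VRANGE_NONVOLATILE };

/* Toy stand-in for kernel state: a single tracked range. */
struct toy_range {
    void  *start;
    size_t len;
    bool   is_volatile;
    bool   purged;
};

static struct toy_range toy;

/* Mock of the proposed mvrange(start_addr, length, mode, flags, &purged). */
int mvrange(void *start, size_t len, enum vrange_mode mode, int flags,
            int *purged)
{
    (void)flags;
    assert(purged != NULL);
    *purged = 0;
    if (mode == VRANGE_VOLATILE) {
        toy = (struct toy_range){ start, len, true, false };
        return 0;
    }
    /* Marking non-volatile: report whether the contents were lost. */
    if (toy.start == start && toy.len == len) {
        *purged = toy.purged;
        toy.is_volatile = false;
        toy.purged = false;
    }
    return 0;
}

/* Simulated memory pressure: only a volatile range may be purged. */
void toy_memory_pressure(void)
{
    if (toy.is_volatile)
        toy.purged = true;
}
```

A caller would mark a cache volatile while idle, mark it non-volatile
before touching it again, and regenerate the contents if purged comes
back non-zero.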

Also noting (per the background summary) the desired semantics for 
volatile ranges on files is that the volatility is shared (just like the 
data is), thus we need to store that volatility off of the 
address_space. Thus only one process needs to mark the open file pages 
as volatile for them to be purged.

Whereas with anonymous memory, we really want to store the volatility
off of the mm_struct (in some way), and purge a page only if all the
processes that map it consider it volatile.
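
The two purge rules can be written down as a small predicate. This is a
toy model, not kernel code: page_may_be_purged() and its arguments are
hypothetical names, and the per-mapper flags simply record which
processes have marked the page volatile.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Shared (file) volatility lives with the file, so one marker suffices.
 * Private (anonymous) volatility lives per mm, so every mapper must agree.
 */
bool page_may_be_purged(bool shared, const bool *mapper_marked_volatile,
                        size_t n_mappers)
{
    assert(mapper_marked_volatile != NULL || n_mappers == 0);

    size_t marked = 0;
    for (size_t i = 0; i < n_mappers; i++)
        if (mapper_marked_volatile[i])
            marked++;

    if (shared)
        return marked > 0;
    return n_mappers > 0 && marked == n_mappers;
}
```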

I tried to quickly describe the issue: since performance is a concern,
we want marking and unmarking of volatile ranges to be as fast as
possible. This is of particular concern to Minchan and his ebizzy test
case, as taking the mmap_sem hurts performance too much.

However, this strong performance concern causes some complexity in the
madvise style interface, since a volatile range could cross both
anonymous and file pages.

Particularly the question of "What happens if a user calls mvrange()
over MAP_SHARED file pages?". I think we should push that volatility
down into the file volatility, but to do this we have to walk the vmas
and take the mmap_sem, which hurts Minchan's use case too drastically.

Minchan had earlier proposed having a VOLATILE_ANON | VOLATILE_FILE |
VOLATILE_BOTH mode flag, where we'd skip traversing the vmas in the
VOLATILE_ANON case, just adding the range to the process. Whereas for
VOLATILE_FILE or VOLATILE_BOTH we'd do the traversal.

However, there is still the problem of the case where someone marks
VOLATILE_ANON on mapped file pages. In this case, I'd expect we'd report
an error. However, in order to detect the error case, we'd still have to
traverse the vmas (otherwise we can't know if the range covers files or
not), which again would be too costly. And to me, Minchan's suggestion
of not reporting an error in this case seemed a bit too unintuitive for
a public interface.

The morning of the discussion, I realized that instead of thinking of
volatility only in terms of anonymous and file pages, we could think of
volatility as shared or private, much as file mappings are.

This would allow for the same functional behavior as Minchan's
VOLATILE_ANON vs VOLATILE_FILE modes, but instead we'd have
VOLATILE_PRIVATE and VOLATILE_SHARED. Only in the VOLATILE_SHARED
case would we need to traverse the VMAs in order to make sure that any
file-backed pages had the volatility added to their address_space. And
private volatility on files would then not be considered an error mode,
so we could avoid having to do the scan to validate the input.
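
A toy model of that split, to show where the vma walk is paid. All
struct and function names here are made up for illustration, toy_file
stands in for an address_space, and the handling of anonymous parts of
a VOLATILE_SHARED range (kept on the mm) is an assumption, since that
case was not spelled out.

```c
#include <assert.h>
#include <stddef.h>

enum vmode { VOLATILE_PRIVATE, VOLATILE_SHARED };

struct toy_file { size_t volatile_bytes; };   /* stands in for address_space */
struct toy_vma {
    size_t start, end;                        /* half-open [start, end) */
    struct toy_file *file;                    /* NULL means anonymous */
};
struct toy_mm {
    const struct toy_vma *vmas;
    size_t n_vmas;
    size_t private_volatile_bytes;
    size_t vma_walks;                         /* how often we paid for a walk */
};

void toy_mark_volatile(struct toy_mm *mm, size_t start, size_t end,
                       enum vmode mode)
{
    assert(mm != NULL && start < end);

    if (mode == VOLATILE_PRIVATE) {
        /* Fast path: record the range on the mm; no walk, no validation. */
        mm->private_volatile_bytes += end - start;
        return;
    }

    /* Slow path: walk the vmas so file-backed parts land on the file. */
    mm->vma_walks++;
    for (size_t i = 0; i < mm->n_vmas; i++) {
        const struct toy_vma *v = &mm->vmas[i];
        size_t lo = start > v->start ? start : v->start;
        size_t hi = end < v->end ? end : v->end;
        if (lo >= hi)
            continue;                         /* no overlap with this vma */
        if (v->file)
            v->file->volatile_bytes += hi - lo;
        else
            mm->private_volatile_bytes += hi - lo;  /* assumed behavior */
    }
}
```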

Minchan seemed to be in agreement with this concept. Though when I asked 
for reactions from the folks in the room, it seemed to be mostly tepid 
agreement mixed maybe with a bit of confusion.

One issue raised was the concern that keeping the private/anonymous
volatility state separate from the VMAs might cause cases where things
got "out-of-sync". For instance, if a range is marked volatile, then say
some pages are unmapped or a hole is punched in that range and other
pages are mapped in, what are the semantics of the resulting volatility?
Is the volatility inherited by future mappings? The example was given of
mlock, where a range can be locked, but should any new pages be mapped
into that range, the new pages are not locked. In other words, only the
pages mapped at that time are affected by the call to mlock.

Stumped by this, I agreed that was a fair critique we hadn't considered,
and that in the current implementation any new mappings in an existing
volatile range would be considered volatile, and that is inconsistent
with existing precedent.

It was pointed out that we could also make sure that on any unmapping or
new mapping we clear the private/anonymous volatility, and that might
keep things in sync while still allowing for the fast non-vma-traversing
calls to mark and unmark volatile ranges. But we'll have to look into
that.
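
One way to picture that clearing step is plain range subtraction over
the separately tracked volatile ranges. The sketch below is a user-space
toy, assuming a small fixed-size array instead of whatever tree the
real patches use; vrange_set and vrange_clear() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANGES 16

struct vrange { size_t start, end; };         /* half-open [start, end) */
struct vrange_set { struct vrange r[MAX_RANGES]; size_t n; };

/*
 * Remove [start, end) from every tracked volatile range, as a hook on
 * map/unmap might do. Returns 0 on success, or -1 (set left unchanged)
 * if splitting a range would overflow the fixed-size array.
 */
int vrange_clear(struct vrange_set *s, size_t start, size_t end)
{
    assert(s != NULL && start < end);

    struct vrange out[MAX_RANGES];
    size_t n = 0;

    for (size_t i = 0; i < s->n; i++) {
        struct vrange v = s->r[i];
        if (v.end <= start || v.start >= end) {   /* no overlap: keep whole */
            if (n == MAX_RANGES)
                return -1;
            out[n++] = v;
            continue;
        }
        if (v.start < start) {                    /* keep left remainder */
            if (n == MAX_RANGES)
                return -1;
            out[n++] = (struct vrange){ v.start, start };
        }
        if (v.end > end) {                        /* keep right remainder */
            if (n == MAX_RANGES)
                return -1;
            out[n++] = (struct vrange){ end, v.end };
        }
    }

    for (size_t i = 0; i < n; i++)
        s->r[i] = out[i];
    s->n = n;
    return 0;
}
```

With this, a new mapping placed in the middle of a volatile range splits
it in two, and the newly mapped pages start out non-volatile, which
matches the mlock precedent mentioned above.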

It was also noted that vmas are specifically designed to manage ranges 
of memory, so it seemed maybe a bit duplicative to have a separate tree 
tracking volatile ranges. And again we discussed the performance impact 
of taking the mmap_sem and traversing the vmas, and how avoiding that is 
particularly important to Minchan's use case.

I also noted that one difficulty with the earlier approach that did use
vmas was that for volatile ranges on files (ie: shared volatile
mappings), there is no similar shared vma-type structure for files.
Thus it's nice to be able to use the same volatile root structure to
store volatile ranges on both the private per-process (well,
per-mm_struct) and shared per-inode/address_space basis. Otherwise the
code paths for anonymous and file volatility have to be significantly
different, which would make it more complex to understand and maintain.

At this point, it was asked if the shared-volatility semantics on the 
shared mapped file is actually desired. And if instead we could keep 
file volatility in the vmas, only purging should every process that maps 
that file agree that the page is volatile.

The problem with this, as I see it, is that it is inconsistent with the
semantics of shared mapped files. If a file is mapped by multiple
processes, and zeros are written to that file by one process, all the
processes will see this change and they need to coordinate access if
such a change would be problematic. In the case of volatility, when we
purge pages, the kernel is in effect doing this on behalf of the process
that marked the range volatile. It just is a delayed action and can be
canceled (by the process that marked it volatile, or by any other process
with that range mapped).  I reiterated the example of a large circular
buffer in a shared file, which is initialized as entirely volatile. Then
a producer process would mark a region after the head as non-volatile,
then fill it with data. And a consumer process then consumes data from
the tail, and marks those consumed ranges as volatile.
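
The circular buffer example can be sketched as a single-process toy,
with one shared volatility flag per slot standing in for shared
volatility on the file. The names (struct ring, ring_purge() and so on)
are invented for illustration, and ring_purge() simulates the kernel
purging under memory pressure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 4

struct ring {
    int    data[SLOTS];
    bool   is_volatile[SLOTS];  /* shared: one flag per slot, seen by all */
    size_t head, tail;          /* produce at head, consume at tail */
};

void ring_init(struct ring *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < SLOTS; i++) {
        r->data[i] = 0;
        r->is_volatile[i] = true;   /* whole buffer starts out volatile */
    }
    r->head = r->tail = 0;
}

bool ring_produce(struct ring *r, int v)
{
    if (r->head - r->tail == SLOTS)
        return false;               /* full */
    size_t i = r->head % SLOTS;
    r->is_volatile[i] = false;      /* unmark first, so the fill is safe */
    r->data[i] = v;
    r->head++;
    return true;
}

bool ring_consume(struct ring *r, int *out)
{
    if (r->head == r->tail)
        return false;               /* empty */
    size_t i = r->tail % SLOTS;
    *out = r->data[i];
    r->is_volatile[i] = true;       /* consumed: may be reclaimed now */
    r->tail++;
    return true;
}

/* Simulated memory pressure: purge (zero) every volatile slot. */
size_t ring_purge(struct ring *r)
{
    size_t purged = 0;
    for (size_t i = 0; i < SLOTS; i++) {
        if (r->is_volatile[i]) {
            r->data[i] = 0;
            purged++;
        }
    }
    return purged;
}
```

Data between tail and head survives a purge, while unused and already
consumed slots can be reclaimed at any time without the producer and
consumer having to agree first.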

It was pointed out that the same could maybe be done by both processes 
marking the entire range, except what is between the current head and 
tail as volatile each iteration. So while pages wouldn't be truly 
volatile right after they were consumed, eventually the producer would 
run (well, hopefully) and update its view of volatility so that it 
agreed with the consumer with respect to those pages.

I noted that first of all, the shared volatility is needed to match the
Android ashmem semantics. So there's at least an existing user. And
while the method pointed out could be used, I still felt it is fairly
awkward, and again inconsistent with how shared mapped files normally
behave. After all, applications could "share" file data by coordinating
such that they all write the same data to their own private mapping,
but that loses much of the usefulness of shared mappings (to be fair, I
didn't have such a sharp example at the time of the discussion, but it's
the same point I rambled around). Thus I feel having shared volatility
for file pages is similarly useful.

It was also asked what the volatility semantics would be for non-mapped
files, given the fvrange() interface could be used there. In that case,
I don't have a strong opinion. If mvrange() can create shared volatile
ranges on mmapped files, I'm fine leaving fvrange() out. There may be an
in-kernel equivalent of fvrange() to make it easier to support Android's
ashmem, but volatility on non-mmapped files doesn't seem like it would
be too useful to me. But I'd probably want to go with what would be
least surprising to users.

It was hard to gauge the overall reaction in the room at this point.
There was some assorted nodding by various folks who seemed to be
following along and positive about the basic approach. There was also
some less positive, confused squinting that had me worried.

With time running low, Minchan reminded me that the shrinker was on the 
to-be-discussed list. Basically earlier versions of my patch used a 
shrinker to trigger range purging, and this was critiqued because 
shrinkers were numa-unaware, and might cause bad behavior where we might 
purge lots of ranges on a node that isn't under any memory pressure if 
one node is under pressure.  However, using normal LRU page eviction 
doesn't work for volatile ranges, as with swapless systems, we don't LRU 
age/evict anonymous memory.

Minchan's patch currently takes two approaches: it can use the
normal LRU eviction to trigger purging, but it also uses a shrinker to
force anonymous pages onto a page list which can then be evicted in
vmscan. This allows purging of anonymous pages when swapless, but also
allows the normal eviction process to work.

This brought up lots of discussion around what the ideal method would
be. Because the marking and unmarking of pages as volatile has to be
done quickly, we cannot iterate over pages at mark/unmark time to
create a new list. Aging and evicting all anonymous memory on swapless
systems also seems wasteful.

Ideally, I think we'd purge pages from volatile ranges in the global LRU 
eviction order. This would hopefully avoid purging data when we see lots 
of single-use streaming data.

Minchan however seems to feel volatile data should be purged earlier
than other pages, since they're a source of easily freeable memory
(I've also argued for this in the past, but have since changed my mind).
So he'd like a way to purge pages earlier, and unfortunately the
shrinker runs later than he'd like.
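
The difference between the two positions can be shown with a toy victim
picker. This is only an illustration of the ordering question, not how
vmscan works: the page struct, the policy names and pick_victim() are
all hypothetical, and it ignores the swapless case where anonymous
pages are not aged at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page { unsigned long last_used; bool is_volatile; };

enum policy { GLOBAL_LRU, VOLATILE_FIRST };

/*
 * Returns the index of the next page to reclaim, or -1 if there are none.
 * GLOBAL_LRU:     always the least recently used page, volatile or not.
 * VOLATILE_FIRST: the least recently used volatile page if one exists,
 *                 otherwise fall back to plain LRU order.
 */
long pick_victim(const struct toy_page *p, size_t n, enum policy pol)
{
    assert(p != NULL || n == 0);

    long oldest = -1, oldest_volatile = -1;
    for (size_t i = 0; i < n; i++) {
        if (oldest < 0 || p[i].last_used < p[oldest].last_used)
            oldest = (long)i;
        if (p[i].is_volatile &&
            (oldest_volatile < 0 ||
             p[i].last_used < p[oldest_volatile].last_used))
            oldest_volatile = (long)i;
    }

    if (pol == VOLATILE_FIRST && oldest_volatile >= 0)
        return oldest_volatile;
    return oldest;
}
```

Under GLOBAL_LRU a recently used volatile page outlives an old
non-volatile one; under VOLATILE_FIRST it is purged before any
non-volatile page is touched.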

It was noted that there are now patches to make the shrinkers NUMA
aware, so the older complaints might be solvable. But still the issue of
shrinkers having their own eviction logic separate from the global LRU
is less than ideal to me.

It was past time, and there didn't seem to be much consensus or
resolution on this issue, so we had to leave it there. That said, the
volatile purging logic is up to the kernel, and can be tweaked as needed
in the future, whereas the basic interface semantics were more
important to hash out, and I think I got mostly nodding on the majority
of the interface issues.

Hopefully with the next patch iteration, we'll have things cleaned up a
bit more and better unified between Minchan's and my approaches so
further details can be concretely worked out on the list. It was also
requested that a manpage document be provided with the next patch set,
which I'll make a point to provide.

Thanks so much to Minchan, Kosaki-san, Hugh, Michel, Johannes, Greg, 
Michal, Glauber, and everyone else for providing an active discussion 
and great feedback despite my likely over-caffeinated verbal wanderings.

Thanks again,
-john

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: Summary of LSF-MM Volatile Ranges Discussion
  2013-04-23  3:11 ` Summary of LSF-MM Volatile Ranges Discussion John Stultz
@ 2013-04-23  6:51   ` Dmitry Vyukov
  2013-04-24  0:26     ` John Stultz
  2013-04-24  8:14     ` Minchan Kim
  2013-04-24  8:11   ` Minchan Kim
  2013-05-16 17:24   ` Andrea Arcangeli
  2 siblings, 2 replies; 10+ messages in thread
From: Dmitry Vyukov @ 2013-04-23  6:51 UTC (permalink / raw)
  To: John Stultz
  Cc: lsf, linux-mm, Minchan Kim, Paul Turner, Robert Love, Dave Hansen,
	Taras Glek, Mike Hommey, Kostya Serebryany, Hugh Dickins,
	Michel Lespinasse, KOSAKI Motohiro, Johannes Weiner, gthelen,
	Rik van Riel, glommer, mhocko

Hi,

Just want to make sure our case does not fall out of the discussion:
https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges

While reading your email, I remembered that we actually have some
pages mapped from a file inside the range. So it's like 70TB of ANON
mapping + a few pages in the middle mapped from FILE. The file is mapped
with MAP_PRIVATE + PROT_READ, it's read-only and not shared.
But we want to mark the volatile range only once on startup, so
performance is not a serious concern (as long as the function executes
in, say, no more than 10ms).
If the mixed ANON+FILE ranges become a serious problem, we are ready
to remove FILE mappings, because it's only an optimization. I.e. we
can make it pure ANON mapping.
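
For reference, the "zero-fill, no SIGBUS" half of these semantics can be
observed today with madvise(MADV_DONTNEED) on private anonymous memory.
The sketch below is Linux-specific and is not volatile ranges:
MADV_DONTNEED discards the pages immediately rather than lazily under
memory pressure, and it has to be called again each time. The function
name discard_reads_zero() is made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Returns 1 if a written page reads back as zero after being discarded,
 * 0 if it does not, and -1 if mmap() or madvise() fails.
 */
int discard_reads_zero(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                            -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 0xAB;                      /* dirty the first page */

    int ok = -1;
    if (madvise(p, len, MADV_DONTNEED) == 0)
        ok = (p[0] == 0);             /* re-faulted as a zero page */

    munmap(p, len);
    return ok;
}
```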


* Re: Summary of LSF-MM Volatile Ranges Discussion
  2013-04-23  6:51   ` Dmitry Vyukov
@ 2013-04-24  0:26     ` John Stultz
  2013-04-24  6:11       ` Dmitry Vyukov
  2013-04-24  8:14     ` Minchan Kim
  1 sibling, 1 reply; 10+ messages in thread
From: John Stultz @ 2013-04-24  0:26 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: lsf, linux-mm, Minchan Kim, Paul Turner, Robert Love, Dave Hansen,
	Taras Glek, Mike Hommey, Kostya Serebryany, Hugh Dickins,
	Michel Lespinasse, KOSAKI Motohiro, Johannes Weiner, gthelen,
	Rik van Riel, glommer, mhocko

On 04/22/2013 11:51 PM, Dmitry Vyukov wrote:
> Just want to make sure our case does not fall out of the discussion:
> https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges

Yes, while I forgot to mention it in the summary, I did bring it up 
briefly, but I cannot claim to have done it justice.

Personally, while I suspect we might be able to support your desired 
semantics (ie: mark once volatile, always zero-fill, no sigbus) via a 
mode flag

> While reading your email, I remembered that we actually have some
> pages mapped from a file inside the range. So it's like 70TB of ANON
> mapping + few pages in the middle mapped from FILE. The file is mapped
> with MAP_PRIVATE + PROT_READ, it's read-only and not shared.
> But we want to mark the volatile range only once on startup, so
> performance is not a serious concern (while the function in executed
> in say no more than 10ms).
> If the mixed ANON+FILE ranges becomes a serious problem, we are ready
> to remove FILE mappings, because it's only an optimization. I.e. we
> can make it pure ANON mapping.
Well, in my mind, the MAP_PRIVATE mappings are semantically the same as 
anonymous memory with regards to volatility. So I hope this wouldn't be 
an issue.

thanks
-john


* Re: Summary of LSF-MM Volatile Ranges Discussion
  2013-04-24  0:26     ` John Stultz
@ 2013-04-24  6:11       ` Dmitry Vyukov
  0 siblings, 0 replies; 10+ messages in thread
From: Dmitry Vyukov @ 2013-04-24  6:11 UTC (permalink / raw)
  To: John Stultz
  Cc: lsf, linux-mm, Minchan Kim, Paul Turner, Robert Love, Dave Hansen,
	Taras Glek, Mike Hommey, Kostya Serebryany, Hugh Dickins,
	Michel Lespinasse, KOSAKI Motohiro, Johannes Weiner, Greg Thelen,
	Rik van Riel, glommer, mhocko

On Wed, Apr 24, 2013 at 4:26 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 04/22/2013 11:51 PM, Dmitry Vyukov wrote:
>>
>> Just want to make sure our case does not fall out of the discussion:
>> https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges
>
>
> Yes, while I forgot to mention it in the summary, I did bring it up briefly,
> but I cannot claim to have done it justice.

Thanks!

> Personally, while I suspect we might be able to support your desired
> semantics (ie: mark once volatile, always zero-fill, no sigbus) via a mode
> flag
>
>
>> While reading your email, I remembered that we actually have some
>> pages mapped from a file inside the range. So it's like 70TB of ANON
>> mapping + few pages in the middle mapped from FILE. The file is mapped
>> with MAP_PRIVATE + PROT_READ, it's read-only and not shared.
>> But we want to mark the volatile range only once on startup, so
>> performance is not a serious concern (while the function in executed
>> in say no more than 10ms).
>> If the mixed ANON+FILE ranges becomes a serious problem, we are ready
>> to remove FILE mappings, because it's only an optimization. I.e. we
>> can make it pure ANON mapping.
>
> Well, in my mind, the MAP_PRIVATE mappings are semantically the same as
> anonymous memory with regards to volatility. So I hope this wouldn't be an
> issue.

Ah, I see, so you are more concerned about SHARED rather than FILE. We
do NOT have any SHARED regions.


* Re: Summary of LSF-MM Volatile Ranges Discussion
  2013-04-23  6:51   ` Dmitry Vyukov
  2013-04-24  0:26     ` John Stultz
@ 2013-04-24  8:14     ` Minchan Kim
  1 sibling, 0 replies; 10+ messages in thread
From: Minchan Kim @ 2013-04-24  8:14 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: John Stultz, lsf, linux-mm, Paul Turner, Robert Love, Dave Hansen,
	Taras Glek, Mike Hommey, Kostya Serebryany, Hugh Dickins,
	Michel Lespinasse, KOSAKI Motohiro, Johannes Weiner, gthelen,
	Rik van Riel, glommer, mhocko

Hello Dmitry,

On Tue, Apr 23, 2013 at 10:51:10AM +0400, Dmitry Vyukov wrote:
> On Tue, Apr 23, 2013 at 7:11 AM, John Stultz <john.stultz@linaro.org> wrote:
> > Just wanted to send out this quick summary of the Volatile Ranges discussion
> > at LSF-MM.
> >
> > Again, this is my recollection and perspective of the discussion, and while
> > I'm trying to also provide Minchan's perspective on some of the problems as
> > best I can, there likely may be details that were misunderstood, or
> > mis-remembered. So if I've gotten anything wrong, please step in and reply
> > to correct me. :)
> >
> >
> > Prior to the discussion, I sent out some background and discussion plans
> > which you can read here:
> > http://permalink.gmane.org/gmane.linux.kernel.mm/98676
> >
> >
> > First of all, we quickly reviewed the generalized use cases and proposed
> > interfaces:
> >
> > 1) madvise style interface:
> >         mvrange(start_addr, length, mode, flags, &purged)
> >
> > 2) fadvise/fallocate style interface:
> >         fvrange(fd, start_off, length, mode, flags, &purged)
> >
> >
> > Also noting (per the background summary) the desired semantics for volatile
> > ranges on files is that the volatility is shared (just like the data is),
> > thus we need to store that volatility off of the address_space. Thus only
> > one process needs to mark the open file pages as volatile for them to be
> > purged.
> >
> > Where as with anonymous memory, we really want to store the volatility off
> > of the mm_struct (in some way), and only if all the processes that map a
> > page consider it volatile, do purging.
> >
> > I tried to quickly describe the issue that as performance is a concern, we
> > want the action of marking and umarking of volatile ranges to be as fast as
> > possible. This is of particular concern to Minchan and his ebizzy test case,
> > as taking the mmap_sem hurts performance too much.
> >
> > However, this strong performance concern causes some complexity in the
> > madvise style interface, since a volatile range could cross both
> > anonymous and file pages.
> >
> > Particularly the question of "What happens if a user calls mvrange() over
> > MAP_SHARED file pages?". I think we should push that volatility down into
> > the file volatility, but to do this we have to walk the vmas and take the
> > mmap_sem, which hurts Minchan's use case too drastically.
> >
> > Minchan had earlier proposed having a VOLATILE_ANON | VOLATILE_FILE |
> > VOLATILE_BOTH mode flag, where we'd skip traversing the vmas in the
> > VOLATILE_ANON case, just adding the range to the process. Where as
> > VOLATILE_FILE or VOLATILE_BOTH we'd do the traversing.
> >
> > However, there is still the problem of the case where someone marks
> > VOLATILE_ANON on mapped file pages. In this case, I'd expect we'd report an
> > error, however, in order to detect the error case, we'd have to still
> > traverse the vmas (otherwise we can't know if the range covers files or
> > not), which again would be too costly. And to me, Minchan's suggestion of
> > not providing an error on this case, seemed a bit too unintuitive for a
> > public interface.
> >
> > The morning of the discussion, I realized we could instead of thinking of
> > volatility only on anonymous and file pages, we could instead think of
> > volatility as shared or private, much as file mappings are.
> >
> > This would allow for the same functional behavior of Minchan's VOLATILE_ANON
> > vs VOLATILE_FILE modes, but instead we'd have VOLATILE_PRIVATE and
> > VOLATILE_SHARED. And only in the VOLATILE_SHARED case would we need to
> > traverse the VMAs in order to make sure that any file backed pages had the
> > volatility added to their address_space. And private volatility on files
> > would then not be considered an error mode, so we could avoid having to do
> > the scan to validate the input.
> >
> > Minchan seemed to be in agreement with this concept. Though when I asked for
> > reactions from the folks in the room, it seemed to be mostly tepid agreement
> > mixed maybe with a bit of confusion.
> >
> > One issue raised was the concern that by keeping the private/anonymous
> > volatility state separately from the VMAs might cause cases where things got
> > "out-of-sync". For instance, if a range is marked volatile, then say some
> > pages are unmapped or a hole is punched in that range and other pages are
> > mapped in, what are the semantics of the resulting volatility? Is the
> > volatility inherited to future ranges? The example was given of mlock, where
> > a range can be locked, but should any new pages be mapped into that range,
> > the new pages are not locked. In other words, only the pages mapped at that
> > time are affected by the call to mlock.
> >
> > Stumped by this, I agreed that was a fair critique we hadn't considered, and
> > that in the current implementation any new mappings in an existing volatile
> > range would be considered volatile, and that is inconsistent with existing
> > precedent.
> >
> > It was pointed out that we could also make sure that on any unmapping or new
> > mapping that we clear the private/anonymous volatility, and that might keep
> > things in sync, while still allowing for the fast non-vma traversing calls to
> > mark and unmark volatile ranges. But we'll have to look into that.
> >
> > It was also noted that vmas are specifically designed to manage ranges of
> > memory, so it seemed maybe a bit duplicative to have a separate tree
> > tracking volatile ranges. And again we discussed the performance impact of
> > taking the mmap_sem and traversing the vmas, and how avoiding that is
> > particularly important to Minchan's use case.
> >
> > I also noted that one difficulty with the earlier approach that did use vmas
> > was that for volatile ranges on files (ie: shared volatile mappings), there
> > is no similar shared vma type structure for files. Thus it's nice to be able
> > to use the same volatile root structure to store volatile ranges on both the
> > private per-process(well, per-mm_struct) and shared per-inode/address_space
> > basis. Otherwise the code paths for anonymous and file volatility have to be
> > significantly different, which would make it more complex to understand and
> > maintain.
> >
> > At this point, it was asked if the shared-volatility semantics on the shared
> > mapped file is actually desired. And if instead we could keep file
> > volatility in the vmas, only purging should every process that maps that
> > file agree that the page is volatile.
> >
> > The problem with this, as I see it is that it is inconsistent with the
> > semantics of shared mapped files. If a file is mapped by multiple processes,
> > and zeros are written to that file by one process, all the processes will
> > see this change and they need to coordinate access if such a change would be
> > problematic. In the case of volatility, when we purge pages, the kernel is
> > in-effect doing this on-behalf of the process that marked the range
> > volatile. It just is a delayed action and can be canceled (by the process
> > that marks it volatile, or by any other process with that range mapped).  I
> > re-iterated the example of a large circular buffer in a shared file, which
> > is initialized as entirely volatile. Then a producer process would mark a
> > region after the head as non-volatile, then fill it with data. And a
> > consumer process, then consumes data from the tail, and mark those consumed
> > ranges as volatile.
> >
> > It was pointed out that the same could maybe be done by both processes
> > marking the entire range, except what is between the current head and tail
> > as volatile each iteration. So while pages wouldn't be truly volatile right
> > after they were consumed, eventually the producer would run (well,
> > hopefully) and update its view of volatility so that it agreed with the
> > consumer with respect to those pages.
> >
> > I noted that first of all, the shared volatility is needed to match the
> > Android ashmem semantics. So there's at least an existing user. And that
> > while this method pointed out could be used, I still felt it is fairly
> > awkward, and again inconsistent with how shared mapped files normally
> > behave. After all, applications could "share" file data by coordinating such
> > that they all write the same data to their own private mapping, but that
> > loses much of the usefulness of shared mappings (to be fair, I didn't have
> > such a sharp example at the time of the discussion, but it's the same point I
> > rambled around). Thus I feel having shared volatility for file pages is
> > similarly useful.
> >
> > It was also asked what the volatility semantics would be for non-mapped
> > files, given the fvrange() interface could be used there. In that case, I
> > don't have a strong opinion. If mvrange can create shared volatile ranges on
> > mmaped files, I'm fine leaving fvrange() out. There may be an in-kernel
> > equivalent of fvrange() to make it easier to support Android's ashmem, but
> > volatility on non-mmapped files doesn't seem like it would be too useful to
> > me. But I'd probably want to go with what would be least surprising to
> > users.
> >
> > It was hard to gauge the overall reaction in the room at this point. There
> > was some assorted nodding by various folks who seemed to be following along
> > and positive of the basic approach. There were also some less positive
> > confused squinting that had me worried.
> >
> > With time running low, Minchan reminded me that the shrinker was on the
> > to-be-discussed list. Basically earlier versions of my patch used a shrinker
> > to trigger range purging, and this was critiqued because shrinkers were
> > numa-unaware, and might cause bad behavior where we might purge lots of
> > ranges on a node that isn't under any memory pressure if one node is under
> > pressure.  However, using normal LRU page eviction doesn't work for volatile
> > ranges, as with swapless systems, we don't LRU age/evict anonymous memory.
> >
> > Minchan's patch currently does two approaches, where it can use the normal
> > LRU eviction to trigger purging, but it also uses a shrinker to force
> > anonymous pages onto a page list which can then be evicted in vmscan. This
> > allows purging of anonymous pages when swapless, but also allows the normal
> > eviction process to work.
> >
> > This brought up lots of discussion around what the ideal method would be.
> > Because the marking and unmarking of pages as volatile has to be done
> > quickly, we cannot iterate over pages at mark/unmark time creating a new
> > list. Aging and evicting all anonymous memory on swapless systems also seems
> > wasteful.
> >
> > Ideally, I think we'd purge pages from volatile ranges in the global LRU
> > eviction order. This would hopefully avoid purging data when we see lots of
> > single-use streaming data.
> >
> > Minchan however seems to feel volatile data should be purged earlier than
> > other pages, since they're a source of easily free-able memory (I've also
> > argued for this in the past, but have since changed my mind). So he'd like a
> > way to purge pages earlier, and unfortunately the shrinker runs later than
> > he'd like.
> >
> > It was noted that there are now patches to make the shrinkers numa aware, so
> > the older complaints might be solvable. But still the issue of shrinkers
> > having their own eviction logic separate from the global LRU is less than
> > ideal to me.
> >
> > It was past time, and there didn't seem to be much consensus or resolution
> > on this issue, so we had to leave it there. That said, the volatile purging
> > logic is up to the kernel, and can be tweaked as needed in the future, where
> > as the basic interface semantics were more important to hash out, and I
> > think I got mostly nodding on the majority of the interface issues.
> >
> > Hopefully with the next patch iteration, we'll have things cleaned up a bit
> > more and better unified between Minchan's and my approaches so further
> > details can be concretely worked out on the list. It was also requested that
> > a manpage document be provided with the next patch set, which I'll make a
> > point to provide.
> >
> > Thanks so much to Minchan, Kosaki-san, Hugh, Michel, Johannes, Greg, Michal,
> > Glauber, and everyone else for providing an active discussion and great
> > feedback despite my likely over-caffeinated verbal wanderings.
> 
> 
> Hi,
> 
> Just want to make sure our case does not fall out of the discussion:
> https://code.google.com/p/thread-sanitizer/wiki/VolatileRanges
> 
> While reading your email, I remembered that we actually have some
> pages mapped from a file inside the range. So it's like 70TB of ANON
> mapping + few pages in the middle mapped from FILE. The file is mapped
> with MAP_PRIVATE + PROT_READ, it's read-only and not shared.
> But we want to mark the volatile range only once on startup, so
> performance is not a serious concern (as long as the function executes
> in, say, no more than 10ms).
> If the mixed ANON+FILE ranges become a serious problem, we are ready
> to remove FILE mappings, because it's only an optimization. I.e. we
> can make it pure ANON mapping.

As I mentioned in private mail, there is no issue supporting your requirement.
What we need is just the voice of the customer, and you are giving that voice now. :)
So no problem, IMO.

> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Summary of LSF-MM Volatile Ranges Discussion
  2013-04-23  3:11 ` Summary of LSF-MM Volatile Ranges Discussion John Stultz
  2013-04-23  6:51   ` Dmitry Vyukov
@ 2013-04-24  8:11   ` Minchan Kim
  2013-05-16 17:24   ` Andrea Arcangeli
  2 siblings, 0 replies; 10+ messages in thread
From: Minchan Kim @ 2013-04-24  8:11 UTC (permalink / raw)
  To: John Stultz
  Cc: lsf, linux-mm, Dmitry Vyukov, Paul Turner, Robert Love,
	Dave Hansen, Taras Glek, Mike Hommey, Kostya Serebryany,
	Hugh Dickins, Michel Lespinasse, KOSAKI Motohiro, Johannes Weiner,
	gthelen, Rik van Riel, glommer, mhocko

Hello John,

On Mon, Apr 22, 2013 at 08:11:39PM -0700, John Stultz wrote:
> Just wanted to send out this quick summary of the Volatile Ranges
> discussion at LSF-MM.
> 
> Again, this is my recollection and perspective of the discussion,
> and while I'm trying to also provide Minchan's perspective on some
> of the problems as best I can, there likely may be details that were
> misunderstood, or mis-remembered. So if I've gotten anything wrong,
> please step in and reply to correct me. :)

Sure. Thanks for your amazing summary!

> 
> 
> Prior to the discussion, I sent out some background and discussion
> plans which you can read here:
> http://permalink.gmane.org/gmane.linux.kernel.mm/98676
> 
> 
> First of all, we quickly reviewed the generalized use cases and
> proposed interfaces:
> 
> 1) madvise style interface:
> 	mvrange(start_addr, length, mode, flags, &purged)
> 
> 2) fadvise/fallocate style interface:
> 	fvrange(fd, start_off, length, mode, flags, &purged)
> 
> 
> Also noting (per the background summary) the desired semantics for
> volatile ranges on files is that the volatility is shared (just like
> the data is), thus we need to store that volatility off of the
> address_space. Thus only one process needs to mark the open file
> pages as volatile for them to be purged.
> 
> Where as with anonymous memory, we really want to store the
> volatility off of the mm_struct (in some way), and only if all the
> processes that map a page consider it volatile, do purging.
> 
> I tried to quickly describe the issue that as performance is a
> concern, we want the action of marking and unmarking of volatile
> ranges to be as fast as possible. This is of particular concern to
> Minchan and his ebizzy test case, as taking the mmap_sem hurts
> performance too much.

FYI, the reason it's a concern for anon-vrange is that I'd like to use
vrange in a userspace allocator instead of madvise(DONTNEED)/munmap.
A userspace allocator should work well in a multi-threaded environment,
but if we hold mmap_sem in the vrange system call, it hurts concurrent
page faults when one of the threads tries to mmap.
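
To make the allocator use case a bit more concrete, here is a rough
userspace sketch. It is only an illustration: mock_mvrange() is a
stand-in that mimics the proposed mvrange(start_addr, length, mode,
flags, &purged) call, the VRANGE_* mode values and the chunk helpers
are hypothetical names, and mock_memory_pressure() plays the role of
the kernel purging volatile chunks. No real kernel interface is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mode values; the thread does not define them. */
enum { VRANGE_VOLATILE, VRANGE_NONVOLATILE };

#define CHUNK_SIZE 4096
#define NCHUNKS    4

static char arena[NCHUNKS][CHUNK_SIZE];
static bool chunk_volatile[NCHUNKS];
static bool chunk_purged[NCHUNKS];

/* Userspace stand-in for the proposed mvrange(); flags are ignored.
 * When unmarking, *purged is set if any chunk in the range was purged. */
int mock_mvrange(void *start, size_t len, int mode, int flags, int *purged)
{
    size_t first = (size_t)((char *)start - &arena[0][0]) / CHUNK_SIZE;
    size_t end = first + len / CHUNK_SIZE;

    (void)flags;
    *purged = 0;
    for (size_t i = first; i < end && i < NCHUNKS; i++) {
        if (mode == VRANGE_VOLATILE) {
            chunk_volatile[i] = true;
        } else {
            if (chunk_purged[i])
                *purged = 1;
            chunk_volatile[i] = false;
            chunk_purged[i] = false;
        }
    }
    return 0;
}

/* Simulated memory pressure: every volatile chunk is discarded. */
void mock_memory_pressure(void)
{
    for (size_t i = 0; i < NCHUNKS; i++) {
        if (chunk_volatile[i]) {
            memset(arena[i], 0, CHUNK_SIZE);
            chunk_purged[i] = true;
        }
    }
}

/* Free path: mark the chunk volatile instead of unmapping it. */
void chunk_free(size_t idx)
{
    int purged;
    mock_mvrange(arena[idx], CHUNK_SIZE, VRANGE_VOLATILE, 0, &purged);
}

/* Reuse path: unmark the chunk; *lost tells the caller whether the
 * old contents were purged while the chunk was volatile. */
char *chunk_reuse(size_t idx, int *lost)
{
    mock_mvrange(arena[idx], CHUNK_SIZE, VRANGE_NONVOLATILE, 0, lost);
    return arena[idx];
}
```

The point of the sketch is that the free and reuse paths only flip
range state; nothing in them needs to walk or modify mappings.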

> 
> However, this strong performance concern causes some complexity in
> the madvise style interface, since a volatile range could cross
> both anonymous and file pages.
> 
> Particularly the question of "What happens if a user calls mvrange()
> over MAP_SHARED file pages?". I think we should push that
> volatility down into the file volatility, but to do this we have to
> walk the vmas and take the mmap_sem, which hurts Minchan's use case
> too drastically.

True. It made ebizzy performance about 3 times worse, AFAIRC.

> 
> Minchan had earlier proposed having a VOLATILE_ANON | VOLATILE_FILE
> | VOLATILE_BOTH mode flag, where we'd skip traversing the vmas in
> the VOLATILE_ANON case, just adding the range to the process. Where
> as VOLATILE_FILE or VOLATILE_BOTH we'd do the traversing.

Right.

> 
> However, there is still the problem of the case where someone marks
> VOLATILE_ANON on mapped file pages. In this case, I'd expect we'd
> report an error, however, in order to detect the error case, we'd
> have to still traverse the vmas (otherwise we can't know if the
> range covers files or not), which again would be too costly. And to
> me, Minchan's suggestion of not providing an error on this case,
> seemed a bit too unintuitive for a public interface.

Frankly speaking, I am not convinced that we should return an error in
such a case. I now think vrange isn't tied to vmas. A user can treat
some range of the address space as volatile regardless of whether it
already contains mmapped vmas or not.

> 
> The morning of the discussion, I realized we could instead of
> thinking of volatility only on anonymous and file pages, we could
> instead think of volatility as shared or private, much as file
> mappings are.
> 
> This would allow for the same functional behavior of Minchan's
> VOLATILE_ANON vs VOLATILE_FILE modes, but instead we'd have
> VOLATILE_PRIVATE and VOLATILE_SHARED. And only in the
> VOLATILE_SHARED case would we need to traverse the VMAs in order to
> make sure that any file backed pages had the volatility added to
> their address_space. And private volatility on files would then not
> be considered an error mode, so we could avoid having to do the scan
> to validate the input.
> 
> Minchan seemed to be in agreement with this concept. Though when I
> asked for reactions from the folks in the room, it seemed to be
> mostly tepid agreement mixed maybe with a bit of confusion.

I am not strongly against your suggestion.
But still, my preference is VOLATILE_[ANON|FILE] rather than
MAP_[PRIVATE|SHARED]-style naming because it looks more
straightforward to me. Anyway, it's no big deal. :)

> 
> One issue raised was the concern that by keeping the
> private/anonymous volatility state separately from the VMAs might
> cause cases where things got "out-of-sync". For instance, if a range
> is marked volatile, then say some pages are unmapped or a hole is
> punched in that range and other pages are mapped in, what are the
> semantics of the resulting volatility? Is the volatility inherited
> to future ranges? The example was given of mlock, where a range can
> be locked, but should any new pages be mapped into that range, the
> new pages are not locked. In other words, only the pages mapped at
> that time are affected by the call to mlock.
> 
> Stumped by this, I agreed that was a fair critique we hadn't
> considered, and that in the current implementation any new mappings
> in an existing volatile range would be considered volatile, and that
> is inconsistent with existing precedent.

Honestly speaking, I did consider it and concluded the current semantics
are more sane. For example, someone may want to mark a big range volatile
even though there are no mapped pages in the range at the moment.
Then they may want to build a new allocator on that range with
mmap(MAP_FIXED), so they can create new vmas inside the volatile range at
any time and the kernel can purge them at any time. I can't think of a
concrete example at the moment, but it could give good flexibility to users,
and it fits a vrange semantic that covers big ranges, even mixed anon + file.

We are creating a new system call, so we don't have to be tied strongly
to another system call's semantics. But yes, at the least I hope we
can give some example that is useful in real use cases.

> 
> It was pointed out that we could also make sure that on any
> unmapping or new mapping that we clear the private/anonymous
> volatility, and that might keep things in sync, while still allowing
> for the fast non-vma traversing calls to mark and unmark volatile
> ranges. But we'll have to look into that.
> 
> It was also noted that vmas are specifically designed to manage
> ranges of memory, so it seemed maybe a bit duplicative to have a
> separate tree tracking volatile ranges. And again we discussed the
> performance impact of taking the mmap_sem and traversing the vmas,
> and how avoiding that is particularly important to Minchan's use
> case.
> 
> I also noted that one difficulty with the earlier approach that did
> use vmas was that for volatile ranges on files (ie: shared volatile
> mappings), there is no similar shared vma type structure for files.
> Thus it's nice to be able to use the same volatile root structure to
> store volatile ranges on both the private per-process(well,
> per-mm_struct) and shared per-inode/address_space basis. Otherwise
> the code paths for anonymous and file volatility have to be
> significantly different, which would make it more complex to
> understand and maintain.

Fair enough.

> 
> At this point, it was asked if the shared-volatility semantics on
> the shared mapped file is actually desired. And if instead we could
> keep file volatility in the vmas, only purging should every process
> that maps that file agree that the page is volatile.
> 
> The problem with this, as I see it is that it is inconsistent with
> the semantics of shared mapped files. If a file is mapped by
> multiple processes, and zeros are written to that file by one
> process, all the processes will see this change and they need to
> coordinate access if such a change would be problematic. In the case
> of volatility, when we purge pages, the kernel is in-effect doing
> this on-behalf of the process that marked the range volatile. It
> just is a delayed action and can be canceled (by the process that
> marks it volatile, or by any other process with that range mapped).
> I re-iterated the example of a large circular buffer in a shared
> file, which is initialized as entirely volatile. Then a producer
> process would mark a region after the head as non-volatile, then
> fill it with data. And a consumer process, then consumes data from
> the tail, and mark those consumed ranges as volatile.
> 
> It was pointed out that the same could maybe be done by both
> processes marking the entire range, except what is between the
> current head and tail as volatile each iteration. So while pages
> wouldn't be truly volatile right after they were consumed,
> eventually the producer would run (well, hopefully) and update its
> view of volatility so that it agreed with the consumer with respect
> to those pages.
> 
> I noted that first of all, the shared volatility is needed to match
> the Android ashmem semantics. So there's at least an existing user.
> And that while this method pointed out could be used, I still felt
> it is fairly awkward, and again inconsistent with how shared mapped
> files normally behave. After all, applications could "share" file
> data by coordinating such that they all write the same data to
> their own private mapping, but that loses much of the usefulness of
> shared mappings (to be fair, I didn't have such a sharp example at
> the time of the discussion, but it's the same point I rambled
> around). Thus I feel having shared volatility for file pages is
> similarly useful.

Agreed.
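
For readers following along, the circular buffer example quoted above
can be sketched roughly as below. This is a toy single-process model
and makes several assumptions: the struct and function names are
hypothetical, one flag per slot stands in for volatility stored on the
shared file, and ring_purge() simulates the kernel discarding volatile
slots. It only illustrates why one process marking a slot is enough
when the volatility is shared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 8

struct ring {
    int    data[NSLOTS];
    bool   vol[NSLOTS];   /* shared volatility, one flag per slot */
    size_t head, tail, count;
};

/* The whole buffer starts out volatile, as in the quoted example. */
void ring_init(struct ring *r)
{
    r->head = r->tail = r->count = 0;
    for (size_t i = 0; i < NSLOTS; i++) {
        r->data[i] = 0;
        r->vol[i] = true;
    }
}

/* Producer: mark the slot after the head non-volatile, then fill it. */
bool ring_produce(struct ring *r, int v)
{
    if (r->count == NSLOTS)
        return false;
    r->vol[r->head] = false;
    r->data[r->head] = v;
    r->head = (r->head + 1) % NSLOTS;
    r->count++;
    return true;
}

/* Consumer: read from the tail, then mark the consumed slot volatile. */
bool ring_consume(struct ring *r, int *out)
{
    if (r->count == 0)
        return false;
    *out = r->data[r->tail];
    r->vol[r->tail] = true;
    r->tail = (r->tail + 1) % NSLOTS;
    r->count--;
    return true;
}

/* Simulated purge: discard every volatile slot, return how many. */
size_t ring_purge(struct ring *r)
{
    size_t n = 0;
    for (size_t i = 0; i < NSLOTS; i++) {
        if (r->vol[i]) {
            r->data[i] = 0;
            n++;
        }
    }
    return n;
}
```

In this model a slot becomes purgeable as soon as the consumer marks
it, without the producer having to agree first.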

> 
> It was also asked what the volatility semantics would be for
> non-mapped files, given the fvrange() interface could be used there.
> In that case, I don't have a strong opinion. If mvrange can create
> shared volatile ranges on mmaped files, I'm fine leaving fvrange()
> out. There may be an in-kernel equivalent of fvrange() to make it
> easier to support Android's ashmem, but volatility on non-mmapped
> files doesn't seem like it would be too useful to me. But I'd
> probably want to go with what would be least surprising to users.
> 
> It was hard to gauge the overall reaction in the room at this point.
> There was some assorted nodding by various folks who seemed to be
> following along and positive of the basic approach. There were also
> some less positive confused squinting that had me worried.
> 
> With time running low, Minchan reminded me that the shrinker was on
> the to-be-discussed list. Basically earlier versions of my patch
> used a shrinker to trigger range purging, and this was critiqued
> because shrinkers were numa-unaware, and might cause bad behavior
> where we might purge lots of ranges on a node that isn't under any
> memory pressure if one node is under pressure.  However, using
> normal LRU page eviction doesn't work for volatile ranges, as with
> swapless systems, we don't LRU age/evict anonymous memory.
> 
> Minchan's patch currently does two approaches, where it can use the
> normal LRU eviction to trigger purging, but it also uses a shrinker
> to force anonymous pages onto a page list which can then be evicted
> in vmscan. This allows purging of anonymous pages when swapless, but

Strictly speaking, it uses a kswapd hook, not a shrinker. But I plan
to move it from kswapd to a new kvranged, because kswapd is very fragile
these days. I'd like to keep kvranged separate until kswapd is very stable;
otherwise, we might maintain kvranged without unifying it with kswapd.

> also allows the normal eviction process to work.
> 
> This brought up lots of discussion around what the ideal method
> would be. Because the marking and unmarking of pages as
> volatile has to be done quickly, we cannot iterate over pages at
> mark/unmark time creating a new list. Aging and evicting all
> anonymous memory on swapless systems also seems wasteful.
> 
> Ideally, I think we'd purge pages from volatile ranges in the global
> LRU eviction order. This would hopefully avoid purging data when we
> see lots of single-use streaming data.
> 
> Minchan however seems to feel volatile data should be purged earlier
> than other pages, since they're a source of easily free-able memory
> (I've also argued for this in the past, but have since changed my
> mind). So he'd like a way to purge pages earlier, and unfortunately
> the shrinker runs later than he'd like.

The reason I consider volatile pages the top candidates for reclaim is
that, if we didn't support the vrange system call, users would likely use
munmap or madvise(DONTNEED) instead. That means the pages in the range
would already have been freed without the new vrange system call, so
they would have been freed earlier than pages such as streaming data.

But I agree streaming data is less useful than volatile pages.
I will think about this part more, and if others really want to handle
volatile pages in normal LRU order, I can do that easily.

Another idea: if we are sure some pages are really useless,
we can make a new LRU list (aka ezReclaimLRU) and put the pages
on that list when some advise system call happens. Then the
reclaimer peeks at the ezReclaimLRU list prior to purging volatile
pages and reclaims those first.
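
As a very rough illustration of that ordering (not actual vmscan code),
the sketch below models each list as a simple page counter. The names
page_list, page_src and reclaim_one() are made up for this example; the
only thing it tries to show is that the ezReclaimLRU list is drained
before any volatile page is purged.

```c
#include <assert.h>
#include <stddef.h>

/* Where a reclaimed page came from in this toy model. */
enum page_src { FROM_EZ_LRU, FROM_VOLATILE, NOTHING_LEFT };

/* Each list is reduced to a page count for illustration. */
struct page_list {
    size_t npages;
};

/* Reclaim one page: known-useless pages first, volatile pages second. */
enum page_src reclaim_one(struct page_list *ez, struct page_list *vol)
{
    if (ez->npages > 0) {
        ez->npages--;
        return FROM_EZ_LRU;
    }
    if (vol->npages > 0) {
        vol->npages--;
        return FROM_VOLATILE;
    }
    return NOTHING_LEFT;
}
```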

> 
> It was noted that there are now patches to make the shrinkers numa
> aware, so the older complaints might be solvable. But still the issue
> of shrinkers having their own eviction logic separate from the
> global LRU is less than ideal to me.
> 
> It was past time, and there didn't seem to be much consensus or
> resolution on this issue, so we had to leave it there. That said,
> the volatile purging logic is up to the kernel, and can be tweaked
> as needed in the future, where as the basic interface semantics were
> more important to hash out, and I think I got mostly nodding on the
> majority of the interface issues.
> 
> Hopefully with the next patch iteration, we'll have things cleaned
> up a bit more and better unified between Minchan's and my approaches
> so further details can be concretely worked out on the list. It was
> also requested that a manpage document be provided with the next
> patch set, which I'll make a point to provide.

I think the most important thing right now is how we define the vrange
semantics. Especially on the "out-of-sync" part, we need agreement as the
top priority.

> 
> Thanks so much to Minchan, Kosaki-san, Hugh, Michel, Johannes, Greg,
> Michal, Glauber, and everyone else for providing an active
> discussion and great feedback despite my likely over-caffeinated
> verbal wanderings.

John, I am looking forward to seeing our progress.
Thanks a million, again!


> 
> Thanks again,
> -john
> 
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Summary of LSF-MM Volatile Ranges Discussion
  2013-04-23  3:11 ` Summary of LSF-MM Volatile Ranges Discussion John Stultz
  2013-04-23  6:51   ` Dmitry Vyukov
  2013-04-24  8:11   ` Minchan Kim
@ 2013-05-16 17:24   ` Andrea Arcangeli
  2013-05-21  3:50     ` John Stultz
  2 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2013-05-16 17:24 UTC (permalink / raw)
  To: John Stultz
  Cc: lsf, linux-mm, Minchan Kim, Dmitry Vyukov, Paul Turner,
	Robert Love, Dave Hansen, Taras Glek, Mike Hommey,
	Kostya Serebryany, Hugh Dickins, Michel Lespinasse,
	KOSAKI Motohiro, Johannes Weiner, gthelen, Rik van Riel, glommer,
	mhocko

Hi John,

On Mon, Apr 22, 2013 at 08:11:39PM -0700, John Stultz wrote:
> with that range mapped).  I re-iterated the example of a large circular 
> buffer in a shared file, which is initialized as entirely volatile. Then 
> a producer process would mark a region after the head as non-volatile, 
> then fill it with data. And a consumer process, then consumes data from 
> the tail, and mark those consumed ranges as volatile.

If the backing filesystem isn't tmpfs: what is the point of shrinking
the pagecache of the circular buffer before other pagecache? How can
you be sure the LRU isn't going to do a better job?

If the pagecache of the circular buffer is evicted, the next time the
circular buffer overflows and you restart from the head of the buffer,
you risk hitting a page-in from disk, instead of working in RAM without
page-ins.

Or do you trigger a sigbus for filebacked pages too, and somehow avoid
the spurious page-in caused by the volatile pagecache eviction?

And if this is tmpfs and you keep the semantics the same for all
filesystems: unmapping the page won't free memory and it won't provide
any relevant benefit. It might help a bit if you drop the dirty bit
but only during swapping.

It would be a whole lot different if you created a _hole_ in the
file.

It also would make more sense if you only worked at the
pagetable/process level (not at the inode/pagecache level) and you
didn't really control which pages are evicted, but you only unmapped
the pages and let the LRU decide later, just like if it was anonymous
memory.

If you only unmap the filebacked pages without worrying about their
freeing, then it behaves the same as MADV_DONTNEED, and it'd drop the
dirty bit, the mapping and that's it. After the pagecache is unmapped,
it is also freed much quicker than mapped pagecache, so it would make
sense for your objectives.

If you associate the volatility to the inode and not to the process
"mm", I think you need to create a hole when the pagecache is
evicted, so it becomes more useful with tmpfs and the above circular
buffer example.

If you don't create a hole in the file, and you alter the LRU order
in actually freeing the pagecache, this becomes a userland hint to
the VM, that overrides the LRU order of pagecache shrinking which may
backfire. I doubt userland knows better which pagecache should be
evicted first to avoid spurious page-ins on next fault. I mean you at
least need to be sure the next fault won't trigger a spurious swap-in.

> I noted that first of all, the shared volatility is needed to match the 
> Android ashmem semantics. So there's at least an existing user. And that 
> while this method pointed out could be used, I still felt it is fairly 

Could you go into more detail on how Android is using the file
volatility?

The MADV_USERFAULT feature to offload anonymous memory to remote nodes
in combination with remap_anon_pages (to insert/remove memory)
resembles somewhat the sigbus fault triggered by evicted volatile
pages. So ideally the sigbus entry points should be shared by both
missing volatile pages and MADV_USERFAULT, to have a single branch in
the fast paths.

You can see the MADV_USERFAULT page fault entry points here in 1/4:

    http://thread.gmane.org/gmane.comp.emulators.qemu/210231

(I actually intended to add linux-mm, I'll fix the CC list at the next
submit :)

Thanks!
Andrea


* Re: Summary of LSF-MM Volatile Ranges Discussion
  2013-05-16 17:24   ` Andrea Arcangeli
@ 2013-05-21  3:50     ` John Stultz
  0 siblings, 0 replies; 10+ messages in thread
From: John Stultz @ 2013-05-21  3:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: lsf, linux-mm, Minchan Kim, Dmitry Vyukov, Paul Turner,
	Robert Love, Dave Hansen, Taras Glek, Mike Hommey,
	Kostya Serebryany, Hugh Dickins, Michel Lespinasse,
	KOSAKI Motohiro, Johannes Weiner, gthelen, Rik van Riel, glommer,
	mhocko

On 05/16/2013 10:24 AM, Andrea Arcangeli wrote:
> Hi John,
>
> On Mon, Apr 22, 2013 at 08:11:39PM -0700, John Stultz wrote:
>> with that range mapped).  I re-iterated the example of a large circular
>> buffer in a shared file, which is initialized as entirely volatile. Then
>> a producer process would mark a region after the head as non-volatile,
>> then fill it with data. And a consumer process, then consumes data from
>> the tail, and mark those consumed ranges as volatile.
> If the backing filesystem isn't tmpfs: what is the point of shrinking
> the pagecache of the circular buffer before other pagecache? How can
> you be sure the LRU isn't going to do a better job?
So, tmpfs is really the main target for shared volatile ranges in my
mind. But if you were using non-tmpfs files, you could end up possibly
saving disk writes by purging dirty data instead of writing it out. Now,
we'd still need to punch a hole in the file in order to be consistent
(we don't want old data to persist there if we purged it), but depending
on the fs it may be cheaper to punch a hole than to write out lots of
dirty data.

But again, tmpfs is really the main target here.
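
A minimal userspace sketch of the hole-punch semantics referred to
above, using the existing fallocate(2) interface on a memfd (which is
shmem-backed, so tmpfs-like). This is only an illustration under those
assumptions, not the volatile-range purge path itself, and
punch_hole_demo is a made-up helper name. After the punch, the range
reads back as zeros while the file size stays the same:

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>   /* memfd_create() */
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if the punched range reads as zeros, the other pages keep
 * their data and the file size is unchanged; 0 if not; -1 on error.
 */
int punch_hole_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t size = 4 * (off_t)page;
    struct stat st;
    int fd, i, ok = -1;
    long j;
    char *buf = malloc(page);

    if (!buf)
        return -1;
    fd = memfd_create("purge-demo", 0);
    if (fd < 0)
        goto out_free;

    /* Fill four pages with non-zero data. */
    memset(buf, 'x', page);
    for (i = 0; i < 4; i++)
        if (pwrite(fd, buf, page, i * (off_t)page) != page)
            goto out_close;

    /* Drop pages 1 and 2 without changing the file size. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  page, 2 * (off_t)page) != 0)
        goto out_close;

    if (fstat(fd, &st) != 0)
        goto out_close;
    ok = (st.st_size == size);

    /* Pages 1 and 2 must now read as zeros, pages 0 and 3 as 'x'. */
    for (i = 0; i < 4 && ok == 1; i++) {
        char expect = (i == 1 || i == 2) ? 0 : 'x';

        if (pread(fd, buf, page, i * (off_t)page) != page) {
            ok = -1;
            break;
        }
        for (j = 0; j < page; j++)
            if (buf[j] != expect)
                ok = 0;
    }
out_close:
    close(fd);
out_free:
    free(buf);
    return ok;
}
```

On a disk-backed filesystem the same call would also discard the dirty
data in the range instead of writing it out, on filesystems that
support hole punching.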


> If the pagecache of the circular buffer is evicted, the next time the
> circular buffer overflows and you restart from the head of the buffer,
> you risk hitting a page-in from disk, instead of working in RAM without
> page-ins.
>
> Or do you trigger a sigbus for filebacked pages too, and somehow avoid
> the spurious page-in caused by the volatile pagecache eviction?

There would be a SIGBUS, but after the range is marked non-volatile, if 
a read is done immediately after, that could trigger a page-in. If it 
was written to immediately, I suspect we'd avoid it. But this example 
isn't one I've looked at in particular.
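
A hedged sketch of how userspace might catch such a SIGBUS. Since the
volatile-range interface is still being worked out, truncating the
backing file stands in here for the kernel purging the page (touching
a shared mapping beyond end-of-file also raises SIGBUS). The names
touch_missing_page and on_sigbus are hypothetical:

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>   /* mmap(), memfd_create() */
#include <unistd.h>

static sigjmp_buf fault_env;

/* Jump back to the recovery point instead of retrying the access. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);
}

/*
 * Returns 1 if touching the missing page raised SIGBUS, 0 if the
 * access succeeded, -1 on setup error.
 */
int touch_missing_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    struct sigaction sa, old;
    volatile char *p;
    volatile int faulted = 0;
    int fd = memfd_create("sigbus-demo", 0);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    p[0] = 'x';                     /* page is resident and mapped */

    /* ASSUMPTION: truncation stands in for a purge of the page. */
    if (ftruncate(fd, 0) != 0) {
        munmap((void *)p, page);
        close(fd);
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(fault_env, 1) == 0)
        (void)p[0];                 /* the page is gone: SIGBUS */
    else
        faulted = 1;                /* recovery path */

    sigaction(SIGBUS, &old, NULL);
    munmap((void *)p, page);
    close(fd);
    return faulted;
}
```

In the volatile-range case the recovery path would presumably mark the
range non-volatile and regenerate the data, rather than just recording
that the fault happened.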


> And if this is tmpfs and you keep the semantics the same for all
> filesystems: unmapping the page won't free memory and it won't provide
> any relevant benefit. It might help a bit if you drop the dirty bit
> but only during swapping.
>
> It would be a whole lot different if you created a _hole_ in the
> file.
Right. When we purge pages it should be the same as punching a hole 
(we're using truncate_inode_pages_range).


> It also would make more sense if you only worked at the
> pagetable/process level (not at the inode/pagecache level) and you
> didn't really control which pages are evicted, but you only unmapped
> the pages and let the LRU decide later, just like if it was anonymous
> memory.
>
> If you only unmap the filebacked pages without worrying about their
> freeing, then it behaves the same as MADV_DONTNEED, and it'd drop the
> dirty bit, the mapping and that's it. After the pagecache is unmapped,
> it is also freed much quicker than mapped pagecache, so it would make
> sense for your objectives.

Hmmm. I'll have to consider this further. Ideally I think we'd like the
purging to be done by the LRU (the one problem is that anonymous pages
aren't normally aged off the LRU when we don't have swap - thus
Minchan's use of a shrinker to force anonymous page purging). But it
sounds like you're suggesting we do it in two steps: first purge via
shrinker and unmap the pages, then allow the eviction to be done by the
LRU. I'm not sure how that would work with the hole-punching, but I'll
have to look closer.


> If you associate the volatility to the inode and not to the process
> "mm", I think you need to create a hole when the pagecache is
> evicted, so it becomes more useful with tmpfs and the above circular
> buffer example.
So, for shared volatility, we do associate it with the address_space.
For private volatility, it's associated with the mm.


> If you don't create a hole in the file, and you alter the LRU order
> in actually freeing the pagecache, this becomes a userland hint to
> the VM, that overrides the LRU order of pagecache shrinking which may
> backfire. I doubt userland knows better which pagecache should be
> evicted first to avoid spurious page-ins on next fault. I mean you at
> least need to be sure the next fault won't trigger a spurious swap-in.
>
>> I noted that first of all, the shared volatility is needed to match the
>> Android ashmem semantics. So there's at least an existing user. And that
>> while this method pointed out could be used, I still felt it is fairly
> Could you go into more detail on how Android is using the file
> volatility?
>
> The MADV_USERFAULT feature to offload anonymous memory to remote nodes
> in combination with remap_anon_pages (to insert/remove memory)
> resembles somewhat the sigbus fault triggered by evicted volatile
> pages. So ideally the sigbus entry points should be shared by both
> missing volatile pages and MADV_USERFAULT, to have a single branch in
> the fast paths.
>
> You can see the MADV_USERFAULT page fault entry points here in 1/4:
>
>      http://thread.gmane.org/gmane.comp.emulators.qemu/210231

As for the entry points, I suspect you mean just the vma_flag check?
I'm somewhat skeptical. Minchan's trick of checking a pte flag on fault
to see if the page was purged seems pretty nice to me (though I haven't
managed to work out the flag for file pages yet, so for now I'm using a
stupid lookup on fault instead, as we work out the interface
semantics). Though maybe Minchan's pte flag approach might work for your
case?

But I'll have to look closer at this. Taras @ Mozilla pointed me to it 
earlier and I thought the notification was vaguely similar.

MikeH: Do you have any thoughts on whether the file polling done in the
description below makes sense instead of using SIGBUS?
http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg05274.html

I worry the handling is somewhat cross-process w/ the polling method, and
that might make it too complex, esp with private volatility on anonymous
pages (ie: what backs that isn't going to be known by a different process).

thanks
-john



Thread overview: 10+ messages
2013-04-17 17:56 LSF-MM Volatile Ranges Discussion Plans John Stultz
2013-04-17 20:12 ` Paul Turner
2013-04-23  3:11 ` Summary of LSF-MM Volatile Ranges Discussion John Stultz
2013-04-23  6:51   ` Dmitry Vyukov
2013-04-24  0:26     ` John Stultz
2013-04-24  6:11       ` Dmitry Vyukov
2013-04-24  8:14     ` Minchan Kim
2013-04-24  8:11   ` Minchan Kim
2013-05-16 17:24   ` Andrea Arcangeli
2013-05-21  3:50     ` John Stultz
