* notes from LSF/MM 2016 memory management track
From: Rik van Riel @ 2016-04-22 16:09 UTC
To: lsf; +Cc: linux-mm@kvack.org
Here are my notes from the LSF/MM 2016 MM track.
I expect LWN.net to have nicely readable articles
on most of these discussions.
LSF/MM 2016
Memory Management track notes
Transparent Huge Pages
Kirill & Hugh have different implementations of tmpfs transparent
huge pages
Kirill can split 4k pages out of huge pages, to avoid splits
(refcounting implementation, compound pages)
Hugh's implementation: get it up and running quickly and
unobtrusively (team pages)
Kirill's implementation can dirty 4kB inside a huge page on write()
Kirill wants to get huge pages in page cache to work for ext4
cannot be transparent to the filesystem
Hugh: what about small files? huge pages would be wasted space
Kirill: madvise for THP, or a file-size-based policy
at write time allocate 4kB pages, khugepaged can collapse them
Andrea: what is the advantage of using huge pages for small files?
Hugh: 2MB initial allocation is shrinkable, not charged to memcg
Kirill: for tmpfs, also need to check against tmpfs filesystem size
when deciding what page size to allocate
Kirill: does not like how tmpfs is growing more and more special
cases (radix tree exception entries, etc)
Aneesh, Andrea: also not happy that the kernel would grow yet another
kind of huge page
Hugh: Kirill can probably use the same mlock logic my code uses
Kirill: I do not mlock pages, just VMAs, prevent pageout that way
Hugh: Kirill has some stuff working better than I realized, maybe
can still use some of my code
Kirill: on split hugepmd Hugh has a split with ptes, Kirill just
blows away PMD and lets faults fill in PTEs
Hugh: what Kirill's code does is not quite correct for mlock
Kirill: mlock does not guarantee lack of minor faults
Aneesh: PPC64 needs deposited page tables
hardware page table hashed on actual page size, huge page is only
logical not HW supported
last level page table stores slot/hash information
Andrea: do not worry too much about memory consumption with THP
if worried, do small allocations and let khugepaged collapse them
use same model for THP file cache as used for THP anonymous memory
Andrea/Kirill/Hugh:
no need to use special radix tree entries for huge pages, in
general
at hole punch time could be useful later, as an optimization
might want a way to mark 4kB pages dirty on radix tree side, inside
a compound page (or use page flags on tail page struct)
Hugh: how about two radix trees?
Everybody else: yuck :)
Andrea: with the compound model, I see no benefit to multiple radix
trees
First preparation series (by Hugh) already went upstream
Kirill can use some of Hugh's code
DAX needs some of the same code, too
Hugh: compound pages could be extended to offer my functionality,
would like to integrate what he has
settling on sysfs/mount options before freezing
then add compound pages on top
Hugh: current show stopper with Kirill's code:
small files, hole punching
khugepaged -> task_work
Advantage: concentrate thp on tasks that use most CPU and could
benefit from them the most
Hugh: having one single scanner/compacter might have advantages
When to trigger scanning?
Hugh: observe at page fault time? Vlastimil: if there are no faults
because the memory is already present, there would not be an
observation event
Johannes: wait for someone to free a THP?
maybe background scanning still best?
merge plans
Hugh would like to merge team pages now, switch to compound pages
later
Kirill would like to get compound pages into shape first, then
merge things
Andrea: if we go with team pages, we should ensure it is the right
solution for both anonymous memory and ext4
Andrea: can we integrate the best parts of both code bases and
merge that?
Mel: one of my patch series is heavily colliding with team pages
(moving accounting from zones to nodes)
Andrew: need a decision on team pages vs compound pages
Hugh: if compound pages went in first, we would not replace it with
team pages later - but the other way around might happen
merge blockers
Compound pages issues: small files memory waste, fast recovery for
small files, get khugepaged into shape, maybe deposit/withdrawal,
demonstrate recovery, demonstrate robustness (or Hugh demonstrates
brokenness)
Team page issues: recovery (khugepaged cannot collapse team pages),
anonymous memory support (Hugh: pretty sure it is possible), API
compatible to test compound, don't use page->private, path forward for
other filesystems
revert team page patches from MMOTM until blockers addressed
GFP flags
__GFP_REPEAT
fuzzy semantics, keep retrying until an allocation succeeds
for higher order allocations
but most used for order 0... (not useful)
can be cleaned up, and get a useful semantic for higher order
allocations
"can fail, try hard to be successful, but could still fail in the
end"
__GFP_NORETRY - fail after single attempt to reclaim something, not
very helpful except for optimistic/opportunistic allocations
maybe have __GFP_BEST_EFFORT, try until a certain point then give
up? (retry until OOM, then fail?)
remove __GFP_REPEAT from non-costly allocations
introduce new flag, use it where useful
can the allocator know compaction was deferred?
more explicit flags? NORECLAIM NOKSWAPD NOCOMPACT NO_OOM etc...
use explicit flags to switch stuff off
clameter: have default definitions with all the "normal stuff"
enabled
flags inconsistent - sometimes positive, sometimes negative,
sometimes for common things, sometimes for uncommon things
THP allocation not explicit, but inferred from certain flags
consensus on cleaning up GFP usage
CMA
KVM on PPC64 runs into strange hardware requirements
needs contiguous memory for certain data structures
tried to reduce fragmentation/allocation issues with ZONE_CMA
atomic 0 order allocations fail early, due to kswapd not kicking in
on time
taking pages out of CMA zone first
compaction does not move movable compound pages (eg. THP), breaking
CMA in ZONE_CMA
mlock and other things pinning allocated-as-movable pages also
break CMA
what to do instead of ZONE_CMA?
how to keep things movable? sticky MIGRATE_MOVABLE zones?
do not allow reclaimable & unmovable allocations in sticky
MIGRATE_MOVABLE zones
memory hotplug has similar requirements to CMA, no need for a new
name
need something like physical memory linear reclaim, finding sticky
MIGRATE_MOVABLE zones and reclaiming everything inside
Mel: would like to see ZONE_CMA and ZONE_MOVABLE go away
FOLL_MIGRATE get_user_pages flag to move pages away from movable
region when being pinned
should be handled by core code, get_user_pages
Compaction, higher order allocations
compaction not invoked from THP allocations with delayed
fragmentation patch set
kcompactd daemon for background compaction
should kcompactd do fast direct reclaim? let's see
cooperation with OOM
compaction - hard to get useful feedback about
compaction "does random things, returns with random answer"
no notion of "costly allocations"
compaction can keep indefinitely deferring action, even for smaller
allocations (eg. order 2)
sometimes compaction finds too many page blocks with the skip bit
set
success rate of compaction skyrocketed with skip bits ignored
(stale skip bits?)
migrate skips over MIGRATE_UNMOVABLE page blocks found during order
9 compaction
page block may be perfectly suitable for smaller order compaction
have THP skip more aggressively, while order 2 scans inside more
page blocks
priority for compaction code? aggressiveness of diving into blocks
vs skipping
order 9 allocators:
THP - wants allocation to fail quickly if no order 9 available
hugetlbfs - really wants allocations to succeed
VM containers
VMs imply more memory consumption than what the applications
running in them need
How to pressure the guest to give memory back to the host?
Adding new shrinker did not seem to perform well
Move page cache to the host so it would be easier to reclaim memory
for all guests
Move memory management from guest kernel to host, some kind of
memory controller
Have the guest tell the host how to reclaim, sharing LRU for
instance
mmu_notifier already shares some information via the access bit
(young), but mmu_notifier is too coarse
DAX (in the guest) should be fine to solve filesystem memory
if not DAX backed on the host, needs new mechanism for IO barriers,
etc
FUSE driver in the guest and move filesystem to the host
Exchange memory pressure info between guest and host so that the host
can ask the guest to adjust its pressure depending on the host's
overall situation
Generic page-pool recycle facility
found bottlenecks in both page allocator and DMA APIs
"packet-page" / explicit data path API
make it generic across multiple use cases
get rid of open coded driver approaches
Mel: make per-cpu allocator fast enough to act as the page pool
gets NUMA locality, shrinking, etc all for free
needs pool sizing for used pool items, too - can't keep collecting
incoming packets without handling them
allow page allocator to reclaim memory
Address Space Mirroring
Haswell-EX allows memory mirroring, partial or all memory
goal: improve high availability by avoiding uncorrectable errors in
kernel memory
partial has higher remaining memory capacity, but not software
transparent
some memory mirrored, some not
mirrored memory set up in BIOS, amount in each NUMA node
proportional to amount of memory in each node
mirror range info in EFI memory map
avoid kernel allocations from non-mirrored memory ranges, avoid
ZONE_MOVABLE allocations
put user allocations in non-mirrored memory, avoid ZONE_NORMAL
allocations
MADV_MIRROR to put certain user memory in mirrored memory
problem: to put a whole program in mirrored memory, need to
relocate libraries into mirrored memory
what is the value proposition of mirroring user space memory?
policy: when mirrored memory is requested, do not fall back to non-
mirrored memory
Michal: is this desired?
Aneesh: how should we represent mirrored memory? zones? something
else?
Michal: we are back to highmem problem
lesson from highmem era: keep ratio of kernel to non-kernel memory
low enough, below 1:4
how much userspace needs to be in mirrored memory, in order to be
able to restart applications?
should we have opt-out for mirrored instead of opt-in?
proposed interface: prctl
kcore mirror code upstream already
Mel: systems using lots of ZONE_MOVABLE have problems, and are
often unstable
Mel: assuming userspace can figure out the right thing to choose
what needs to be mirrored is not safe
Vlastimil: use non-mirrored memory as frontswap only, put all
managed memory in mirrored memory
dwmw2: for workload of "guest we care about, guests we don't care
about", we can allocate only guest memory for unimportant guests in
non-mirrored memory
Mel: even in that scenario a non-important guest's kernel
allocations could exhaust mirrored memory
Mel: partial mirroring makes a promise of reliability that it
cannot deliver on
false hope
complex configuration makes the system less reliable
Andrea: memory hotplug & other zone_movable users already cause the
same problems today
Heterogeneous Memory Management
used for GPU, CAPI, other kinds of offload engines
GPU has much faster memory than system RAM
to get performance, GPU offload data needs to sit in VRAM
shared address space creates an easier programming model
needs ability to migrate memory between system RAM and VRAM
CPU cannot access VRAM
GPU can access system RAM ... very very slowly
hardware is coming up real soon (this year)
without HMM
GPU workloads running 10-100x slower
need to pin lots of system memory (16GB per device?)
use of mmu_notifier spreading to device drivers, instead of one
common solution
special swap type to handle migration
future openCL API wants address space sharing
HMM has some core VM impact, but relatively contained
how to get HMM upstream? does anybody have objections to anything
in HMM?
split up in several series
Andrew: put more info in the changelogs
space for future optimizations
dwmw2: svm API, should move to a generic API
intel_svm_bind_mm - bind the current process to a PASID
MM validation & debugging
Sasha using KASAN on locking, trap missed locks
requires annotation of what memory is locked by a lock
how to annotate what memory is protected by a lock?
Kirill: what about a struct with a lock inside?
annotate struct members with which lock protects it?
too much work
trying to improve hugepage testing
split_all_huge_pages
expose list of huge pages through debugfs, allow splitting
arbitrarily chosen ones
fuzzer to open, close, read & write random files in sysfs & debugfs
how to coordinate security(?) issues with zero-day security folks?
Memory cgroups
how to figure out the memory a cgroup needs (as opposed to
currently used)?
memory pressure is not enough to determine the needs of a cgroup
cgroups scanned in equal portion
unfair, streaming file IO can result in using lots of memory, even
when the cgroup has mostly inactive file pages
potential solution:
dynamically balance the cgroups
adjust limits dynamically, based on their memory pressure
problem: how to detect memory pressure?
when to increase memory? when to decrease memory?
real time aging of various LRU lists
only for active / anon lists, not the inactive file list
"keep cgroup data in memory if its working set is younger than X
seconds"
refault info: distinguish between refaults (working set faulted
back in), and evictions of data that is only used once
can be used to know when to grow a cgroup, but not when to shrink
it
vmpressure API: does not work well on very large systems, only on
smaller ones
quickly reaches "critical" levels on large systems, that are not
even that busy
Johannes: time-based statistic to measure how much time processes
wait for IO
not iowait, which measures how long the _system_ waits, but per-
task
add refault info in, only count time spent on refaults
wait time above threshold? grow cgroup
wait time under threshold? shrink cgroup, but not below lower limit
Larry: Docker people want per-cgroup vmstat info
TLB flush optimizations
mmu_gather side of tlb flush
collect invalidations, gather items to flush
patch: increase size of mmu_gather, and try to flush more at once
Andrea - rmap length scalability issues
too many KSM pages merged together, rmap chain becomes too long
put upper limit on number of shares of a KSM page (256 share limit)
mmu_notifiers batch flush interface?
limit max_page_sharing to reduce KSM rmap chain length
OOM killer
goal: make OOM invocation more deterministic
currently: reclaim until there is nothing left to reclaim, then
invoke OOM killer
problem: sometimes reclaim gets stuck, and OOM killer is not
invoked when it should
one single page free resets the OOM counter, causing livelock
thrashing not detected, on the contrary helps thrashing happen
make things more conservative?
OOM killer invoked on heavy thrashing and no progress made in the
VM
OOM reaper - to free resources before OOM killed task can exit by
itself
timeout based solution is not trivial, doable, but not preferred by
Michal
if Johannes can make a timeout scheme deterministic, Michal has no
objections
Michal: I think we can do better without a timer solution
need deterministic way to put system into a consistent state
tmpfs vs OOM killer
OOM killer cannot discard tmpfs files
with cgroups, reap giant tmpfs file anyway in special cases at
Google
restart whole container, dump container's tmpfs contents
MM tree workflow
most of Andrew's job: solicit feedback from people
-mm git tree helps many people
Michal: would like email message ID references in patches, both
for original patches and fixes
the value of -fix patches is that previous reviews do not need to
get re-done
sometimes a replacement patch is easier
Kirill: sometimes difficult to get patch sets reviewed
generally adds acked-by and reviewed-by lines by hand
Michal: -mm tree is maintainer tree of last resort
Andrew: carrying those extra patches isn't too much work
SLUB optimizations lightning talk
bulk APIs for SLUB + SLAB
kmem_cache_{alloc,free}_bulk()
kfree_bulk()
60% speedup measured
can be used from network, rcu free, ...
per CPU freelist per page
nice speedup, but still suffers from a race condition