* Re: API/syscall to alleviate page/memory problem when quickly accessing memory?
From: Vlastimil Babka @ 2023-04-03 6:30 UTC
To: Levo D, linux-api, linux-mm@kvack.org
On 4/3/23 04:31, Levo D wrote:
> I optimized and profiled my program to the point where it seems like it's spending more time in the kernel than in userspace (likely not true, but I'll explain).
>
> Here's one run. I spawn many threads (6 at minimum, more depending on flags). As you can see, sys accounts for more than half of the real (wall-clock) time. Is the kernel running on multiple cores simultaneously to give my program pages?
>
> real 0m0.954s
> user 0m6.442s
> sys 0m0.607s
>
> The run below uses -test-flags, which gets me these numbers; sys is 51% of the real time:
>
> real 0m0.733s
> user 0m3.476s
> sys 0m0.378s
>
>
> `perf record -F 5000 ./myapp -test-flags` shows that 61% of the time is in my biggest function and 6% is in `clear_page_rep`. When I record cache misses using `perf record -F 5000 --call-graph=fp -e cache-misses ./myapp -test-flags` I can see that:
>
> clear_page_rep takes 40%
> clear_huge_page takes 1.2%
> My big function's self time is 8%, while its total is 25.5%. The remainder is mostly asm_exc_page_fault (12%) and asm_sysvec_apic_timer_interrupt (2.7%).
> That's about 56% (of all misses and waiting) in the kernel.
>
> I believe that if I can reduce the work being done in the kernel
That's not possible: the kernel must clear pages before giving them to a
process, for security reasons.
> and have pages be ready before I fault
That is possible if you use the MAP_POPULATE flag of mmap(). Or just write
once to each page before starting your large function, to pre-fault it.
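For illustration, a minimal sketch of both options (the names and the error
handling shortcuts below are placeholders, not anything from your program):

#define _GNU_SOURCE   /* for MAP_POPULATE on glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Option 1: ask the kernel to pre-fault the whole mapping up front.
 * (Error handling omitted; caller should check for MAP_FAILED.) */
static void *alloc_populated(size_t size)
{
        return mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
}

/* Option 2: touch one byte per page of an existing mapping so the faults
 * (and the page clearing) happen here instead of inside the hot function. */
static void prefault(void *buf, size_t size)
{
        volatile uint8_t *p = buf;
        size_t page = (size_t)sysconf(_SC_PAGESIZE);

        for (size_t off = 0; off < size; off += page)
                p[off] = 0;
}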
In that case it may also make sense not to measure the time of your whole
program execution, but only the part between initialization (including the
pre-faulting) and cleanup. The whole runtime already seems too short to
profit much from further optimization if the init/cleanup is included each time.
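A rough sketch of timing just that part, assuming a hypothetical
big_function() standing in for your real workload:

#include <stdio.h>
#include <time.h>

extern void big_function(void);   /* hypothetical hot function */

int main(void)
{
        struct timespec start, end;

        /* ... mmap()/pre-fault and other initialization here ... */

        clock_gettime(CLOCK_MONOTONIC, &start);
        big_function();
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("hot region: %.3f s\n",
               (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9);

        /* ... cleanup ... */
        return 0;
}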
If the runtime of the "large function" is important because it would be run many
times in practice, then it could also make sense to keep the initialized
process running and reuse the allocated memory, instead of repeatedly
executing new processes and paying the free and reallocation costs each time.
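Purely as an illustration of that (run_once() and the iteration count are
hypothetical), the reuse could look like:

#include <stddef.h>

/* Hypothetical worker operating on an already-mapped buffer. */
extern void run_once(void *buf, size_t size);

void run_many(void *buf, size_t size, int iterations)
{
        /* buf was mapped and pre-faulted once; each iteration reuses the
         * same pages, so no further page faults or page clearing occur. */
        for (int i = 0; i < iterations; i++)
                run_once(buf, size);
}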
> I'll have fewer cache misses in my large function and it could be significantly faster. I measured how long my large function takes single-threaded compared to multithreaded. Multithreaded is at minimum 1.5x to 2x slower. I spawn 1 thread per core (I'm testing on a Zen 2; it has 6 cores with 12 threads, and spawning more than 6 threads slows the program down). Each thread uses <100MB.
This seems to be more about how hyperthreading (SMT) doesn't always really
result in speedups, so that's about the CPU vs the workload rather than the kernel.
> Is there an API I should look into? What can I do here?
>
>