From: Gregory Price <gregory.price@memverge.com>
To: linux-cxl@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>,
Dave Jiang <dave.jiang@intel.com>
Subject: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
Date: Wed, 12 Apr 2023 14:43:33 -0400
Message-ID: <ZDb71ZXGtzz0ttQT@memverge.com>
I was looking to validate the mlock-ability of various pages while a CXL
device is in different states (NUMA-onlined, devdax, etc.), and I
discovered a page_table_check BUG when accessing memory-expander memory
while the device is in devdax mode.

The BUG fires on the fault of the first accessed page, essentially:
int dax_fd = open(device_path, O_RDWR);
void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
((char*)mapped_memory)[0] = 1;
Full details of my test follow.
Step 1) Test that memory onlined in NUMA node works
[user@host0 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 63892 MB
node 0 free: 59622 MB
node 1 cpus:
node 1 size: 129024 MB
node 1 free: 129024 MB
node distances:
node     0    1
   0:   10   50
   1:  255   10
[user@host0 ~]# numactl --preferred=1 memhog 128G
... snip ...
Passes no problem, all memory is accessible and used.
Step 2) Reconfigure the device to devdax mode
[user@host0 ~]# daxctl list
[
{
"chardev":"dax0.0",
"size":137438953472,
"target_node":1,
"align":2097152,
"mode":"system-ram",
"online_memblocks":63,
"total_memblocks":63,
"movable":true
}
]
[user@host0 ~]# daxctl offline-memory dax0.0
offlined memory for 1 device
[user@host0 ~]# daxctl reconfigure-device --human --mode=devdax dax0.0
{
"chardev":"dax0.0",
"size":"128.00 GiB (137.44 GB)",
"target_node":1,
"align":2097152,
"mode":"devdax"
}
reconfigured 1 device
[user@host0 mapping0]# daxctl list -M -u
{
"chardev":"dax0.0",
"size":"128.00 GiB (137.44 GB)",
"target_node":1,
"align":2097152,
"mode":"devdax",
"mappings":[
{
"page_offset":"0",
"start":"0x1050000000",
"end":"0x304fffffff",
"size":"128.00 GiB (137.44 GB)"
}
]
}
Step 3) Map and access the memory via /dev/dax0.0 (test program attached
below). The first write to the mapping faults, and the kernel hits the BUG:
[ 1028.430734] kernel BUG at mm/page_table_check.c:53!
[ 1028.430753] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 1028.430763] CPU: 14 PID: 5292 Comm: daxmemtest Not tainted 6.3.0-rc6-dirty #22
[ 1028.430774] Hardware name: AMD Corporation ONYX/ONYX, BIOS ROX1006C 03/01/2023
[ 1028.430785] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.430798] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.430820] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.430829] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.430838] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.430847] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.430857] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.430867] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.430876] FS: 00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.430887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.430895] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.430905] PKRU: 55555554
[ 1028.430909] Call Trace:
[ 1028.430914] <TASK>
[ 1028.430919] vmf_insert_pfn_pmd_prot+0x2b4/0x360
[ 1028.430929] dev_dax_huge_fault+0x181/0x400 [device_dax]
[ 1028.430941] __handle_mm_fault+0x806/0xfe0
[ 1028.430951] handle_mm_fault+0x189/0x460
[ 1028.430958] do_user_addr_fault+0x1e0/0x730
[ 1028.430968] exc_page_fault+0x7e/0x200
[ 1028.430977] asm_exc_page_fault+0x22/0x30
[ 1028.430984] RIP: 0033:0x401262
[ 1028.430990] Code: 00 b8 00 00 00 00 e8 1d fe ff ff 8b 45 f4 89 c7 e8 23 fe ff ff b8 01 00 00 00 eb 2a bf 7e 20 40 00 e8 e2 fd ff ff 48 8b 45 e0 <c6> 00 01 8b 45 f4 89 c7 e8 01 fe ff ff bf 85 20 40 00 e8 c7 fd ff
[ 1028.431011] RSP: 002b:00007fff888d8fc0 EFLAGS: 00010206
[ 1028.431019] RAX: 00007f0a65a00000 RBX: 00007fff888d90f8 RCX: 00007f0a65f01c37
[ 1028.431242] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f0a65ffaa70
[ 1028.431446] RBP: 00007fff888d8fe0 R08: 0000000000000003 R09: 0000000000000000
[ 1028.431651] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 1028.431852] R13: 00007fff888d9108 R14: 0000000000403e18 R15: 00007f0a6610d000
[ 1028.432053] </TASK>
[ 1028.432247] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype nft_compat br_netfilter bridge rpcsec_gss_krb5 stp llc auth_rpcgss overlay nfsv4 dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr sunrpc vfat fat intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm kmem device_dax dax_cxl irqbypass rapl wmi_bmof pcspkr dax_hmem cxl_mem ipmi_ssif cxl_port acpi_ipmi ipmi_si ipmi_devintf i2c_piix4 cxl_pci k10temp ipmi_msghandler cxl_acpi cxl_core acpi_cpufreq fuse zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme nvme_core ast tg3 i2c_algo_bit nvme_common sp5100_tco ccp wmi
[ 1028.434042] ---[ end trace 0000000000000000 ]---
[ 1028.434278] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.434518] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.435009] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.435251] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.435501] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.435744] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.435985] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.436220] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.436454] FS: 00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.436683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.436910] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.437140] PKRU: 55555554
[ 1028.437375] note: daxmemtest[5292] exited with preempt_count 1
Test program:
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
int main() {
	// Open the DAX device
	const char *device_path = "/dev/dax0.0"; // Replace with your DAX device path
	int dax_fd = open(device_path, O_RDWR);
	if (dax_fd < 0) {
		printf("Error: Unable to open DAX device: %s\n", strerror(errno));
		return 1;
	}
	printf("file opened\n");

	// Memory-map the DAX device
	size_t size = 1024*1024*2; // 2MB
	void *mapped_memory = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
	if (mapped_memory == MAP_FAILED) {
		printf("Error: Unable to mmap DAX device: %s\n", strerror(errno));
		close(dax_fd);
		return 1;
	}
	printf("mmaped\n");

	((char*)mapped_memory)[0] = 1;

	/*
	// Lock the memory region using mlock
	int result = mlock(mapped_memory, size);
	if (result != 0) {
		printf("Error: Unable to lock memory using mlock: %s\n", strerror(errno));
		munmap(mapped_memory, size);
		close(dax_fd);
		return 1;
	}
	printf("mlocked\n");

	// Use the mapped_memory for your application
	// Remember to unlock the memory using munlock before unmapping it
	result = munlock(mapped_memory, size);
	if (result != 0) {
		printf("Error: Unable to unlock memory using munlock: %s\n", strerror(errno));
	}
	printf("munlocked\n");

	munmap(mapped_memory, size);
	*/
	close(dax_fd);
	printf("success\n");
	return 0;
}
CXL topology at time of error:
[user@host0 ~]# ./cxl list -vvvv
[
{
"bus":"root0",
"provider":"ACPI.CXL",
"nr_dports":1,
"dports":[
{
"dport":"pci0000:3f",
"alias":"ACPI0016:00",
"id":4
}
],
"endpoints:root0":[
{
"endpoint":"endpoint1",
"host":"mem0",
"depth":1,
"memdev":{
"memdev":"mem0",
"ram_size":137438953472,
"health":{
"maintenance_needed":true,
"performance_degraded":false,
"hw_replacement_needed":false,
"media_normal":false,
"media_not_ready":false,
"media_persistence_lost":true,
"media_data_lost":false,
"media_powerloss_persistence_loss":false,
"media_shutdown_persistence_loss":false,
"media_persistence_loss_imminent":false,
"media_powerloss_data_loss":false,
"media_shutdown_data_loss":false,
"media_data_loss_imminent":false,
"ext_life_used":"unknown",
"ext_temperature":"normal",
"ext_corrected_volatile":"normal",
"ext_corrected_persistent":"normal",
"life_used_percent":4,
"temperature":0,
"dirty_shutdowns":0,
"volatile_errors":0,
"pmem_errors":0
},
"alert_config":{
"life_used_prog_warn_threshold_valid":false,
"dev_over_temperature_prog_warn_threshold_valid":true,
"dev_under_temperature_prog_warn_threshold_valid":false,
"corrected_volatile_mem_err_prog_warn_threshold_valid":false,
"corrected_pmem_err_prog_warn_threshold_valid":false,
"life_used_prog_warn_threshold_writable":false,
"dev_over_temperature_prog_warn_threshold_writable":true,
"dev_under_temperature_prog_warn_threshold_writable":false,
"corrected_volatile_mem_err_prog_warn_threshold_writable":false,
"corrected_pmem_err_prog_warn_threshold_writable":false,
"life_used_crit_alert_threshold":75,
"life_used_prog_warn_threshold":25,
"dev_over_temperature_crit_alert_threshold":150,
"dev_under_temperature_crit_alert_threshold":65360,
"dev_over_temperature_prog_warn_threshold":75,
"dev_under_temperature_prog_warn_threshold":65472,
"corrected_volatile_mem_err_prog_warn_threshold":16,
"corrected_pmem_err_prog_warn_threshold":0
},
"serial":9947034750368612352,
"host":"0000:3f:00.0",
"partition_info":{
"total_size":137438953472,
"volatile_only_size":137438953472,
"persistent_only_size":0,
"partition_alignment_size":0
}
},
"decoders:endpoint1":[
{
"decoder":"decoder1.0",
"resource":70061654016,
"size":137438953472,
"interleave_ways":1,
"region":"region0",
"dpa_resource":0,
"dpa_size":137438953472,
"mode":"ram"
}
]
}
],
"decoders:root0":[
{
"decoder":"decoder0.0",
"resource":70061654016,
"size":137438953472,
"interleave_ways":1,
"max_available_extent":0,
"volatile_capable":true,
"nr_targets":1,
"targets":[
{
"target":"pci0000:3f",
"alias":"ACPI0016:00",
"position":0,
"id":4
}
],
"regions:decoder0.0":[
{
"region":"region0",
"resource":70061654016,
"size":137438953472,
"type":"ram",
"interleave_ways":1,
"interleave_granularity":4096,
"decode_state":"commit",
"mappings":[
{
"position":0,
"memdev":"mem0",
"decoder":"decoder1.0"
}
],
"daxregion":{
"id":0,
"size":137438953472,
"align":2097152,
"devices":[
{
"chardev":"dax0.0",
"size":137438953472,
"target_node":1,
"align":2097152,
"mode":"devdax"
}
]
}
}
]
}
]
}
]
~Gregory
Thread overview: 6+ messages
2023-04-12 18:43 Gregory Price [this message]
2023-04-13 11:39 ` [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check Gregory Price
2023-04-18 6:43 ` Dan Williams
2023-04-20 0:58 ` Gregory Price
2023-04-18 6:35 ` Dan Williams
2023-04-20 1:29 ` Gregory Price