From: Gregory Price <gregory.price@memverge.com>
To: linux-cxl@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>,
	Dave Jiang <dave.jiang@intel.com>
Subject: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
Date: Wed, 12 Apr 2023 14:43:33 -0400
Message-ID: <ZDb71ZXGtzz0ttQT@memverge.com>



I was looking to validate the mlock-ability of pages while CXL memory is
in various states (NUMA node, dax, etc.), and I discovered a
page_table_check BUG when accessing Memory Expander memory while the
device is in devdax mode.

The BUG fires on the fault of the first page accessed through the
mapping:

int dax_fd = open(device_path, O_RDWR);
void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
((char*)mapped_memory)[0] = 1;


Full details of my test here:

Step 1) Test that memory onlined in NUMA node works

[user@host0 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 63892 MB
node 0 free: 59622 MB
node 1 cpus:
node 1 size: 129024 MB
node 1 free: 129024 MB
node distances:
node   0   1
  0:  10  50
  1:  255  10


[user@host0 ~]# numactl --preferred=1 memhog 128G
... snip ...

This passes with no problem; all memory is accessible and used.



Next, reconfigure the device to devdax mode:


[user@host0 ~]# daxctl list
[
  {
    "chardev":"dax0.0",
    "size":137438953472,
    "target_node":1,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":63,
    "total_memblocks":63,
    "movable":true
  }
]
[user@host0 ~]# daxctl offline-memory dax0.0
offlined memory for 1 device
[user@host0 ~]# daxctl reconfigure-device --human --mode=devdax dax0.0
{
  "chardev":"dax0.0",
  "size":"128.00 GiB (137.44 GB)",
  "target_node":1,
  "align":2097152,
  "mode":"devdax"
}
reconfigured 1 device
[user@host0 mapping0]# daxctl list -M -u
{
  "chardev":"dax0.0",
  "size":"128.00 GiB (137.44 GB)",
  "target_node":1,
  "align":2097152,
  "mode":"devdax",
  "mappings":[
    {
      "page_offset":"0",
      "start":"0x1050000000",
      "end":"0x304fffffff",
      "size":"128.00 GiB (137.44 GB)"
    }
  ]
}


Now map and access the memory via /dev/dax0.0 (test program attached below).
The first write immediately triggers:

[ 1028.430734] kernel BUG at mm/page_table_check.c:53!
[ 1028.430753] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 1028.430763] CPU: 14 PID: 5292 Comm: daxmemtest Not tainted 6.3.0-rc6-dirty #22
[ 1028.430774] Hardware name: AMD Corporation ONYX/ONYX, BIOS ROX1006C 03/01/2023
[ 1028.430785] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.430798] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.430820] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.430829] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.430838] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.430847] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.430857] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.430867] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.430876] FS:  00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.430887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.430895] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.430905] PKRU: 55555554
[ 1028.430909] Call Trace:
[ 1028.430914]  <TASK>
[ 1028.430919]  vmf_insert_pfn_pmd_prot+0x2b4/0x360
[ 1028.430929]  dev_dax_huge_fault+0x181/0x400 [device_dax]
[ 1028.430941]  __handle_mm_fault+0x806/0xfe0
[ 1028.430951]  handle_mm_fault+0x189/0x460
[ 1028.430958]  do_user_addr_fault+0x1e0/0x730
[ 1028.430968]  exc_page_fault+0x7e/0x200
[ 1028.430977]  asm_exc_page_fault+0x22/0x30
[ 1028.430984] RIP: 0033:0x401262
[ 1028.430990] Code: 00 b8 00 00 00 00 e8 1d fe ff ff 8b 45 f4 89 c7 e8 23 fe ff ff b8 01 00 00 00 eb 2a bf 7e 20 40 00 e8 e2 fd ff ff 48 8b 45 e0 <c6> 00 01 8b 45 f4 89 c7 e8 01 fe ff ff bf 85 20 40 00 e8 c7 fd ff
[ 1028.431011] RSP: 002b:00007fff888d8fc0 EFLAGS: 00010206
[ 1028.431019] RAX: 00007f0a65a00000 RBX: 00007fff888d90f8 RCX: 00007f0a65f01c37
[ 1028.431242] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f0a65ffaa70
[ 1028.431446] RBP: 00007fff888d8fe0 R08: 0000000000000003 R09: 0000000000000000
[ 1028.431651] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 1028.431852] R13: 00007fff888d9108 R14: 0000000000403e18 R15: 00007f0a6610d000
[ 1028.432053]  </TASK>
[ 1028.432247] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype nft_compat br_netfilter bridge rpcsec_gss_krb5 stp llc auth_rpcgss overlay nfsv4 dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr sunrpc vfat fat intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm kmem device_dax dax_cxl irqbypass rapl wmi_bmof pcspkr dax_hmem cxl_mem ipmi_ssif cxl_port acpi_ipmi ipmi_si ipmi_devintf i2c_piix4 cxl_pci k10temp ipmi_msghandler cxl_acpi cxl_core acpi_cpufreq fuse zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme nvme_core ast tg3 i2c_algo_bit nvme_common sp5100_tco ccp wmi
[ 1028.434042] ---[ end trace 0000000000000000 ]---
[ 1028.434278] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.434518] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.435009] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.435251] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.435501] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.435744] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.435985] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.436220] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.436454] FS:  00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.436683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.436910] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.437140] PKRU: 55555554
[ 1028.437375] note: daxmemtest[5292] exited with preempt_count 1



Test program:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(void) {
    // Open the DAX device
    const char *device_path = "/dev/dax0.0"; // Replace with your DAX device path
    int dax_fd = open(device_path, O_RDWR);

    if (dax_fd < 0) {
        fprintf(stderr, "Error: Unable to open DAX device: %s\n", strerror(errno));
        return 1;
    }
    printf("file opened\n");

    // Memory-map the DAX device
    size_t size = 1024*1024*2; // 2MB
    void *mapped_memory = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);

    if (mapped_memory == MAP_FAILED) {
        fprintf(stderr, "Error: Unable to mmap DAX device: %s\n", strerror(errno));
        close(dax_fd);
        return 1;
    }
    printf("mmaped\n");

    ((char*)mapped_memory)[0] = 1; // first access faults the page; the BUG fires here

/*
    // Lock the memory region using mlock
    int result = mlock(mapped_memory, size);

    if (result != 0) {
        printf("Error: Unable to lock memory using mlock: %s\n", strerror(errno));
        munmap(mapped_memory, size);
        close(dax_fd);
        return 1;
    }
    printf("mlocked\n");

    // Use the mapped_memory for your application

    // Remember to unlock the memory using munlock before unmapping it
    result = munlock(mapped_memory, size);
    if (result != 0) {
        printf("Error: Unable to unlock memory using munlock: %s\n", strerror(errno));
    }
    printf("munlocked\n");

    munmap(mapped_memory, size);
*/
    munmap(mapped_memory, size);
    close(dax_fd);
    printf("success\n");
    return 0;
}



CXL topology at time of error:
[user@host0 ~]# ./cxl list -vvvv
[
  {
    "bus":"root0",
    "provider":"ACPI.CXL",
    "nr_dports":1,
    "dports":[
      {
        "dport":"pci0000:3f",
        "alias":"ACPI0016:00",
        "id":4
      }
    ],
    "endpoints:root0":[
      {
        "endpoint":"endpoint1",
        "host":"mem0",
        "depth":1,
        "memdev":{
          "memdev":"mem0",
          "ram_size":137438953472,
          "health":{
            "maintenance_needed":true,
            "performance_degraded":false,
            "hw_replacement_needed":false,
            "media_normal":false,
            "media_not_ready":false,
            "media_persistence_lost":true,
            "media_data_lost":false,
            "media_powerloss_persistence_loss":false,
            "media_shutdown_persistence_loss":false,
            "media_persistence_loss_imminent":false,
            "media_powerloss_data_loss":false,
            "media_shutdown_data_loss":false,
            "media_data_loss_imminent":false,
            "ext_life_used":"unknown",
            "ext_temperature":"normal",
            "ext_corrected_volatile":"normal",
            "ext_corrected_persistent":"normal",
            "life_used_percent":4,
            "temperature":0,
            "dirty_shutdowns":0,
            "volatile_errors":0,
            "pmem_errors":0
          },
          "alert_config":{
            "life_used_prog_warn_threshold_valid":false,
            "dev_over_temperature_prog_warn_threshold_valid":true,
            "dev_under_temperature_prog_warn_threshold_valid":false,
            "corrected_volatile_mem_err_prog_warn_threshold_valid":false,
            "corrected_pmem_err_prog_warn_threshold_valid":false,
            "life_used_prog_warn_threshold_writable":false,
            "dev_over_temperature_prog_warn_threshold_writable":true,
            "dev_under_temperature_prog_warn_threshold_writable":false,
            "corrected_volatile_mem_err_prog_warn_threshold_writable":false,
            "corrected_pmem_err_prog_warn_threshold_writable":false,
            "life_used_crit_alert_threshold":75,
            "life_used_prog_warn_threshold":25,
            "dev_over_temperature_crit_alert_threshold":150,
            "dev_under_temperature_crit_alert_threshold":65360,
            "dev_over_temperature_prog_warn_threshold":75,
            "dev_under_temperature_prog_warn_threshold":65472,
            "corrected_volatile_mem_err_prog_warn_threshold":16,
            "corrected_pmem_err_prog_warn_threshold":0
          },
          "serial":9947034750368612352,
          "host":"0000:3f:00.0",
          "partition_info":{
            "total_size":137438953472,
            "volatile_only_size":137438953472,
            "persistent_only_size":0,
            "partition_alignment_size":0
          }
        },
        "decoders:endpoint1":[
          {
            "decoder":"decoder1.0",
            "resource":70061654016,
            "size":137438953472,
            "interleave_ways":1,
            "region":"region0",
            "dpa_resource":0,
            "dpa_size":137438953472,
            "mode":"ram"
          }
        ]
      }
    ],
    "decoders:root0":[
      {
        "decoder":"decoder0.0",
        "resource":70061654016,
        "size":137438953472,
        "interleave_ways":1,
        "max_available_extent":0,
        "volatile_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:3f",
            "alias":"ACPI0016:00",
            "position":0,
            "id":4
          }
        ],
        "regions:decoder0.0":[
          {
            "region":"region0",
            "resource":70061654016,
            "size":137438953472,
            "type":"ram",
            "interleave_ways":1,
            "interleave_granularity":4096,
            "decode_state":"commit",
            "mappings":[
              {
                "position":0,
                "memdev":"mem0",
                "decoder":"decoder1.0"
              }
            ],
            "daxregion":{
              "id":0,
              "size":137438953472,
              "align":2097152,
              "devices":[
                {
                  "chardev":"dax0.0",
                  "size":137438953472,
                  "target_node":1,
                  "align":2097152,
                  "mode":"devdax"
                }
              ]
            }
          }
        ]
      }
    ]
  }
]



~Gregory

Thread overview: 6+ messages
2023-04-12 18:43 Gregory Price [this message]
2023-04-13 11:39 ` Gregory Price
2023-04-18  6:43   ` Dan Williams
2023-04-20  0:58     ` Gregory Price
2023-04-18  6:35 ` Dan Williams
2023-04-20  1:29   ` Gregory Price