* [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page
From: ankita @ 2023-09-20 14:02 UTC
  To: ankita, jgg, alex.williamson, akpm, tony.luck, bp,
	naoya.horiguchi, linmiaohe
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM currently handles ECC errors / poison only on memory pages
backed by struct page. As part of [1], the nvgrace-gpu-vfio-pci module
maps the device memory to user VA (Qemu) using remap_pfn_range, without
the memory being added to the kernel. These pages are not backed by
struct page.

Implement new ECC handling for memory without struct pages. The kernel MM
exposes registration APIs that allow a module managing the device to
register its memory region and a callback function. MM then tracks such
regions using an interval tree.
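
As a rough illustration of the registration idea, here is a minimal
userspace sketch. All names in it (pfn_region, register_pfn_region,
report_pfn_failure, record_failure) are hypothetical and are not the API
added by this series, and a plain linked list stands in for the interval
tree to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; these names are not the kernel API. */
struct pfn_region;
typedef void (*pfn_failure_cb)(struct pfn_region *region, unsigned long pfn);

struct pfn_region {
	unsigned long start;	/* first pfn of the region, inclusive */
	unsigned long last;	/* last pfn of the region, inclusive */
	pfn_failure_cb failure;	/* invoked when a pfn in the region is poisoned */
	struct pfn_region *next;
};

/* A linked list stands in for the interval tree used by the real code. */
static struct pfn_region *regions;

static void register_pfn_region(struct pfn_region *region)
{
	assert(region && region->failure && region->start <= region->last);
	region->next = regions;
	regions = region;
}

static void unregister_pfn_region(struct pfn_region *region)
{
	struct pfn_region **link;

	for (link = &regions; *link; link = &(*link)->next) {
		if (*link == region) {
			*link = region->next;
			region->next = NULL;
			return;
		}
	}
}

/* Returns 0 if a registered region claimed the pfn, -1 otherwise. */
static int report_pfn_failure(unsigned long pfn)
{
	struct pfn_region *r;

	for (r = regions; r; r = r->next) {
		if (pfn >= r->start && pfn <= r->last) {
			r->failure(r, pfn);
			return 0;
		}
	}
	return -1;
}

/* Example callback: remember the last pfn that was reported. */
static unsigned long last_reported_pfn;

static void record_failure(struct pfn_region *region, unsigned long pfn)
{
	(void)region;
	last_reported_pfn = pfn;
}
```

A module would register one region per device memory range and
unregister it on teardown. A pfn outside every registered range is
simply not claimed.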

The mechanism is largely similar to that of ECC on pfn with struct pages.
If there is an ECC error on a pfn, MM uses the registered memory failure
callback function to notify the module of the faulty PFN, so that the
module may take any required action. The pfn is then unmapped in Stage-2.
When the VM tries to access the page, the access traps into KVM, which
calls the vm ops fault function. If the module fault function returns
VM_FAULT_HWPOISON, KVM sends a BUS_MCEERR_AR to the usermode process
(Qemu) that has the poisoned page mapped.
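
The two halves of this flow (record the poisoned pfn in the callback,
then refuse it in the fault path) can be sketched in userspace as
follows. This is only a model under stated assumptions: dev_mem,
dev_memory_failure, dev_fault and the FAULT_* codes are hypothetical
stand-ins, not the driver's structures or the kernel's vm_fault_t
values.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define DEV_NR_PAGES	256UL	/* hypothetical device size in pages */
#define BITMAP_WORDS	((DEV_NR_PAGES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical result codes standing in for vm_fault_t values. */
enum fault_result { FAULT_OK, FAULT_HWPOISON, FAULT_SIGBUS };

struct dev_mem {
	unsigned long base_pfn;			/* first pfn of device memory */
	unsigned long nr_pages;			/* number of pages, <= DEV_NR_PAGES */
	unsigned long poisoned[BITMAP_WORDS];	/* one bit per page */
};

/* Memory failure callback: mark the page as poisoned in the bitmap. */
static void dev_memory_failure(struct dev_mem *dev, unsigned long pfn)
{
	unsigned long off;

	assert(dev);
	if (pfn < dev->base_pfn)
		return;
	off = pfn - dev->base_pfn;
	if (off >= dev->nr_pages)
		return;
	dev->poisoned[off / BITS_PER_WORD] |= 1UL << (off % BITS_PER_WORD);
}

/* Fault path: report poison instead of mapping a page marked bad. */
static enum fault_result dev_fault(const struct dev_mem *dev, unsigned long pfn)
{
	unsigned long off;

	assert(dev);
	if (pfn < dev->base_pfn || pfn - dev->base_pfn >= dev->nr_pages)
		return FAULT_SIGBUS;
	off = pfn - dev->base_pfn;
	if (dev->poisoned[off / BITS_PER_WORD] & (1UL << (off % BITS_PER_WORD)))
		return FAULT_HWPOISON;
	return FAULT_OK;
}
```

In this model, a pfn that was reported once keeps returning
FAULT_HWPOISON on every later fault, while neighbouring pages are
unaffected.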

Lastly, the nvgrace-gpu-vfio-pci module makes use of the new mechanism to
get poison handling support on the device memory.

Patches generated over v6.6-rc2 with [1] applied. [1] is currently under
review.

[1] https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/

Ankit Agrawal (4):
  mm: handle poisoning of pfn without struct pages
  mm: Add poison error check in fixup_user_fault() for mapped pfn
  mm: Change ghes code to allow poison of non-struct pfn
  vfio/nvgpu: register device memory for poison handling

 drivers/acpi/apei/ghes.c            |  12 +--
 drivers/vfio/pci/nvgrace-gpu/main.c | 107 +++++++++++++++++++++-
 drivers/vfio/vfio.h                 |  11 ---
 drivers/vfio/vfio_main.c            |   3 +-
 include/linux/memory-failure.h      |  22 +++++
 include/linux/mm.h                  |   1 +
 include/linux/vfio.h                |  15 ++++
 include/ras/ras_event.h             |   1 +
 mm/Kconfig                          |   1 +
 mm/gup.c                            |   2 +-
 mm/memory-failure.c                 | 135 +++++++++++++++++++++++-----
 virt/kvm/kvm_main.c                 |   6 ++
 12 files changed, 270 insertions(+), 46 deletions(-)
 create mode 100644 include/linux/memory-failure.h

-- 
2.17.1


* Re: [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling
From: kernel test robot @ 2023-09-25 15:58 UTC
  To: oe-kbuild; +Cc: lkp

:::::: 
:::::: Manual check reason: "git am base is a link in commit message"
:::::: 

BCC: lkp@intel.com
CC: llvm@lists.linux.dev
CC: oe-kbuild-all@lists.linux.dev
In-Reply-To: <20230920140210.12663-5-ankita@nvidia.com>
References: <20230920140210.12663-5-ankita@nvidia.com>
TO: ankita@nvidia.com
TO: ankita@nvidia.com
TO: jgg@nvidia.com
TO: alex.williamson@redhat.com
TO: akpm@linux-foundation.org
TO: tony.luck@intel.com
TO: bp@alien8.de
TO: naoya.horiguchi@nec.com
TO: linmiaohe@huawei.com
CC: aniketa@nvidia.com
CC: cjia@nvidia.com
CC: kwankhede@nvidia.com
CC: targupta@nvidia.com
CC: vsethi@nvidia.com
CC: acurrid@nvidia.com
CC: anuaggarwal@nvidia.com
CC: linux-kernel@vger.kernel.org
CC: linux-mm@kvack.org
CC: linux-edac@vger.kernel.org
CC: kvm@vger.kernel.org

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on awilliam-vfio/for-linus]
[also build test WARNING on kvm/queue rafael-pm/linux-next linus/master]
[cannot apply to akpm-mm/mm-everything awilliam-vfio/next kvm/linux-next next-20230925]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/ankita-nvidia-com/mm-handle-poisoning-of-pfn-without-struct-pages/20230920-220626
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/20230920140210.12663-5-ankita%40nvidia.com
patch subject: [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling
:::::: branch date: 5 days ago
:::::: commit date: 5 days ago
config: powerpc64-allmodconfig (https://download.01.org/0day-ci/archive/20230925/202309252319.hQ7rHJTJ-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230925/202309252319.hQ7rHJTJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/r/202309252319.hQ7rHJTJ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/vfio/pci/nvgrace-gpu/main.c:27:6: warning: no previous prototype for function 'nvgrace_gpu_vfio_pci_pfn_memory_failure' [-Wmissing-prototypes]
      27 | void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
         |      ^
   drivers/vfio/pci/nvgrace-gpu/main.c:27:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
      27 | void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
         | ^
         | static 
   drivers/vfio/pci/nvgrace-gpu/main.c:300:9: warning: no previous prototype for function 'nvgrace_gpu_read_mem' [-Wmissing-prototypes]
     300 | ssize_t nvgrace_gpu_read_mem(void __user *buf, size_t count, loff_t *ppos,
         |         ^
   drivers/vfio/pci/nvgrace-gpu/main.c:300:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
     300 | ssize_t nvgrace_gpu_read_mem(void __user *buf, size_t count, loff_t *ppos,
         | ^
         | static 
   drivers/vfio/pci/nvgrace-gpu/main.c:376:9: warning: no previous prototype for function 'nvgrace_gpu_write_mem' [-Wmissing-prototypes]
     376 | ssize_t nvgrace_gpu_write_mem(size_t count, loff_t *ppos, const void __user *buf,
         |         ^
   drivers/vfio/pci/nvgrace-gpu/main.c:376:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
     376 | ssize_t nvgrace_gpu_write_mem(size_t count, loff_t *ppos, const void __user *buf,
         | ^
         | static 
   3 warnings generated.
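
As the notes above suggest, one common way to silence
-Wmissing-prototypes is to give the function internal linkage, assuming
it is only referenced from within its own translation unit (for
example through an ops table). A minimal sketch with hypothetical names
(mem_ops, demo_read_mem, demo_ops), not the driver's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table standing in for a driver's callback struct. */
struct mem_ops {
	long (*read_mem)(char *buf, size_t count);
};

/*
 * 'static' gives the function internal linkage, so the compiler does
 * not expect a prior prototype in a header and the warning goes away.
 */
static long demo_read_mem(char *buf, size_t count)
{
	size_t i;

	assert(buf);
	for (i = 0; i < count; i++)
		buf[i] = 0;	/* pretend the device returned zeroes */
	return (long)count;
}

/* The function stays reachable through the ops table. */
static const struct mem_ops demo_ops = {
	.read_mem = demo_read_mem,
};
```

If the function does need to be called from another file, the
alternative is to declare its prototype in a shared header instead.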


vim +/nvgrace_gpu_vfio_pci_pfn_memory_failure +27 drivers/vfio/pci/nvgrace-gpu/main.c

b59e9d949a79e1 Ankit Agrawal 2023-09-14  25  
5f3746d8629350 Ankit Agrawal 2023-09-20  26  #ifdef CONFIG_MEMORY_FAILURE
5f3746d8629350 Ankit Agrawal 2023-09-20 @27  void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
5f3746d8629350 Ankit Agrawal 2023-09-20  28  		unsigned long pfn)
5f3746d8629350 Ankit Agrawal 2023-09-20  29  {
5f3746d8629350 Ankit Agrawal 2023-09-20  30  	struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
5f3746d8629350 Ankit Agrawal 2023-09-20  31  		pfn_space, struct nvgrace_gpu_vfio_pci_core_device, pfn_address_space);
5f3746d8629350 Ankit Agrawal 2023-09-20  32  	unsigned long mem_offset = pfn - pfn_space->node.start;
5f3746d8629350 Ankit Agrawal 2023-09-20  33  
5f3746d8629350 Ankit Agrawal 2023-09-20  34  	if (mem_offset >= nvdev->memlength)
5f3746d8629350 Ankit Agrawal 2023-09-20  35  		return;
5f3746d8629350 Ankit Agrawal 2023-09-20  36  
5f3746d8629350 Ankit Agrawal 2023-09-20  37  	/*
5f3746d8629350 Ankit Agrawal 2023-09-20  38  	 * MM has called to notify a poisoned page. Track that in the bitmap.
5f3746d8629350 Ankit Agrawal 2023-09-20  39  	 */
5f3746d8629350 Ankit Agrawal 2023-09-20  40  	__set_bit(mem_offset, nvdev->pfn_bitmap);
5f3746d8629350 Ankit Agrawal 2023-09-20  41  }
5f3746d8629350 Ankit Agrawal 2023-09-20  42  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Thread overview: 14+ messages
2023-09-20 14:02 [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page ankita
2023-09-20 14:02 ` [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages ankita
2023-09-23  3:20   ` Miaohe Lin
2023-09-25 12:36     ` Jason Gunthorpe
2023-09-26  7:23   ` Naoya Horiguchi
2023-09-20 14:02 ` [PATCH v1 2/4] mm: Add poison error check in fixup_user_fault() for mapped pfn ankita
2023-09-20 14:02 ` [PATCH v1 3/4] mm: Change ghes code to allow poison of non-struct pfn ankita
2023-09-20 14:02 ` [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling ankita
2023-09-26  7:38   ` Naoya Horiguchi
2023-09-28 19:45   ` Alex Williamson
2023-09-20 16:02 ` [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page Andrew Morton
2023-09-20 16:04   ` Jason Gunthorpe
  -- strict thread matches above, loose matches on Subject: below --
2023-09-25 15:58 [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling kernel test robot
2023-09-26  5:36 ` kernel test robot
