LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH RFC PKS/PMEM 09/58] drivers/gpu: Utilize new kmap_thread()
From: Ira Weiny @ 2020-10-10 23:01 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra, David Airlie, Patrik Jakobsson,
	x86, Dave Hansen, Dan Williams, Fenghua Yu, linux-doc,
	linux-kernel, linux-nvdimm, linux-fsdevel, linux-mm,
	linux-kselftest, linuxppc-dev, kvm, netdev, bpf, kexec,
	linux-bcache, linux-mtd, devel, linux-efi, linux-mmc, linux-scsi,
	target-devel, linux-nfs, ceph-devel, linux-ext4, linux-aio,
	io-uring, linux-erofs, linux-um, linux-ntfs-dev, reiserfs-devel,
	linux-f2fs-devel, linux-nilfs, cluster-devel, ecryptfs,
	linux-cifs, linux-btrfs, linux-afs, linux-rdma, amd-gfx,
	dri-devel, intel-gfx, drbd-dev, linux-block, xen-devel,
	linux-cachefs, samba-technical, intel-wired-lan
In-Reply-To: <20201009220349.GQ438822@phenom.ffwll.local>

On Sat, Oct 10, 2020 at 12:03:49AM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2020 at 12:49:44PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > These kmap() calls in the gpu stack are localized to a single thread.
> > To avoid the over head of global PKRS updates use the new kmap_thread()
> > call.
> > 
> > Cc: David Airlie <airlied@linux.ie>
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> I'm guessing the entire pile goes in through some other tree.
>

Apologies for not realizing there were multiple maintainers here.

But, I was thinking it would land together through the mm tree once the core
support lands.  I've tried to split these out in a way they can be easily
reviewed/acked by the correct developers.

> If so:
> 
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> If you want this to land through maintainer trees, then we need a
> per-driver split (since aside from amdgpu and radeon they're all different
> subtrees).

It is just RFC for the moment.  I need to get the core support accepted first
then this can land.

> 
> btw the two kmap calls in drm you highlight in the cover letter should
> also be convertible to kmap_thread. We only hold vmalloc mappings for a
> longer time (or it'd be quite a driver bug). So if you want maybe throw
> those two as two additional patches on top, and we can do some careful
> review & testing for them.

Cool.  I'll add them in.

Ira

> -Daniel
> 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c              | 12 ++++++------
> >  drivers/gpu/drm/gma500/gma_display.c                 |  4 ++--
> >  drivers/gpu/drm/gma500/mmu.c                         | 10 +++++-----
> >  drivers/gpu/drm/i915/gem/i915_gem_shmem.c            |  4 ++--
> >  .../gpu/drm/i915/gem/selftests/i915_gem_context.c    |  4 ++--
> >  drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c   |  8 ++++----
> >  drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c         |  4 ++--
> >  drivers/gpu/drm/i915/gt/intel_gtt.c                  |  4 ++--
> >  drivers/gpu/drm/i915/gt/shmem_utils.c                |  4 ++--
> >  drivers/gpu/drm/i915/i915_gem.c                      |  8 ++++----
> >  drivers/gpu/drm/i915/i915_gpu_error.c                |  4 ++--
> >  drivers/gpu/drm/i915/selftests/i915_perf.c           |  4 ++--
> >  drivers/gpu/drm/radeon/radeon_ttm.c                  |  4 ++--
> >  13 files changed, 37 insertions(+), 37 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > index 978bae731398..bd564bccb7a3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > @@ -2437,11 +2437,11 @@ static ssize_t amdgpu_ttm_gtt_read(struct file *f, char __user *buf,
> >  
> >  		page = adev->gart.pages[p];
> >  		if (page) {
> > -			ptr = kmap(page);
> > +			ptr = kmap_thread(page);
> >  			ptr += off;
> >  
> >  			r = copy_to_user(buf, ptr, cur_size);
> > -			kunmap(adev->gart.pages[p]);
> > +			kunmap_thread(adev->gart.pages[p]);
> >  		} else
> >  			r = clear_user(buf, cur_size);
> >  
> > @@ -2507,9 +2507,9 @@ static ssize_t amdgpu_iomem_read(struct file *f, char __user *buf,
> >  		if (p->mapping != adev->mman.bdev.dev_mapping)
> >  			return -EPERM;
> >  
> > -		ptr = kmap(p);
> > +		ptr = kmap_thread(p);
> >  		r = copy_to_user(buf, ptr + off, bytes);
> > -		kunmap(p);
> > +		kunmap_thread(p);
> >  		if (r)
> >  			return -EFAULT;
> >  
> > @@ -2558,9 +2558,9 @@ static ssize_t amdgpu_iomem_write(struct file *f, const char __user *buf,
> >  		if (p->mapping != adev->mman.bdev.dev_mapping)
> >  			return -EPERM;
> >  
> > -		ptr = kmap(p);
> > +		ptr = kmap_thread(p);
> >  		r = copy_from_user(ptr + off, buf, bytes);
> > -		kunmap(p);
> > +		kunmap_thread(p);
> >  		if (r)
> >  			return -EFAULT;
> >  
> > diff --git a/drivers/gpu/drm/gma500/gma_display.c b/drivers/gpu/drm/gma500/gma_display.c
> > index 3df6d6e850f5..35f4e55c941f 100644
> > --- a/drivers/gpu/drm/gma500/gma_display.c
> > +++ b/drivers/gpu/drm/gma500/gma_display.c
> > @@ -400,9 +400,9 @@ int gma_crtc_cursor_set(struct drm_crtc *crtc,
> >  		/* Copy the cursor to cursor mem */
> >  		tmp_dst = dev_priv->vram_addr + cursor_gt->offset;
> >  		for (i = 0; i < cursor_pages; i++) {
> > -			tmp_src = kmap(gt->pages[i]);
> > +			tmp_src = kmap_thread(gt->pages[i]);
> >  			memcpy(tmp_dst, tmp_src, PAGE_SIZE);
> > -			kunmap(gt->pages[i]);
> > +			kunmap_thread(gt->pages[i]);
> >  			tmp_dst += PAGE_SIZE;
> >  		}
> >  
> > diff --git a/drivers/gpu/drm/gma500/mmu.c b/drivers/gpu/drm/gma500/mmu.c
> > index 505044c9a673..fba7a3a461fd 100644
> > --- a/drivers/gpu/drm/gma500/mmu.c
> > +++ b/drivers/gpu/drm/gma500/mmu.c
> > @@ -192,20 +192,20 @@ struct psb_mmu_pd *psb_mmu_alloc_pd(struct psb_mmu_driver *driver,
> >  		pd->invalid_pte = 0;
> >  	}
> >  
> > -	v = kmap(pd->dummy_pt);
> > +	v = kmap_thread(pd->dummy_pt);
> >  	for (i = 0; i < (PAGE_SIZE / sizeof(uint32_t)); ++i)
> >  		v[i] = pd->invalid_pte;
> >  
> > -	kunmap(pd->dummy_pt);
> > +	kunmap_thread(pd->dummy_pt);
> >  
> > -	v = kmap(pd->p);
> > +	v = kmap_thread(pd->p);
> >  	for (i = 0; i < (PAGE_SIZE / sizeof(uint32_t)); ++i)
> >  		v[i] = pd->invalid_pde;
> >  
> > -	kunmap(pd->p);
> > +	kunmap_thread(pd->p);
> >  
> >  	clear_page(kmap(pd->dummy_page));
> > -	kunmap(pd->dummy_page);
> > +	kunmap_thread(pd->dummy_page);
> >  
> >  	pd->tables = vmalloc_user(sizeof(struct psb_mmu_pt *) * 1024);
> >  	if (!pd->tables)
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> > index 38113d3c0138..274424795fb7 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> > @@ -566,9 +566,9 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *dev_priv,
> >  		if (err < 0)
> >  			goto fail;
> >  
> > -		vaddr = kmap(page);
> > +		vaddr = kmap_thread(page);
> >  		memcpy(vaddr, data, len);
> > -		kunmap(page);
> > +		kunmap_thread(page);
> >  
> >  		err = pagecache_write_end(file, file->f_mapping,
> >  					  offset, len, len,
> > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> > index 7ffc3c751432..b466c677d007 100644
> > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> > @@ -1754,7 +1754,7 @@ static int check_scratch_page(struct i915_gem_context *ctx, u32 *out)
> >  		return -EINVAL;
> >  	}
> >  
> > -	vaddr = kmap(page);
> > +	vaddr = kmap_thread(page);
> >  	if (!vaddr) {
> >  		pr_err("No (mappable) scratch page!\n");
> >  		return -EINVAL;
> > @@ -1765,7 +1765,7 @@ static int check_scratch_page(struct i915_gem_context *ctx, u32 *out)
> >  		pr_err("Inconsistent initial state of scratch page!\n");
> >  		err = -EINVAL;
> >  	}
> > -	kunmap(page);
> > +	kunmap_thread(page);
> >  
> >  	return err;
> >  }
> > diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > index 9c7402ce5bf9..447df22e2e06 100644
> > --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> > @@ -143,7 +143,7 @@ static int check_partial_mapping(struct drm_i915_gem_object *obj,
> >  	intel_gt_flush_ggtt_writes(&to_i915(obj->base.dev)->gt);
> >  
> >  	p = i915_gem_object_get_page(obj, offset >> PAGE_SHIFT);
> > -	cpu = kmap(p) + offset_in_page(offset);
> > +	cpu = kmap_thread(p) + offset_in_page(offset);
> >  	drm_clflush_virt_range(cpu, sizeof(*cpu));
> >  	if (*cpu != (u32)page) {
> >  		pr_err("Partial view for %lu [%u] (offset=%llu, size=%u [%llu, row size %u], fence=%d, tiling=%d, stride=%d) misalignment, expected write to page (%llu + %u [0x%llx]) of 0x%x, found 0x%x\n",
> > @@ -161,7 +161,7 @@ static int check_partial_mapping(struct drm_i915_gem_object *obj,
> >  	}
> >  	*cpu = 0;
> >  	drm_clflush_virt_range(cpu, sizeof(*cpu));
> > -	kunmap(p);
> > +	kunmap_thread(p);
> >  
> >  out:
> >  	__i915_vma_put(vma);
> > @@ -236,7 +236,7 @@ static int check_partial_mappings(struct drm_i915_gem_object *obj,
> >  		intel_gt_flush_ggtt_writes(&to_i915(obj->base.dev)->gt);
> >  
> >  		p = i915_gem_object_get_page(obj, offset >> PAGE_SHIFT);
> > -		cpu = kmap(p) + offset_in_page(offset);
> > +		cpu = kmap_thread(p) + offset_in_page(offset);
> >  		drm_clflush_virt_range(cpu, sizeof(*cpu));
> >  		if (*cpu != (u32)page) {
> >  			pr_err("Partial view for %lu [%u] (offset=%llu, size=%u [%llu, row size %u], fence=%d, tiling=%d, stride=%d) misalignment, expected write to page (%llu + %u [0x%llx]) of 0x%x, found 0x%x\n",
> > @@ -254,7 +254,7 @@ static int check_partial_mappings(struct drm_i915_gem_object *obj,
> >  		}
> >  		*cpu = 0;
> >  		drm_clflush_virt_range(cpu, sizeof(*cpu));
> > -		kunmap(p);
> > +		kunmap_thread(p);
> >  		if (err)
> >  			return err;
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
> > index 7fb36b12fe7a..38da348282f1 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
> > @@ -731,7 +731,7 @@ static void swizzle_page(struct page *page)
> >  	char *vaddr;
> >  	int i;
> >  
> > -	vaddr = kmap(page);
> > +	vaddr = kmap_thread(page);
> >  
> >  	for (i = 0; i < PAGE_SIZE; i += 128) {
> >  		memcpy(temp, &vaddr[i], 64);
> > @@ -739,7 +739,7 @@ static void swizzle_page(struct page *page)
> >  		memcpy(&vaddr[i + 64], temp, 64);
> >  	}
> >  
> > -	kunmap(page);
> > +	kunmap_thread(page);
> >  }
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.c b/drivers/gpu/drm/i915/gt/intel_gtt.c
> > index 2a72cce63fd9..4cfb24e9ed62 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gtt.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gtt.c
> > @@ -312,9 +312,9 @@ static void poison_scratch_page(struct page *page, unsigned long size)
> >  	do {
> >  		void *vaddr;
> >  
> > -		vaddr = kmap(page);
> > +		vaddr = kmap_thread(page);
> >  		memset(vaddr, POISON_FREE, PAGE_SIZE);
> > -		kunmap(page);
> > +		kunmap_thread(page);
> >  
> >  		page = pfn_to_page(page_to_pfn(page) + 1);
> >  		size -= PAGE_SIZE;
> > diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c b/drivers/gpu/drm/i915/gt/shmem_utils.c
> > index 43c7acbdc79d..a40d3130cebf 100644
> > --- a/drivers/gpu/drm/i915/gt/shmem_utils.c
> > +++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
> > @@ -142,12 +142,12 @@ static int __shmem_rw(struct file *file, loff_t off,
> >  		if (IS_ERR(page))
> >  			return PTR_ERR(page);
> >  
> > -		vaddr = kmap(page);
> > +		vaddr = kmap_thread(page);
> >  		if (write)
> >  			memcpy(vaddr + offset_in_page(off), ptr, this);
> >  		else
> >  			memcpy(ptr, vaddr + offset_in_page(off), this);
> > -		kunmap(page);
> > +		kunmap_thread(page);
> >  		put_page(page);
> >  
> >  		len -= this;
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 9aa3066cb75d..cae8300fd224 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -312,14 +312,14 @@ shmem_pread(struct page *page, int offset, int len, char __user *user_data,
> >  	char *vaddr;
> >  	int ret;
> >  
> > -	vaddr = kmap(page);
> > +	vaddr = kmap_thread(page);
> >  
> >  	if (needs_clflush)
> >  		drm_clflush_virt_range(vaddr + offset, len);
> >  
> >  	ret = __copy_to_user(user_data, vaddr + offset, len);
> >  
> > -	kunmap(page);
> > +	kunmap_thread(page);
> >  
> >  	return ret ? -EFAULT : 0;
> >  }
> > @@ -708,7 +708,7 @@ shmem_pwrite(struct page *page, int offset, int len, char __user *user_data,
> >  	char *vaddr;
> >  	int ret;
> >  
> > -	vaddr = kmap(page);
> > +	vaddr = kmap_thread(page);
> >  
> >  	if (needs_clflush_before)
> >  		drm_clflush_virt_range(vaddr + offset, len);
> > @@ -717,7 +717,7 @@ shmem_pwrite(struct page *page, int offset, int len, char __user *user_data,
> >  	if (!ret && needs_clflush_after)
> >  		drm_clflush_virt_range(vaddr + offset, len);
> >  
> > -	kunmap(page);
> > +	kunmap_thread(page);
> >  
> >  	return ret ? -EFAULT : 0;
> >  }
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > index 3e6cbb0d1150..aecd469b6b6e 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > @@ -1058,9 +1058,9 @@ i915_vma_coredump_create(const struct intel_gt *gt,
> >  
> >  			drm_clflush_pages(&page, 1);
> >  
> > -			s = kmap(page);
> > +			s = kmap_thread(page);
> >  			ret = compress_page(compress, s, dst, false);
> > -			kunmap(page);
> > +			kunmap_thread(page);
> >  
> >  			drm_clflush_pages(&page, 1);
> >  
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > index c2d001d9c0ec..7f7ef2d056f4 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_perf.c
> > +++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
> > @@ -307,7 +307,7 @@ static int live_noa_gpr(void *arg)
> >  	}
> >  
> >  	/* Poison the ce->vm so we detect writes not to the GGTT gt->scratch */
> > -	scratch = kmap(ce->vm->scratch[0].base.page);
> > +	scratch = kmap_thread(ce->vm->scratch[0].base.page);
> >  	memset(scratch, POISON_FREE, PAGE_SIZE);
> >  
> >  	rq = intel_context_create_request(ce);
> > @@ -405,7 +405,7 @@ static int live_noa_gpr(void *arg)
> >  out_rq:
> >  	i915_request_put(rq);
> >  out_ce:
> > -	kunmap(ce->vm->scratch[0].base.page);
> > +	kunmap_thread(ce->vm->scratch[0].base.page);
> >  	intel_context_put(ce);
> >  out:
> >  	stream_destroy(stream);
> > diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> > index 004344dce140..0aba0cac51e1 100644
> > --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> > +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> > @@ -1013,11 +1013,11 @@ static ssize_t radeon_ttm_gtt_read(struct file *f, char __user *buf,
> >  
> >  		page = rdev->gart.pages[p];
> >  		if (page) {
> > -			ptr = kmap(page);
> > +			ptr = kmap_thread(page);
> >  			ptr += off;
> >  
> >  			r = copy_to_user(buf, ptr, cur_size);
> > -			kunmap(rdev->gart.pages[p]);
> > +			kunmap_thread(rdev->gart.pages[p]);
> >  		} else
> >  			r = clear_user(buf, cur_size);
> >  
> > -- 
> > 2.28.0.rc0.12.gb6a658bd00c9
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply

* Re: [PATCH RFC PKS/PMEM 10/58] drivers/rdma: Utilize new kmap_thread()
From: Bernard Metzler @ 2020-10-10 11:36 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-aio, linux-efi, kvm, linux-doc, Peter Zijlstra, linux-mmc,
	Dave Hansen, dri-devel, linux-mm, target-devel, linux-mtd,
	linux-kselftest, samba-technical, Thomas Gleixner, devel,
	linux-cifs, linux-nilfs, linux-scsi, linux-nvdimm, linux-rdma,
	x86, ceph-devel, io-uring, cluster-devel, Jason Gunthorpe,
	Doug Ledford, Ingo Molnar, intel-wired-lan, xen-devel, linux-ext4,
	Fenghua Yu, linux-afs, Faisal Latif, linux-um, intel-gfx,
	ecryptfs, linux-erofs, linux-nfs, reiserfs-devel, linux-block,
	linux-bcache, Borislav Petkov, Andy Lutomirski, drbd-dev, amd-gfx,
	Dan Williams, Shiraz Saleem, bpf, linux-cachefs, Mike Marciniszyn,
	linux-ntfs-dev, netdev, Dennis Dalessandro, kexec, linux-kernel,
	linux-f2fs-devel, linux-fsdevel, Andrew Morton, linuxppc-dev,
	linux-btrfs
In-Reply-To: <20201009195033.3208459-11-ira.weiny@intel.com>

-----ira.weiny@intel.com wrote: -----

>To: "Andrew Morton" <akpm@linux-foundation.org>, "Thomas Gleixner"
><tglx@linutronix.de>, "Ingo Molnar" <mingo@redhat.com>, "Borislav
>Petkov" <bp@alien8.de>, "Andy Lutomirski" <luto@kernel.org>, "Peter
>Zijlstra" <peterz@infradead.org>
>From: ira.weiny@intel.com
>Date: 10/09/2020 09:52PM
>Cc: "Ira Weiny" <ira.weiny@intel.com>, "Mike Marciniszyn"
><mike.marciniszyn@intel.com>, "Dennis Dalessandro"
><dennis.dalessandro@intel.com>, "Doug Ledford" <dledford@redhat.com>,
>"Jason Gunthorpe" <jgg@ziepe.ca>, "Faisal Latif"
><faisal.latif@intel.com>, "Shiraz Saleem" <shiraz.saleem@intel.com>,
>"Bernard Metzler" <bmt@zurich.ibm.com>, x86@kernel.org, "Dave Hansen"
><dave.hansen@linux.intel.com>, "Dan Williams"
><dan.j.williams@intel.com>, "Fenghua Yu" <fenghua.yu@intel.com>,
>linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
>linux-nvdimm@lists.01.org, linux-fsdevel@vger.kernel.org,
>linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
>linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
>netdev@vger.kernel.org, bpf@vger.kernel.org,
>kexec@lists.infradead.org, linux-bcache@vger.kernel.org,
>linux-mtd@lists.infradead.org, devel@driverdev.osuosl.org,
>linux-efi@vger.kernel.org, linux-mmc@vger.kernel.org,
>linux-scsi@vger.kernel.org, target-devel@vger.kernel.org,
>linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org,
>linux-ext4@vger.kernel.org, linux-aio@kvack.org,
>io-uring@vger.kernel.org, linux-erofs@lists.ozlabs.org,
>linux-um@lists.infradead.org, linux-ntfs-dev@lists.sourceforge.net,
>reiserfs-devel@vger.kernel.org,
>linux-f2fs-devel@lists.sourceforge.net, linux-nilfs@vger.kernel.org,
>cluster-devel@redhat.com, ecryptfs@vger.kernel.org,
>linux-cifs@vger.kernel.org, linux-btrfs@vger.kernel.org,
>linux-afs@lists.infradead.org, linux-rdma@vger.kernel.org,
>amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
>intel-gfx@lists.freedesktop.org, drbd-dev@tron.linbit.com,
>linux-block@vger.kernel.org, xen-devel@lists.xenproject.org,
>linux-cachefs@redhat.com, samba-technical@lists.samba.org,
>intel-wired-lan@lists.osuosl.org
>Subject: [EXTERNAL] [PATCH RFC PKS/PMEM 10/58] drivers/rdma: Utilize
>new kmap_thread()
>
>From: Ira Weiny <ira.weiny@intel.com>
>
>The kmap() calls in these drivers are localized to a single thread.
>To
>avoid the over head of global PKRS updates use the new kmap_thread()
>call.
>
>Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
>Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
>Cc: Doug Ledford <dledford@redhat.com>
>Cc: Jason Gunthorpe <jgg@ziepe.ca>
>Cc: Faisal Latif <faisal.latif@intel.com>
>Cc: Shiraz Saleem <shiraz.saleem@intel.com>
>Cc: Bernard Metzler <bmt@zurich.ibm.com>
>Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>---
> drivers/infiniband/hw/hfi1/sdma.c      |  4 ++--
> drivers/infiniband/hw/i40iw/i40iw_cm.c | 10 +++++-----
> drivers/infiniband/sw/siw/siw_qp_tx.c  | 14 +++++++-------
> 3 files changed, 14 insertions(+), 14 deletions(-)
>
>diff --git a/drivers/infiniband/hw/hfi1/sdma.c
>b/drivers/infiniband/hw/hfi1/sdma.c
>index 04575c9afd61..09d206e3229a 100644
>--- a/drivers/infiniband/hw/hfi1/sdma.c
>+++ b/drivers/infiniband/hw/hfi1/sdma.c
>@@ -3130,7 +3130,7 @@ int ext_coal_sdma_tx_descs(struct hfi1_devdata
>*dd, struct sdma_txreq *tx,
> 		}
> 
> 		if (type == SDMA_MAP_PAGE) {
>-			kvaddr = kmap(page);
>+			kvaddr = kmap_thread(page);
> 			kvaddr += offset;
> 		} else if (WARN_ON(!kvaddr)) {
> 			__sdma_txclean(dd, tx);
>@@ -3140,7 +3140,7 @@ int ext_coal_sdma_tx_descs(struct hfi1_devdata
>*dd, struct sdma_txreq *tx,
> 		memcpy(tx->coalesce_buf + tx->coalesce_idx, kvaddr, len);
> 		tx->coalesce_idx += len;
> 		if (type == SDMA_MAP_PAGE)
>-			kunmap(page);
>+			kunmap_thread(page);
> 
> 		/* If there is more data, return */
> 		if (tx->tlen - tx->coalesce_idx)
>diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c
>b/drivers/infiniband/hw/i40iw/i40iw_cm.c
>index a3b95805c154..122d7a5642a1 100644
>--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
>+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
>@@ -3721,7 +3721,7 @@ int i40iw_accept(struct iw_cm_id *cm_id, struct
>iw_cm_conn_param *conn_param)
> 		ibmr->device = iwpd->ibpd.device;
> 		iwqp->lsmm_mr = ibmr;
> 		if (iwqp->page)
>-			iwqp->sc_qp.qp_uk.sq_base = kmap(iwqp->page);
>+			iwqp->sc_qp.qp_uk.sq_base = kmap_thread(iwqp->page);
> 		dev->iw_priv_qp_ops->qp_send_lsmm(&iwqp->sc_qp,
> 							iwqp->ietf_mem.va,
> 							(accept.size + conn_param->private_data_len),
>@@ -3729,12 +3729,12 @@ int i40iw_accept(struct iw_cm_id *cm_id,
>struct iw_cm_conn_param *conn_param)
> 
> 	} else {
> 		if (iwqp->page)
>-			iwqp->sc_qp.qp_uk.sq_base = kmap(iwqp->page);
>+			iwqp->sc_qp.qp_uk.sq_base = kmap_thread(iwqp->page);
> 		dev->iw_priv_qp_ops->qp_send_lsmm(&iwqp->sc_qp, NULL, 0, 0);
> 	}
> 
> 	if (iwqp->page)
>-		kunmap(iwqp->page);
>+		kunmap_thread(iwqp->page);
> 
> 	iwqp->cm_id = cm_id;
> 	cm_node->cm_id = cm_id;
>@@ -4102,10 +4102,10 @@ static void i40iw_cm_event_connected(struct
>i40iw_cm_event *event)
> 	i40iw_cm_init_tsa_conn(iwqp, cm_node);
> 	read0 = (cm_node->send_rdma0_op == SEND_RDMA_READ_ZERO);
> 	if (iwqp->page)
>-		iwqp->sc_qp.qp_uk.sq_base = kmap(iwqp->page);
>+		iwqp->sc_qp.qp_uk.sq_base = kmap_thread(iwqp->page);
> 	dev->iw_priv_qp_ops->qp_send_rtt(&iwqp->sc_qp, read0);
> 	if (iwqp->page)
>-		kunmap(iwqp->page);
>+		kunmap_thread(iwqp->page);
> 
> 	memset(&attr, 0, sizeof(attr));
> 	attr.qp_state = IB_QPS_RTS;
>diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c
>b/drivers/infiniband/sw/siw/siw_qp_tx.c
>index d19d8325588b..4ed37c328d02 100644
>--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
>+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
>@@ -76,7 +76,7 @@ static int siw_try_1seg(struct siw_iwarp_tx *c_tx,
>void *paddr)
> 			if (unlikely(!p))
> 				return -EFAULT;
> 
>-			buffer = kmap(p);
>+			buffer = kmap_thread(p);
> 
> 			if (likely(PAGE_SIZE - off >= bytes)) {
> 				memcpy(paddr, buffer + off, bytes);
>@@ -84,7 +84,7 @@ static int siw_try_1seg(struct siw_iwarp_tx *c_tx,
>void *paddr)
> 				unsigned long part = bytes - (PAGE_SIZE - off);
> 
> 				memcpy(paddr, buffer + off, part);
>-				kunmap(p);
>+				kunmap_thread(p);
> 
> 				if (!mem->is_pbl)
> 					p = siw_get_upage(mem->umem,
>@@ -96,10 +96,10 @@ static int siw_try_1seg(struct siw_iwarp_tx
>*c_tx, void *paddr)
> 				if (unlikely(!p))
> 					return -EFAULT;
> 
>-				buffer = kmap(p);
>+				buffer = kmap_thread(p);
> 				memcpy(paddr + part, buffer, bytes - part);
> 			}
>-			kunmap(p);
>+			kunmap_thread(p);
> 		}
> 	}
> 	return (int)bytes;
>@@ -505,7 +505,7 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
>struct socket *s)
> 				page_array[seg] = p;
> 
> 				if (!c_tx->use_sendpage) {
>-					iov[seg].iov_base = kmap(p) + fp_off;
>+					iov[seg].iov_base = kmap_thread(p) + fp_off;

This misses a corresponding kunmap_thread() in siw_unmap_pages()
(pls change line 403 in siw_qp_tx.c as well)

Thanks,
Bernard.

> 					iov[seg].iov_len = plen;
> 
> 					/* Remember for later kunmap() */
>@@ -518,9 +518,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
>struct socket *s)
> 							plen);
> 				} else if (do_crc) {
> 					crypto_shash_update(c_tx->mpa_crc_hd,
>-							    kmap(p) + fp_off,
>+							    kmap_thread(p) + fp_off,
> 							    plen);
>-					kunmap(p);
>+					kunmap_thread(p);
> 				}
> 			} else {
> 				u64 va = sge->laddr + sge_off;
>-- 
>2.28.0.rc0.12.gb6a658bd00c9
>
>


^ permalink raw reply

* [PATCH] powerpc/mm: Fix verification of MMU_FTR_TYPE_44x
From: Christophe Leroy @ 2020-10-10 17:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Alastair D'Silva
  Cc: linuxppc-dev, linux-kernel

MMU_FTR_TYPE_44x cannot be checked by cpu_has_feature()

Use mmu_has_feature() instead

Fixes: 23eb7f560a2a ("powerpc: Convert flush_icache_range & friends to C")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ddc32cc1b6cf..b7586d8c835b 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -530,7 +530,7 @@ void __flush_dcache_icache(void *p)
 	 * space occurs, before returning to user space.
 	 */
 
-	if (cpu_has_feature(MMU_FTR_TYPE_44x))
+	if (mmu_has_feature(MMU_FTR_TYPE_44x))
 		return;
 
 	invalidate_icache_range(addr, addr + PAGE_SIZE);
-- 
2.25.0


^ permalink raw reply related

* [PATCH 4/4] powerpc/8xx: Manage _PAGE_ACCESSED through APG bits in L1 entry
From: Christophe Leroy @ 2020-10-10 15:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <3f2386f94fada00b541be76d6bc4b6409746a72d.1602342819.git.christophe.leroy@csgroup.eu>

When _PAGE_ACCESSED is not set, a minor fault is expected.
To do this, TLB miss exception ANDs _PAGE_PRESENT and _PAGE_ACCESSED
into the L2 entry valid bit.

To simplify the processing and reduce the number of instructions in
TLB miss exceptions, manage it as an APG bit and get it next to
_PAGE_GUARDED bit to allow a copy in one go. Then declare the
corresponding groups as handling all accesses as user accesses.
As the PP bits always define user as No Access, it will generate
a fault.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/nohash/32/kup-8xx.h |  2 +-
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 47 +++++++-------------
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |  9 ++--
 arch/powerpc/kernel/head_8xx.S               | 36 +++------------
 4 files changed, 28 insertions(+), 66 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
index 85ed2390fb99..567cdc557402 100644
--- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
@@ -63,7 +63,7 @@ static inline void restore_user_access(unsigned long flags)
 static inline bool
 bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
 {
-	return WARN(!((regs->kuap ^ MD_APG_KUAP) & 0xf0000000),
+	return WARN(!((regs->kuap ^ MD_APG_KUAP) & 0xff000000),
 		    "Bug: fault blocked by AP register !");
 }
 
diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
index 1d9ac0f9c794..0bd1b144eb76 100644
--- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
@@ -33,19 +33,18 @@
  * respectively NA for All or X for Supervisor and no access for User.
  * Then we use the APG to say whether accesses are according to Page rules or
  * "all Supervisor" rules (Access to all)
- * Therefore, we define 2 APG groups. lsb is _PMD_USER
- * 0 => Kernel => 01 (all accesses performed according to page definition)
- * 1 => User => 00 (all accesses performed as supervisor iaw page definition)
- * 2-15 => Not Used
- */
-#define MI_APG_INIT	0x40000000
-
-/*
- * 0 => Kernel => 01 (all accesses performed according to page definition)
- * 1 => User => 10 (all accesses performed according to swaped page definition)
- * 2-15 => Not Used
- */
-#define MI_APG_KUEP	0x60000000
+ * _PAGE_ACCESSED is also managed via APG. When _PAGE_ACCESSED is not set, say
+ * "all User" rules, that will lead to NA for all.
+ * Therefore, we define 4 APG groups. lsb is _PAGE_ACCESSED
+ * 0 => Kernel => 11 (all accesses performed according as user iaw page definition)
+ * 1 => Kernel+Accessed => 01 (all accesses performed according to page definition)
+ * 2 => User => 11 (all accesses performed according as user iaw page definition)
+ * 3 => User+Accessed => 00 (all accesses performed as supervisor iaw page definition) for INIT
+ *                    => 10 (all accesses performed according to swaped page definition) for KUEP
+ * 4-15 => Not Used
+ */
+#define MI_APG_INIT	0xdc000000
+#define MI_APG_KUEP	0xde000000
 
 /* The effective page number register.  When read, contains the information
  * about the last instruction TLB miss.  When MI_RPN is written, bits in
@@ -106,25 +105,9 @@
 #define MD_Ks		0x80000000	/* Should not be set */
 #define MD_Kp		0x40000000	/* Should always be set */
 
-/*
- * All pages' PP data bits are set to either 000 or 011 or 001, which means
- * respectively RW for Supervisor and no access for User, or RO for
- * Supervisor and no access for user and NA for ALL.
- * Then we use the APG to say whether accesses are according to Page rules or
- * "all Supervisor" rules (Access to all)
- * Therefore, we define 2 APG groups. lsb is _PMD_USER
- * 0 => Kernel => 01 (all accesses performed according to page definition)
- * 1 => User => 00 (all accesses performed as supervisor iaw page definition)
- * 2-15 => Not Used
- */
-#define MD_APG_INIT	0x40000000
-
-/*
- * 0 => No user => 01 (all accesses performed according to page definition)
- * 1 => User => 10 (all accesses performed according to swaped page definition)
- * 2-15 => Not Used
- */
-#define MD_APG_KUAP	0x60000000
+/* See explanation above at the definition of MI_APG_INIT */
+#define MD_APG_INIT	0xdc000000
+#define MD_APG_KUAP	0xde000000
 
 /* The effective page number register.  When read, contains the information
  * about the last instruction TLB miss.  When MD_RPN is written, bits in
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 66f403a7da44..1581204467e1 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -39,9 +39,9 @@
  * into the TLB.
  */
 #define _PAGE_GUARDED	0x0010	/* Copied to L1 G entry in DTLB */
-#define _PAGE_SPECIAL	0x0020	/* SW entry */
+#define _PAGE_ACCESSED	0x0020	/* Copied to L1 APG 1 entry in I/DTLB */
 #define _PAGE_EXEC	0x0040	/* Copied to PP (bit 21) in ITLB */
-#define _PAGE_ACCESSED	0x0080	/* software: page referenced */
+#define _PAGE_SPECIAL	0x0080	/* SW entry */
 
 #define _PAGE_NA	0x0200	/* Supervisor NA, User no access */
 #define _PAGE_RO	0x0600	/* Supervisor RO, User no access */
@@ -59,11 +59,12 @@
 
 #define _PMD_PRESENT	0x0001
 #define _PMD_PRESENT_MASK	_PMD_PRESENT
-#define _PMD_BAD	0x0fd0
+#define _PMD_BAD	0x0f90
 #define _PMD_PAGE_MASK	0x000c
 #define _PMD_PAGE_8M	0x000c
 #define _PMD_PAGE_512K	0x0004
-#define _PMD_USER	0x0020	/* APG 1 */
+#define _PMD_ACCESSED	0x0020	/* APG 1 */
+#define _PMD_USER	0x0040	/* APG 2 */
 
 #define _PTE_NONE_MASK	0
 
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 6f3799a04121..ee0bfebc375f 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -222,23 +222,13 @@ InstructionTLBMiss:
 3:
 	mtcr	r11
 #endif
-#if defined(CONFIG_HUGETLBFS) || !defined(CONFIG_PIN_TLB_TEXT)
 	lwz	r11, (swapper_pg_dir-PAGE_OFFSET)@l(r10)	/* Get level 1 entry */
 	mtspr	SPRN_MD_TWC, r11
-#else
-	lwz	r10, (swapper_pg_dir-PAGE_OFFSET)@l(r10)	/* Get level 1 entry */
-	mtspr	SPRN_MI_TWC, r10	/* Set segment attributes */
-	mtspr	SPRN_MD_TWC, r10
-#endif
 	mfspr	r10, SPRN_MD_TWC
 	lwz	r10, 0(r10)	/* Get the pte */
-#if defined(CONFIG_HUGETLBFS) || !defined(CONFIG_PIN_TLB_TEXT)
+	rlwimi	r11, r10, 0, _PAGE_GUARDED | _PAGE_ACCESSED
 	rlwimi	r11, r10, 32 - 9, _PMD_PAGE_512K
 	mtspr	SPRN_MI_TWC, r11
-#endif
-	rlwinm	r11, r10, 32-7, _PAGE_PRESENT
-	and	r11, r11, r10
-	rlwimi	r10, r11, 0, _PAGE_PRESENT
 	/* The Linux PTE won't go exactly into the MMU TLB.
 	 * Software indicator bits 20 and 23 must be clear.
 	 * Software indicator bits 22, 24, 25, 26, and 27 must be
@@ -289,28 +279,16 @@ DataStoreTLBMiss:
 	mfspr	r10, SPRN_MD_TWC
 	lwz	r10, 0(r10)	/* Get the pte */
 
-	/* Insert the Guarded flag into the TWC from the Linux PTE.
+	/* Insert Guarded and Accessed flags into the TWC from the Linux PTE.
 	 * It is bit 27 of both the Linux PTE and the TWC (at least
 	 * I got that right :-).  It will be better when we can put
 	 * this into the Linux pgd/pmd and load it in the operation
 	 * above.
 	 */
-	rlwimi	r11, r10, 0, _PAGE_GUARDED
+	rlwimi	r11, r10, 0, _PAGE_GUARDED | _PAGE_ACCESSED
 	rlwimi	r11, r10, 32 - 9, _PMD_PAGE_512K
 	mtspr	SPRN_MD_TWC, r11
 
-	/* Both _PAGE_ACCESSED and _PAGE_PRESENT has to be set.
-	 * We also need to know if the insn is a load/store, so:
-	 * Clear _PAGE_PRESENT and load that which will
-	 * trap into DTLB Error with store bit set accordinly.
-	 */
-	/* PRESENT=0x1, ACCESSED=0x20
-	 * r11 = ((r10 & PRESENT) & ((r10 & ACCESSED) >> 5));
-	 * r10 = (r10 & ~PRESENT) | r11;
-	 */
-	rlwinm	r11, r10, 32-7, _PAGE_PRESENT
-	and	r11, r11, r10
-	rlwimi	r10, r11, 0, _PAGE_PRESENT
 	/* The Linux PTE won't go exactly into the MMU TLB.
 	 * Software indicator bits 24, 25, 26, and 27 must be
 	 * set.  All other Linux PTE bits control the behavior
@@ -701,7 +679,7 @@ initial_mmu:
 	li	r9, 4				/* up to 4 pages of 8M */
 	mtctr	r9
 	lis	r9, KERNELBASE@h		/* Create vaddr for TLB */
-	li	r10, MI_PS8MEG | MI_SVALID	/* Set 8M byte page */
+	li	r10, MI_PS8MEG | _PMD_ACCESSED | MI_SVALID
 	li	r11, MI_BOOTINIT		/* Create RPN for address 0 */
 1:
 	mtspr	SPRN_MI_CTR, r8	/* Set instruction MMU control */
@@ -765,7 +743,7 @@ _GLOBAL(mmu_pin_tlb)
 #ifdef CONFIG_PIN_TLB_TEXT
 	LOAD_REG_IMMEDIATE(r5, 28 << 8)
 	LOAD_REG_IMMEDIATE(r6, PAGE_OFFSET)
-	LOAD_REG_IMMEDIATE(r7, MI_SVALID | MI_PS8MEG)
+	LOAD_REG_IMMEDIATE(r7, MI_SVALID | MI_PS8MEG | _PMD_ACCESSED)
 	LOAD_REG_IMMEDIATE(r8, 0xf0 | _PAGE_RO | _PAGE_SPS | _PAGE_SH | _PAGE_PRESENT)
 	LOAD_REG_ADDR(r9, _sinittext)
 	li	r0, 4
@@ -787,7 +765,7 @@ _GLOBAL(mmu_pin_tlb)
 	LOAD_REG_IMMEDIATE(r5, 28 << 8 | MD_TWAM)
 #ifdef CONFIG_PIN_TLB_DATA
 	LOAD_REG_IMMEDIATE(r6, PAGE_OFFSET)
-	LOAD_REG_IMMEDIATE(r7, MI_SVALID | MI_PS8MEG)
+	LOAD_REG_IMMEDIATE(r7, MI_SVALID | MI_PS8MEG | _PMD_ACCESSED)
 #ifdef CONFIG_PIN_TLB_IMMR
 	li	r0, 3
 #else
@@ -824,7 +802,7 @@ _GLOBAL(mmu_pin_tlb)
 #endif
 #ifdef CONFIG_PIN_TLB_IMMR
 	LOAD_REG_IMMEDIATE(r0, VIRT_IMMR_BASE | MD_EVALID)
-	LOAD_REG_IMMEDIATE(r7, MD_SVALID | MD_PS512K | MD_GUARDED)
+	LOAD_REG_IMMEDIATE(r7, MD_SVALID | MD_PS512K | MD_GUARDED | _PMD_ACCESSED)
 	mfspr   r8, SPRN_IMMR
 	rlwinm	r8, r8, 0, 0xfff80000
 	ori	r8, r8, 0xf0 | _PAGE_DIRTY | _PAGE_SPS | _PAGE_SH | \
-- 
2.25.0


^ permalink raw reply related

* [PATCH] powerpc/40x: Always fault when _PAGE_ACCESSED is not set
From: Christophe Leroy @ 2020-10-10 15:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

The kernel expects pte_young() to work regardless of CONFIG_SWAP.

Make sure a minor fault is taken to set _PAGE_ACCESSED when it
is not already set, regardless of the selection of CONFIG_SWAP.

Fixes: 2c74e2586bb9 ("powerpc/40x: Rework 40x PTE access and TLB miss")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/head_40x.S | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/kernel/head_40x.S b/arch/powerpc/kernel/head_40x.S
index 44c9018aed1b..a1ae00689e0f 100644
--- a/arch/powerpc/kernel/head_40x.S
+++ b/arch/powerpc/kernel/head_40x.S
@@ -284,11 +284,7 @@ _ENTRY(saved_ksp_limit)
 
 	rlwimi	r11, r10, 22, 20, 29	/* Compute PTE address */
 	lwz	r11, 0(r11)		/* Get Linux PTE */
-#ifdef CONFIG_SWAP
 	li	r9, _PAGE_PRESENT | _PAGE_ACCESSED
-#else
-	li	r9, _PAGE_PRESENT
-#endif
 	andc.	r9, r9, r11		/* Check permission */
 	bne	5f
 
@@ -369,11 +365,7 @@ _ENTRY(saved_ksp_limit)
 
 	rlwimi	r11, r10, 22, 20, 29	/* Compute PTE address */
 	lwz	r11, 0(r11)		/* Get Linux PTE */
-#ifdef CONFIG_SWAP
 	li	r9, _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
-#else
-	li	r9, _PAGE_PRESENT | _PAGE_EXEC
-#endif
 	andc.	r9, r9, r11		/* Check permission */
 	bne	5f
 
-- 
2.25.0


^ permalink raw reply related

* [PATCH 3/4] powerpc/8xx: Always fault when _PAGE_ACCESSED is not set
From: Christophe Leroy @ 2020-10-10 15:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <3f2386f94fada00b541be76d6bc4b6409746a72d.1602342819.git.christophe.leroy@csgroup.eu>

The kernel expects pte_young() to work regardless of CONFIG_SWAP.

Make sure a minor fault is taken to set _PAGE_ACCESSED when it
is not already set, regardless of the selection of CONFIG_SWAP.

This adds at least 3 instructions to the TLB miss exception
handlers fast path. Following patch will reduce this overhead.

Fixes: d069cb4373fe ("powerpc/8xx: Don't touch ACCESSED when no SWAP.")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/head_8xx.S | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 83101c5888e3..6f3799a04121 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -202,9 +202,7 @@ SystemCall:
 
 InstructionTLBMiss:
 	mtspr	SPRN_SPRG_SCRATCH0, r10
-#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) || defined(CONFIG_HUGETLBFS)
 	mtspr	SPRN_SPRG_SCRATCH1, r11
-#endif
 
 	/* If we are faulting a kernel address, we have to use the
 	 * kernel page tables.
@@ -238,11 +236,9 @@ InstructionTLBMiss:
 	rlwimi	r11, r10, 32 - 9, _PMD_PAGE_512K
 	mtspr	SPRN_MI_TWC, r11
 #endif
-#ifdef CONFIG_SWAP
 	rlwinm	r11, r10, 32-7, _PAGE_PRESENT
 	and	r11, r11, r10
 	rlwimi	r10, r11, 0, _PAGE_PRESENT
-#endif
 	/* The Linux PTE won't go exactly into the MMU TLB.
 	 * Software indicator bits 20 and 23 must be clear.
 	 * Software indicator bits 22, 24, 25, 26, and 27 must be
@@ -256,9 +252,7 @@ InstructionTLBMiss:
 
 	/* Restore registers */
 0:	mfspr	r10, SPRN_SPRG_SCRATCH0
-#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) || defined(CONFIG_HUGETLBFS)
 	mfspr	r11, SPRN_SPRG_SCRATCH1
-#endif
 	rfi
 	patch_site	0b, patch__itlbmiss_exit_1
 
@@ -268,9 +262,7 @@ InstructionTLBMiss:
 	addi	r10, r10, 1
 	stw	r10, (itlb_miss_counter - PAGE_OFFSET)@l(0)
 	mfspr	r10, SPRN_SPRG_SCRATCH0
-#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) || defined(CONFIG_HUGETLBFS)
 	mfspr	r11, SPRN_SPRG_SCRATCH1
-#endif
 	rfi
 #endif
 
@@ -316,11 +308,9 @@ DataStoreTLBMiss:
 	 * r11 = ((r10 & PRESENT) & ((r10 & ACCESSED) >> 5));
 	 * r10 = (r10 & ~PRESENT) | r11;
 	 */
-#ifdef CONFIG_SWAP
 	rlwinm	r11, r10, 32-7, _PAGE_PRESENT
 	and	r11, r11, r10
 	rlwimi	r10, r11, 0, _PAGE_PRESENT
-#endif
 	/* The Linux PTE won't go exactly into the MMU TLB.
 	 * Software indicator bits 24, 25, 26, and 27 must be
 	 * set.  All other Linux PTE bits control the behavior
-- 
2.25.0


^ permalink raw reply related

* [PATCH 2/4] powerpc/8xx: Fix TLB Miss exceptions with CONFIG_SWAP
From: Christophe Leroy @ 2020-10-10 15:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <3f2386f94fada00b541be76d6bc4b6409746a72d.1602342819.git.christophe.leroy@csgroup.eu>

When CONFIG_SWAP is set, _PAGE_ACCESSED gets ANDed with
_PAGE_PRESENT. But during several changes of _PAGE_ACCESSED value,
this operation which is based on bit rotation has been forgotten.

As of today it doesn't work.

Update to the rotation to the correct number of bits.

Fixes: 5f356497c384 ("powerpc/8xx: remove unused _PAGE_WRITETHRU")
Fixes: e0a8e0d90a9f ("powerpc/8xx: Handle PAGE_USER via APG bits")
Fixes: 5b2753fc3e8a ("powerpc/8xx: Implementation of PAGE_EXEC")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/head_8xx.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 32d85387bdc5..83101c5888e3 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -239,7 +239,7 @@ InstructionTLBMiss:
 	mtspr	SPRN_MI_TWC, r11
 #endif
 #ifdef CONFIG_SWAP
-	rlwinm	r11, r10, 32-5, _PAGE_PRESENT
+	rlwinm	r11, r10, 32-7, _PAGE_PRESENT
 	and	r11, r11, r10
 	rlwimi	r10, r11, 0, _PAGE_PRESENT
 #endif
@@ -317,7 +317,7 @@ DataStoreTLBMiss:
 	 * r10 = (r10 & ~PRESENT) | r11;
 	 */
 #ifdef CONFIG_SWAP
-	rlwinm	r11, r10, 32-5, _PAGE_PRESENT
+	rlwinm	r11, r10, 32-7, _PAGE_PRESENT
 	and	r11, r11, r10
 	rlwimi	r10, r11, 0, _PAGE_PRESENT
 #endif
-- 
2.25.0


^ permalink raw reply related

* [PATCH 1/4] powerpc/8xx: Fix instruction TLB miss exception with perf enabled
From: Christophe Leroy @ 2020-10-10 15:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

When perf is enabled, r11 must also be restored when CONFIG_HUGETLBFS
is selected.

Fixes: a891c43b97d3 ("powerpc/8xx: Prepare handlers for _PAGE_HUGE for 512k pages.")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/head_8xx.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 9f359d3fba74..32d85387bdc5 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -268,7 +268,7 @@ InstructionTLBMiss:
 	addi	r10, r10, 1
 	stw	r10, (itlb_miss_counter - PAGE_OFFSET)@l(0)
 	mfspr	r10, SPRN_SPRG_SCRATCH0
-#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP)
+#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) || defined(CONFIG_HUGETLBFS)
 	mfspr	r11, SPRN_SPRG_SCRATCH1
 #endif
 	rfi
-- 
2.25.0


^ permalink raw reply related

* [PATCH] powerpc/603: Always fault when _PAGE_ACCESSED is not set
From: Christophe Leroy @ 2020-10-10 15:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

The kernel expects pte_young() to work regardless of CONFIG_SWAP.

Make sure a minor fault is taken to set _PAGE_ACCESSED when it
is not already set, regardless of the selection of CONFIG_SWAP.

Fixes: 84de6ab0e904 ("powerpc/603: don't handle PAGE_ACCESSED in TLB miss handlers.")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/head_book3s_32.S | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S
index 5eb9eedac920..2aa16d5368e1 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -457,11 +457,7 @@ InstructionTLBMiss:
 	cmplw	0,r1,r3
 #endif
 	mfspr	r2, SPRN_SPRG_PGDIR
-#ifdef CONFIG_SWAP
 	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
-#else
-	li	r1,_PAGE_PRESENT | _PAGE_EXEC
-#endif
 #if defined(CONFIG_MODULES) || defined(CONFIG_DEBUG_PAGEALLOC)
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
@@ -523,11 +519,7 @@ DataLoadTLBMiss:
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 	mfspr	r2, SPRN_SPRG_PGDIR
-#ifdef CONFIG_SWAP
 	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
-#else
-	li	r1, _PAGE_PRESENT
-#endif
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
@@ -603,11 +595,7 @@ DataStoreTLBMiss:
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 	mfspr	r2, SPRN_SPRG_PGDIR
-#ifdef CONFIG_SWAP
 	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED
-#else
-	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT
-#endif
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
-- 
2.25.0


^ permalink raw reply related

* Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length
From: Catalin Marinas @ 2020-10-10 11:09 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Jann Horn, linuxppc-dev, linux-kernel, Christoph Hellwig,
	linux-mm, Paul Mackerras, sparclinux, Anthony Yznaga,
	Andrew Morton, Will Deacon, David S. Miller, linux-arm-kernel
In-Reply-To: <d5332a7b-c300-6d28-18b9-4b7d4110ef86@oracle.com>

Hi Khalid,

On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
> On 10/7/20 1:39 AM, Jann Horn wrote:
> > arch_validate_prot() is a hook that can validate whether a given set of
> > protection flags is valid in an mprotect() operation. It is given the set
> > of protection flags and the address being modified.
> > 
> > However, the address being modified can currently not actually be used in
> > a meaningful way because:
> > 
> > 1. Only the address is given, but not the length, and the operation can
> >    span multiple VMAs. Therefore, the callee can't actually tell which
> >    virtual address range, or which VMAs, are being targeted.
> > 2. The mmap_lock is not held, meaning that if the callee were to check
> >    the VMA at @addr, that VMA would be unrelated to the one the
> >    operation is performed on.
> > 
> > Currently, custom arch_validate_prot() handlers are defined by
> > arm64, powerpc and sparc.
> > arm64 and powerpc don't care about the address range, they just check the
> > flags against CPU support masks.
> > sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> > the mmap_lock.
> > 
> > Change the function signature to also take a length, and move the
> > arch_validate_prot() call in mm/mprotect.c down into the locked region.
[...]
> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
> is made without holding mmap_lock. Lock is not acquired until
> vm_mmap_pgoff(). This variance is uncomfortable but I am more
> uncomfortable forcing all implementations of validate_prot to require
> mmap_lock be held when non-sparc implementations do not have such need
> yet. Since do_mmap2() is in powerpc specific code, for now this patch
> solves a current problem.

I still think sparc should avoid walking the vmas in
arch_validate_prot(). The core code already has the vmas, though not
when calling arch_validate_prot(). That's one of the reasons I added
arch_validate_flags() with the MTE patches. For sparc, this could be
(untested, just copied the arch_validate_prot() code):

static inline bool arch_validate_flags(unsigned long vm_flags)
{
	if (!(vm_flags & VM_SPARC_ADI))
		return true;

	if (!adi_capable())
		return false;

	/* ADI can not be enabled on PFN mapped pages */
	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return false;

	/*
	 * Mergeable pages can become unmergeable if ADI is enabled on
	 * them even if they have identical data on them. This can be
	 * because ADI enabled pages with identical data may still not
	 * have identical ADI tags on them. Disallow ADI on mergeable
	 * pages.
	 */
	if (vma->vm_flags & VM_MERGEABLE)
		return false;

	return true;
}

> That leaves open the question of should
> generic mmap call arch_validate_prot and return EINVAL for invalid
> combination of protection bits, but that is better addressed in a
> separate patch.

The above would cover mmap() as well.

The current sparc_validate_prot() relies on finding the vma for the
corresponding address. However, if you call this early in the mmap()
path, there's no such vma. It is only created later in mmap_region()
which no longer has the original PROT_* flags (all converted to VM_*
flags).

Calling arch_validate_flags() on mmap() has a small side-effect on the
user ABI: if the CPU is not adi_capable(), PROT_ADI is currently ignored
on mmap() but rejected by sparc_validate_prot(). Powerpc already does
this already and I think it should be fine for arm64 (it needs checking
though as we have another flag, PROT_BTI, hopefully dynamic loaders
don't pass this flag unconditionally).

However, as I said above, it doesn't solve the mmap() PROT_ADI checking
for sparc since there's no vma yet. I'd strongly recommend the
arch_validate_flags() approach and reverting the "start" parameter added
to arch_validate_prot() if you go for the flags route.

Thanks.

-- 
Catalin

^ permalink raw reply

* Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()
From: James Bottomley @ 2020-10-10  2:43 UTC (permalink / raw)
  To: Eric Biggers, ira.weiny
  Cc: linux-aio, linux-efi, kvm, linux-doc, Peter Zijlstra, linux-mmc,
	Dave Hansen, dri-devel, linux-mm, target-devel, linux-mtd,
	linux-kselftest, Thomas Gleixner, drbd-dev, devel, linux-cifs,
	linux-nilfs, linux-scsi, linux-nvdimm, linux-rdma, x86, amd-gfx,
	linux-afs, cluster-devel, Ingo Molnar, intel-wired-lan, kexec,
	xen-devel, linux-ext4, bpf, Dan Williams, Fenghua Yu, intel-gfx,
	ecryptfs, linux-um, reiserfs-devel, linux-block, linux-bcache,
	Borislav Petkov, Andy Lutomirski, Jaegeuk Kim, ceph-devel,
	io-uring, linux-cachefs, linux-nfs, linux-ntfs-dev, netdev,
	linuxppc-dev, samba-technical, linux-kernel, linux-f2fs-devel,
	linux-fsdevel, Andrew Morton, linux-erofs, linux-btrfs
In-Reply-To: <20201009213434.GA839@sol.localdomain>

On Fri, 2020-10-09 at 14:34 -0700, Eric Biggers wrote:
> On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The kmap() calls in this FS are localized to a single thread.  To
> > avoid the over head of global PKRS updates use the new
> > kmap_thread() call.
> > 
> > Cc: Jaegeuk Kim <jaegeuk@kernel.org>
> > Cc: Chao Yu <chao@kernel.org>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  fs/f2fs/f2fs.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index d9e52a7f3702..ff72a45a577e 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -2410,12 +2410,12 @@ static inline struct page
> > *f2fs_pagecache_get_page(
> >  
> >  static inline void f2fs_copy_page(struct page *src, struct page
> > *dst)
> >  {
> > -	char *src_kaddr = kmap(src);
> > -	char *dst_kaddr = kmap(dst);
> > +	char *src_kaddr = kmap_thread(src);
> > +	char *dst_kaddr = kmap_thread(dst);
> >  
> >  	memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > -	kunmap(dst);
> > -	kunmap(src);
> > +	kunmap_thread(dst);
> > +	kunmap_thread(src);
> >  }
> 
> Wouldn't it make more sense to switch cases like this to
> kmap_atomic()?
> The pages are only mapped to do a memcpy(), then they're immediately
> unmapped.

On a VIPT/VIVT architecture, this is horrendously wasteful.  You're
taking something that was mapped at colour c_src mapping it to a new
address src_kaddr, which is likely a different colour and necessitates
flushing the original c_src, then you copy it to dst_kaddr, which is
also likely a different colour from c_dst, so dst_kaddr has to be
flushed on kunmap and c_dst has to be invalidated on kmap.  What we
should have is an architectural primitive for doing this, something
like kmemcopy_arch(dst, src).  PIPT architectures can implement it as
the above (possibly losing kmap if they don't need it) but VIPT/VIVT
architectures can set up a correctly coloured mapping so they can
simply copy from c_src to c_dst without any need to flush and the data
arrives cache hot at c_dst.

James



^ permalink raw reply

* Re: [PATCH RFC PKS/PMEM 51/58] kernel: Utilize new kmap_thread()
From: Eric W. Biederman @ 2020-10-10  3:43 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-aio, linux-efi, kvm, linux-doc, Peter Zijlstra, linux-mmc,
	Dave Hansen, dri-devel, linux-mm, target-devel, linux-mtd,
	linux-kselftest, samba-technical, Thomas Gleixner, drbd-dev,
	devel, linux-cifs, linux-nilfs, linux-scsi, linux-nvdimm,
	linux-rdma, x86, ceph-devel, amd-gfx, io-uring, cluster-devel,
	Ingo Molnar, intel-wired-lan, xen-devel, linux-ext4, Fenghua Yu,
	linux-afs, linux-um, intel-gfx, ecryptfs, linux-erofs,
	reiserfs-devel, linux-block, linux-bcache, Borislav Petkov,
	Andy Lutomirski, Dan Williams, Andrew Morton, linux-cachefs,
	linux-nfs, linux-ntfs-dev, netdev, kexec, linux-kernel,
	linux-f2fs-devel, linux-fsdevel, bpf, linuxppc-dev, linux-btrfs
In-Reply-To: <20201009195033.3208459-52-ira.weiny@intel.com>

ira.weiny@intel.com writes:

> From: Ira Weiny <ira.weiny@intel.com>
>
> This kmap() call is localized to a single thread.  To avoid the over
> head of global PKRS updates use the new kmap_thread() call.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  kernel/kexec_core.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..272a9920c0d6 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -815,7 +815,7 @@ static int kimage_load_normal_segment(struct kimage *image,
>  		if (result < 0)
>  			goto out;
>  
> -		ptr = kmap(page);
> +		ptr = kmap_thread(page);
>  		/* Start with a clear page */
>  		clear_page(ptr);
>  		ptr += maddr & ~PAGE_MASK;
> @@ -828,7 +828,7 @@ static int kimage_load_normal_segment(struct kimage *image,
>  			memcpy(ptr, kbuf, uchunk);
>  		else
>  			result = copy_from_user(ptr, buf, uchunk);
> -		kunmap(page);
> +		kunmap_thread(page);
>  		if (result) {
>  			result = -EFAULT;
>  			goto out;
> @@ -879,7 +879,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>  			goto out;
>  		}
>  		arch_kexec_post_alloc_pages(page_address(page), 1, 0);
> -		ptr = kmap(page);
> +		ptr = kmap_thread(page);
>  		ptr += maddr & ~PAGE_MASK;
>  		mchunk = min_t(size_t, mbytes,
>  				PAGE_SIZE - (maddr & ~PAGE_MASK));
> @@ -895,7 +895,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>  		else
>  			result = copy_from_user(ptr, buf, uchunk);
>  		kexec_flush_icache_page(page);
> -		kunmap(page);
> +		kunmap_thread(page);
>  		arch_kexec_pre_free_pages(page_address(page), 1);
>  		if (result) {
>  			result = -EFAULT;

^ permalink raw reply

* Re: [PATCH RFC PKS/PMEM 57/58] nvdimm/pmem: Stray access protection for pmem->virt_addr
From: John Hubbard @ 2020-10-10  2:53 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Peter Zijlstra
  Cc: linux-aio, linux-efi, kvm, linux-doc, linux-mmc, Dave Hansen,
	dri-devel, linux-mm, target-devel, linux-mtd, linux-kselftest,
	samba-technical, ceph-devel, drbd-dev, devel, linux-cifs,
	linux-nilfs, linux-scsi, linux-nvdimm, linux-rdma, x86, amd-gfx,
	linux-afs, cluster-devel, linux-cachefs, intel-wired-lan,
	xen-devel, linux-ext4, Fenghua Yu, linux-um, intel-gfx, ecryptfs,
	linux-erofs, reiserfs-devel, linux-block, linux-bcache,
	Dan Williams, io-uring, linux-nfs, linux-ntfs-dev, netdev, kexec,
	linux-kernel, linux-f2fs-devel, linux-fsdevel, bpf, linuxppc-dev,
	linux-btrfs
In-Reply-To: <20201009195033.3208459-58-ira.weiny@intel.com>

On 10/9/20 12:50 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The pmem driver uses a cached virtual address to access its memory
> directly.  Because the nvdimm driver is well aware of the special
> protections it has mapped memory with, we call dev_access_[en|dis]able()
> around the direct pmem->virt_addr (pmem_addr) usage instead of the
> unnecessary overhead of trying to get a page to kmap.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>   drivers/nvdimm/pmem.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index fab29b514372..e4dc1ae990fc 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -148,7 +148,9 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
>   	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
>   		return BLK_STS_IOERR;
>   
> +	dev_access_enable(false);
>   	rc = read_pmem(page, page_off, pmem_addr, len);
> +	dev_access_disable(false);

Hi Ira!

The APIs should be tweaked to use a symbol (GLOBAL, PER_THREAD), instead of
true/false. Try reading the above and you'll see that it sounds like it's
doing the opposite of what it is ("enable_this(false)" sounds like a clumsy
API design to *disable*, right?). And there is no hint about the scope.

And it *could* be so much more readable like this:

     dev_access_enable(DEV_ACCESS_THIS_THREAD);



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply

* Re: [PATCH RFC PKS/PMEM 48/58] drivers/md: Utilize new kmap_thread()
From: Coly Li @ 2020-10-10  2:20 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-aio, linux-efi, kvm, linux-doc, Peter Zijlstra, linux-mmc,
	Dave Hansen, dri-devel, linux-mm, target-devel, linux-mtd,
	linux-kselftest, samba-technical, Thomas Gleixner, drbd-dev,
	devel, linux-cifs, linux-nilfs, linux-scsi, linux-nvdimm,
	linux-rdma, x86, ceph-devel, amd-gfx, io-uring, cluster-devel,
	Ingo Molnar, intel-wired-lan, xen-devel, linux-ext4,
	Kent Overstreet, Fenghua Yu, linux-afs, linux-um, intel-gfx,
	ecryptfs, linux-erofs, reiserfs-devel, linux-block, linux-bcache,
	Borislav Petkov, Andy Lutomirski, Dan Williams, Andrew Morton,
	linux-cachefs, linux-nfs, linux-ntfs-dev, netdev, kexec,
	linux-kernel, linux-f2fs-devel, linux-fsdevel, bpf, linuxppc-dev,
	linux-btrfs
In-Reply-To: <20201009195033.3208459-49-ira.weiny@intel.com>

On 2020/10/10 03:50, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> These kmap() calls are localized to a single thread.  To avoid the over
> head of global PKRS updates use the new kmap_thread() call.
> 

Hi Ira,

There were a number of options considered.

1) Attempt to change all the thread local kmap() calls to kmap_atomic()
2) Introduce a flags parameter to kmap() to indicate if the mapping
should be global or not
3) Change ~20-30 call sites to 'kmap_global()' to indicate that they
require a global mapping of the pages
4) Change ~209 call sites to 'kmap_thread()' to indicate that the
mapping is to be used within that thread of execution only


I copied the above information from patch 00/58 to this message. The
idea behind kmap_thread() is fine to me, but as you said the new api is
very easy to be missed in new code (even for me). I would like to be
supportive to option 2) introduce a flag to kmap(), then we won't forget
the new thread-localized kmap method, and people won't ask why a
_thread() function is called but no kthread created.

Thanks.


Coly Li



> Cc: Coly Li <colyli@suse.de> (maintainer:BCACHE (BLOCK LAYER CACHE))
> Cc: Kent Overstreet <kent.overstreet@gmail.com> (maintainer:BCACHE (BLOCK LAYER CACHE))
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/md/bcache/request.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index c7cadaafa947..a4571f6d09dd 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -44,10 +44,10 @@ static void bio_csum(struct bio *bio, struct bkey *k)
>  	uint64_t csum = 0;
>  
>  	bio_for_each_segment(bv, bio, iter) {
> -		void *d = kmap(bv.bv_page) + bv.bv_offset;
> +		void *d = kmap_thread(bv.bv_page) + bv.bv_offset;
>  
>  		csum = bch_crc64_update(csum, d, bv.bv_len);
> -		kunmap(bv.bv_page);
> +		kunmap_thread(bv.bv_page);
>  	}
>  
>  	k->ptr[KEY_PTRS(k)] = csum & (~0ULL >> 1);
> 


^ permalink raw reply

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
From: Alexander Viro @ 2020-10-10  1:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Kees Cook, Alexander Viro, the arch/x86 maintainers,
	Linux Kernel Mailing List, Alexey Dobriyan, Eric Biggers,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig
In-Reply-To: <CAHk-=wigvcmp-jcgoNCbx45W7j3=0jA320CfpskwuoEjefM7nQ@mail.gmail.com>

On Fri, Oct 09, 2020 at 06:29:13PM -0700, Linus Torvalds wrote:
> On Fri, Oct 9, 2020 at 6:19 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > Okay, that makes more sense.  So the patchset from Matthew
> > https://lkml.kernel.org/linux-fsdevel/20201003025534.21045-1-willy@infradead.org/T/#u
> > isn't what you had in mind.
> 
> No.
> 
> That first patch makes sense - it's just the "ppos can be NULL" patch.
> 
> But as mentioned, NULL isn't "shorthand for zero". It's just "pipes
> don't _have_ a pos, trying to pass in some explicit position is
> crazy".
> 
> So no, the other patches in that set are a bit odd, I think.
> 
> SOME of them look potentially fine - the bpfilter one seems to be
> valid, for example, because it's literally about reading/writing a
> pipe. And maybe the sysctl one is similarly sensible - I didn't check
> the context of that one.

FWIW, I hadn't pushed that branch out (or merged it into #for-next yet);
for one thing, uml part (mconsole) is simply broken, for another...
IMO ##5--8 are asking for kernel_pread() and if you look at binfmt_elf.c,
you'll see elf_read() being pretty much that.  acct.c, keys and usermode
parts are asking for kernel_pwrite() as well.

I've got stuck looking through the drivers/target stuff - it would've
been another kernel_pwrite() candidate, but it smells like its use of
filp_open() is really asking for trouble, starting with symlink attacks.
Not sure - I'm not familiar with the area, but...


^ permalink raw reply

* RE: [PATCH v3 1/5] Documentation: dt: binding: fsl: Add 'fsl,ippdexpcr1-alt-reg' property
From: Ran Wang @ 2020-10-10  1:57 UTC (permalink / raw)
  To: Rob Herring
  Cc: devicetree@vger.kernel.org, Biwen Li, Shawn Guo,
	linux-kernel@vger.kernel.org, Leo Li, Ran Wang,
	linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <20200929082234.36619-1-ran.wang_1@nxp.com>

Hi Rob,

On Tuesday, September 29, 2020 4:23 PM, Ran Wang wrote:
> 
> From: Biwen Li <biwen.li@nxp.com>
> 
> The 'fsl,ippdexpcr1-alt-reg' property is used to handle an errata A-008646 on
> LS1021A.
> 
> Signed-off-by: Biwen Li <biwen.li@nxp.com>
> Signed-off-by: Ran Wang <ran.wang_1@nxp.com>
> ---
> Change in v3:
>  - Simplize related proterty definition and rename it.

Could you please comment for this version? Thanks in advance.

Regards,
Ran

> Change in v2:
>  - None
> 
>  Documentation/devicetree/bindings/soc/fsl/rcpm.txt | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> index 5a33619..62a22fc 100644
> --- a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> +++ b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> @@ -34,6 +34,9 @@ Chassis Version		Example Chips
>  Optional properties:
>   - little-endian : RCPM register block is Little Endian. Without it RCPM
>     will be Big Endian (default case).
> + - fsl,ippdexpcr1-alt-reg : The property is trying to workaround a
> +   hardware issue (found on SoC LS1021A only), if pressent, RCPM driver
> +   will use SCFG_SPARECR8 as a shadow register for RCPM_IPPDEXPCR1.
> 
>  Example:
>  The RCPM node for T4240:
> @@ -43,6 +46,15 @@ The RCPM node for T4240:
>  		#fsl,rcpm-wakeup-cells = <2>;
>  	};
> 
> +The RCPM node for LS1021A:
> +	rcpm: rcpm@1ee2140 {
> +		compatible = "fsl,ls1021a-rcpm", "fsl,qoriq-rcpm-2.1+";
> +		reg = <0x0 0x1ee2140 0x0 0x8>;
> +		#fsl,rcpm-wakeup-cells = <2>;
> +		fsl,ippdexpcr1-alt-reg;
> +	};
> +
> +
>  * Freescale RCPM Wakeup Source Device Tree Bindings
>  -------------------------------------------
>  Required fsl,rcpm-wakeup property should be added to a device node if the
> device
> --
> 2.7.4


^ permalink raw reply

* [PATCH RFC PKS/PMEM 53/58] lib: Utilize new kmap_thread()
From: ira.weiny @ 2020-10-09 19:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: linux-aio, Song Liu, linux-efi, kvm, linux-doc, linux-mmc,
	Dave Hansen, dri-devel, linux-mm, target-devel, linux-mtd,
	linux-kselftest, samba-technical, Ira Weiny, ceph-devel, drbd-dev,
	devel, linux-cifs, linux-nilfs, linux-scsi, linux-nvdimm,
	linux-rdma, x86, John Fastabend, amd-gfx, linux-afs,
	cluster-devel, linux-cachefs, intel-wired-lan, Yonghong Song,
	linux-ext4, Andrii Nakryiko, Fenghua Yu, linux-um, intel-gfx,
	ecryptfs, linux-erofs, reiserfs-devel, linux-block,
	Jérôme Glisse, Alexander Viro, xen-devel, KP Singh,
	Dan Williams, io-uring, linux-bcache, linux-nfs, linux-ntfs-dev,
	netdev, kexec, linux-kernel, linux-f2fs-devel, linux-fsdevel, bpf,
	linuxppc-dev, Martin KaFai Lau, linux-btrfs
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>

From: Ira Weiny <ira.weiny@intel.com>

These kmap() calls are localized to a single thread.  To avoid the over
head of global PKRS updates use the new kmap_thread() call.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@chromium.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 lib/iov_iter.c | 12 ++++++------
 lib/test_bpf.c |  4 ++--
 lib/test_hmm.c |  8 ++++----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 5e40786c8f12..1d47f957cf95 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -208,7 +208,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
 	}
 	/* Too bad - revert to non-atomic kmap */
 
-	kaddr = kmap(page);
+	kaddr = kmap_thread(page);
 	from = kaddr + offset;
 	left = copyout(buf, from, copy);
 	copy -= left;
@@ -225,7 +225,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
 		from += copy;
 		bytes -= copy;
 	}
-	kunmap(page);
+	kunmap_thread(page);
 
 done:
 	if (skip == iov->iov_len) {
@@ -292,7 +292,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 	}
 	/* Too bad - revert to non-atomic kmap */
 
-	kaddr = kmap(page);
+	kaddr = kmap_thread(page);
 	to = kaddr + offset;
 	left = copyin(to, buf, copy);
 	copy -= left;
@@ -309,7 +309,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 		to += copy;
 		bytes -= copy;
 	}
-	kunmap(page);
+	kunmap_thread(page);
 
 done:
 	if (skip == iov->iov_len) {
@@ -1742,10 +1742,10 @@ int iov_iter_for_each_range(struct iov_iter *i, size_t bytes,
 		return 0;
 
 	iterate_all_kinds(i, bytes, v, -EINVAL, ({
-		w.iov_base = kmap(v.bv_page) + v.bv_offset;
+		w.iov_base = kmap_thread(v.bv_page) + v.bv_offset;
 		w.iov_len = v.bv_len;
 		err = f(&w, context);
-		kunmap(v.bv_page);
+		kunmap_thread(v.bv_page);
 		err;}), ({
 		w = v;
 		err = f(&w, context);})
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index ca7d635bccd9..441f822f56ba 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6506,11 +6506,11 @@ static void *generate_test_data(struct bpf_test *test, int sub)
 		if (!page)
 			goto err_kfree_skb;
 
-		ptr = kmap(page);
+		ptr = kmap_thread(page);
 		if (!ptr)
 			goto err_free_page;
 		memcpy(ptr, test->frag_data, MAX_DATA);
-		kunmap(page);
+		kunmap_thread(page);
 		skb_add_rx_frag(skb, 0, page, 0, MAX_DATA, MAX_DATA);
 	}
 
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e7dc3de355b7..e40d26f97f45 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -329,9 +329,9 @@ static int dmirror_do_read(struct dmirror *dmirror, unsigned long start,
 		if (!page)
 			return -ENOENT;
 
-		tmp = kmap(page);
+		tmp = kmap_thread(page);
 		memcpy(ptr, tmp, PAGE_SIZE);
-		kunmap(page);
+		kunmap_thread(page);
 
 		ptr += PAGE_SIZE;
 		bounce->cpages++;
@@ -398,9 +398,9 @@ static int dmirror_do_write(struct dmirror *dmirror, unsigned long start,
 		if (!page || xa_pointer_tag(entry) != DPT_XA_TAG_WRITE)
 			return -ENOENT;
 
-		tmp = kmap(page);
+		tmp = kmap_thread(page);
 		memcpy(tmp, ptr, PAGE_SIZE);
-		kunmap(page);
+		kunmap_thread(page);
 
 		ptr += PAGE_SIZE;
 		bounce->cpages++;
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related

* [PATCH RFC PKS/PMEM 40/58] net: Utilize new kmap_thread()
From: ira.weiny @ 2020-10-09 19:50 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: linux-aio, Aviad Yehezkel, kvm, linux-doc, linux-mmc, Dave Hansen,
	dri-devel, linux-mm, linux-efi, linux-mtd, linux-kselftest,
	samba-technical, linux-f2fs-devel, Ira Weiny, ceph-devel,
	drbd-dev, devel, linux-cifs, linux-nilfs, xen-devel,
	Daniel Borkmann, linux-nvdimm, Boris Pismenny, x86,
	John Fastabend, amd-gfx, io-uring, cluster-devel, linux-cachefs,
	intel-wired-lan, Alexey Kuznetsov, linux-ext4, linux-rdma,
	Fenghua Yu, linux-afs, linux-um, intel-gfx, ecryptfs, linux-erofs,
	reiserfs-devel, linux-block, linux-bcache, Jakub Kicinski,
	Dan Williams, Trond Myklebust, linux-nfs, linux-scsi,
	Hideaki YOSHIFUJI, netdev, kexec, linux-kernel, David S. Miller,
	linux-ntfs-dev, target-devel, linux-fsdevel, bpf, linuxppc-dev,
	Anna Schumaker, linux-btrfs
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>

From: Ira Weiny <ira.weiny@intel.com>

These kmap() calls in these drivers are localized to a single thread.
To avoid the over head of global PKRS updates use the new kmap_thread()
call.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Boris Pismenny <borisp@nvidia.com>
Cc: Aviad Yehezkel <aviadye@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 net/ceph/messenger.c | 4 ++--
 net/core/datagram.c  | 4 ++--
 net/core/sock.c      | 8 ++++----
 net/ipv4/ip_output.c | 4 ++--
 net/sunrpc/cache.c   | 4 ++--
 net/sunrpc/xdr.c     | 8 ++++----
 net/tls/tls_device.c | 4 ++--
 7 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index d4d7a0e52491..0c49b8e333da 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1535,10 +1535,10 @@ static u32 ceph_crc32c_page(u32 crc, struct page *page,
 {
 	char *kaddr;
 
-	kaddr = kmap(page);
+	kaddr = kmap_thread(page);
 	BUG_ON(kaddr == NULL);
 	crc = crc32c(crc, kaddr + page_offset, length);
-	kunmap(page);
+	kunmap_thread(page);
 
 	return crc;
 }
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 639745d4f3b9..cbd0a343074a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -441,14 +441,14 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
 		end = start + skb_frag_size(frag);
 		if ((copy = end - offset) > 0) {
 			struct page *page = skb_frag_page(frag);
-			u8 *vaddr = kmap(page);
+			u8 *vaddr = kmap_thread(page);
 
 			if (copy > len)
 				copy = len;
 			n = INDIRECT_CALL_1(cb, simple_copy_to_iter,
 					vaddr + skb_frag_off(frag) + offset - start,
 					copy, data, to);
-			kunmap(page);
+			kunmap_thread(page);
 			offset += n;
 			if (n != copy)
 				goto short_copy;
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c5c6b18eff4..9b46a75cd8c1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2846,11 +2846,11 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz
 	ssize_t res;
 	struct msghdr msg = {.msg_flags = flags};
 	struct kvec iov;
-	char *kaddr = kmap(page);
+	char *kaddr = kmap_thread(page);
 	iov.iov_base = kaddr + offset;
 	iov.iov_len = size;
 	res = kernel_sendmsg(sock, &msg, &iov, 1, size);
-	kunmap(page);
+	kunmap_thread(page);
 	return res;
 }
 EXPORT_SYMBOL(sock_no_sendpage);
@@ -2861,12 +2861,12 @@ ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page,
 	ssize_t res;
 	struct msghdr msg = {.msg_flags = flags};
 	struct kvec iov;
-	char *kaddr = kmap(page);
+	char *kaddr = kmap_thread(page);
 
 	iov.iov_base = kaddr + offset;
 	iov.iov_len = size;
 	res = kernel_sendmsg_locked(sk, &msg, &iov, 1, size);
-	kunmap(page);
+	kunmap_thread(page);
 	return res;
 }
 EXPORT_SYMBOL(sock_no_sendpage_locked);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e6f2ada9e7d5..05304fb251a4 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -949,9 +949,9 @@ csum_page(struct page *page, int offset, int copy)
 {
 	char *kaddr;
 	__wsum csum;
-	kaddr = kmap(page);
+	kaddr = kmap_thread(page);
 	csum = csum_partial(kaddr + offset, copy, 0);
-	kunmap(page);
+	kunmap_thread(page);
 	return csum;
 }
 
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index baef5ee43dbb..88193f2a8e6f 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -935,9 +935,9 @@ static ssize_t cache_downcall(struct address_space *mapping,
 	if (!page)
 		goto out_slow;
 
-	kaddr = kmap(page);
+	kaddr = kmap_thread(page);
 	ret = cache_do_downcall(kaddr, buf, count, cd);
-	kunmap(page);
+	kunmap_thread(page);
 	unlock_page(page);
 	put_page(page);
 	return ret;
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index be11d672b5b9..00afbb48fb0a 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -1353,7 +1353,7 @@ xdr_xcode_array2(struct xdr_buf *buf, unsigned int base,
 		base &= ~PAGE_MASK;
 		avail_page = min_t(unsigned int, PAGE_SIZE - base,
 					avail_here);
-		c = kmap(*ppages) + base;
+		c = kmap_thread(*ppages) + base;
 
 		while (avail_here) {
 			avail_here -= avail_page;
@@ -1429,9 +1429,9 @@ xdr_xcode_array2(struct xdr_buf *buf, unsigned int base,
 				}
 			}
 			if (avail_here) {
-				kunmap(*ppages);
+				kunmap_thread(*ppages);
 				ppages++;
-				c = kmap(*ppages);
+				c = kmap_thread(*ppages);
 			}
 
 			avail_page = min(avail_here,
@@ -1471,7 +1471,7 @@ xdr_xcode_array2(struct xdr_buf *buf, unsigned int base,
 out:
 	kfree(elem);
 	if (ppages)
-		kunmap(*ppages);
+		kunmap_thread(*ppages);
 	return err;
 }
 
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index b74e2741f74f..ead5b1c485f8 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -576,13 +576,13 @@ int tls_device_sendpage(struct sock *sk, struct page *page,
 		goto out;
 	}
 
-	kaddr = kmap(page);
+	kaddr = kmap_thread(page);
 	iov.iov_base = kaddr + offset;
 	iov.iov_len = size;
 	iov_iter_kvec(&msg_iter, WRITE, &iov, 1, size);
 	rc = tls_push_data(sk, &msg_iter, size,
 			   flags, TLS_RECORD_TYPE_DATA);
-	kunmap(page);
+	kunmap_thread(page);
 
 out:
 	release_sock(sk);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related

* [PATCH RFC PKS/PMEM 10/58] drivers/rdma: Utilize new kmap_thread()
From: ira.weiny @ 2020-10-09 19:49 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: linux-aio, linux-efi, kvm, linux-doc, linux-mmc, Dave Hansen,
	dri-devel, linux-mm, target-devel, linux-mtd, linux-kselftest,
	samba-technical, Ira Weiny, ceph-devel, drbd-dev, devel,
	linux-cifs, linux-nilfs, linux-scsi, linux-nvdimm, linux-rdma,
	x86, amd-gfx, linux-afs, cluster-devel, Jason Gunthorpe,
	Doug Ledford, linux-cachefs, intel-wired-lan, Bernard Metzler,
	linux-ext4, Fenghua Yu, Faisal Latif, linux-um, intel-gfx,
	ecryptfs, linux-erofs, linux-nfs, reiserfs-devel, linux-block,
	linux-bcache, xen-devel, Dan Williams, Shiraz Saleem, io-uring,
	Mike Marciniszyn, linux-ntfs-dev, netdev, Dennis Dalessandro,
	kexec, linux-kernel, linux-f2fs-devel, linux-fsdevel, bpf,
	linuxppc-dev, linux-btrfs
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>

From: Ira Weiny <ira.weiny@intel.com>

The kmap() calls in these drivers are localized to a single thread.  To
avoid the over head of global PKRS updates use the new kmap_thread()
call.

Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Faisal Latif <faisal.latif@intel.com>
Cc: Shiraz Saleem <shiraz.saleem@intel.com>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/infiniband/hw/hfi1/sdma.c      |  4 ++--
 drivers/infiniband/hw/i40iw/i40iw_cm.c | 10 +++++-----
 drivers/infiniband/sw/siw/siw_qp_tx.c  | 14 +++++++-------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/sdma.c b/drivers/infiniband/hw/hfi1/sdma.c
index 04575c9afd61..09d206e3229a 100644
--- a/drivers/infiniband/hw/hfi1/sdma.c
+++ b/drivers/infiniband/hw/hfi1/sdma.c
@@ -3130,7 +3130,7 @@ int ext_coal_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx,
 		}
 
 		if (type == SDMA_MAP_PAGE) {
-			kvaddr = kmap(page);
+			kvaddr = kmap_thread(page);
 			kvaddr += offset;
 		} else if (WARN_ON(!kvaddr)) {
 			__sdma_txclean(dd, tx);
@@ -3140,7 +3140,7 @@ int ext_coal_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx,
 		memcpy(tx->coalesce_buf + tx->coalesce_idx, kvaddr, len);
 		tx->coalesce_idx += len;
 		if (type == SDMA_MAP_PAGE)
-			kunmap(page);
+			kunmap_thread(page);
 
 		/* If there is more data, return */
 		if (tx->tlen - tx->coalesce_idx)
diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c
index a3b95805c154..122d7a5642a1 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -3721,7 +3721,7 @@ int i40iw_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 		ibmr->device = iwpd->ibpd.device;
 		iwqp->lsmm_mr = ibmr;
 		if (iwqp->page)
-			iwqp->sc_qp.qp_uk.sq_base = kmap(iwqp->page);
+			iwqp->sc_qp.qp_uk.sq_base = kmap_thread(iwqp->page);
 		dev->iw_priv_qp_ops->qp_send_lsmm(&iwqp->sc_qp,
 							iwqp->ietf_mem.va,
 							(accept.size + conn_param->private_data_len),
@@ -3729,12 +3729,12 @@ int i40iw_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 
 	} else {
 		if (iwqp->page)
-			iwqp->sc_qp.qp_uk.sq_base = kmap(iwqp->page);
+			iwqp->sc_qp.qp_uk.sq_base = kmap_thread(iwqp->page);
 		dev->iw_priv_qp_ops->qp_send_lsmm(&iwqp->sc_qp, NULL, 0, 0);
 	}
 
 	if (iwqp->page)
-		kunmap(iwqp->page);
+		kunmap_thread(iwqp->page);
 
 	iwqp->cm_id = cm_id;
 	cm_node->cm_id = cm_id;
@@ -4102,10 +4102,10 @@ static void i40iw_cm_event_connected(struct i40iw_cm_event *event)
 	i40iw_cm_init_tsa_conn(iwqp, cm_node);
 	read0 = (cm_node->send_rdma0_op == SEND_RDMA_READ_ZERO);
 	if (iwqp->page)
-		iwqp->sc_qp.qp_uk.sq_base = kmap(iwqp->page);
+		iwqp->sc_qp.qp_uk.sq_base = kmap_thread(iwqp->page);
 	dev->iw_priv_qp_ops->qp_send_rtt(&iwqp->sc_qp, read0);
 	if (iwqp->page)
-		kunmap(iwqp->page);
+		kunmap_thread(iwqp->page);
 
 	memset(&attr, 0, sizeof(attr));
 	attr.qp_state = IB_QPS_RTS;
diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index d19d8325588b..4ed37c328d02 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -76,7 +76,7 @@ static int siw_try_1seg(struct siw_iwarp_tx *c_tx, void *paddr)
 			if (unlikely(!p))
 				return -EFAULT;
 
-			buffer = kmap(p);
+			buffer = kmap_thread(p);
 
 			if (likely(PAGE_SIZE - off >= bytes)) {
 				memcpy(paddr, buffer + off, bytes);
@@ -84,7 +84,7 @@ static int siw_try_1seg(struct siw_iwarp_tx *c_tx, void *paddr)
 				unsigned long part = bytes - (PAGE_SIZE - off);
 
 				memcpy(paddr, buffer + off, part);
-				kunmap(p);
+				kunmap_thread(p);
 
 				if (!mem->is_pbl)
 					p = siw_get_upage(mem->umem,
@@ -96,10 +96,10 @@ static int siw_try_1seg(struct siw_iwarp_tx *c_tx, void *paddr)
 				if (unlikely(!p))
 					return -EFAULT;
 
-				buffer = kmap(p);
+				buffer = kmap_thread(p);
 				memcpy(paddr + part, buffer, bytes - part);
 			}
-			kunmap(p);
+			kunmap_thread(p);
 		}
 	}
 	return (int)bytes;
@@ -505,7 +505,7 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 				page_array[seg] = p;
 
 				if (!c_tx->use_sendpage) {
-					iov[seg].iov_base = kmap(p) + fp_off;
+					iov[seg].iov_base = kmap_thread(p) + fp_off;
 					iov[seg].iov_len = plen;
 
 					/* Remember for later kunmap() */
@@ -518,9 +518,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 							plen);
 				} else if (do_crc) {
 					crypto_shash_update(c_tx->mpa_crc_hd,
-							    kmap(p) + fp_off,
+							    kmap_thread(p) + fp_off,
 							    plen);
-					kunmap(p);
+					kunmap_thread(p);
 				}
 			} else {
 				u64 va = sge->laddr + sge_off;
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related

* [PATCH RFC PKS/PMEM 06/58] kmap: Introduce k[un]map_thread debugging
From: ira.weiny @ 2020-10-09 19:49 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Juri Lelli, linux-aio, linux-efi, kvm, linux-doc, linux-mmc,
	Dave Hansen, dri-devel, Ben Segall, linux-mm, target-devel,
	linux-mtd, linux-kselftest, samba-technical, Ira Weiny,
	ceph-devel, drbd-dev, devel, linux-cifs, linux-nilfs,
	Vincent Guittot, linux-scsi, linux-nvdimm, linux-rdma, x86,
	amd-gfx, io-uring, cluster-devel, linux-cachefs, intel-wired-lan,
	Mel Gorman, xen-devel, linux-ext4, Fenghua Yu, linux-afs,
	linux-um, intel-gfx, ecryptfs, linux-erofs, reiserfs-devel,
	Steven Rostedt, linux-block, linux-bcache, Dan Williams,
	Dietmar Eggemann, linux-nfs, linux-ntfs-dev, netdev, kexec,
	linux-kernel, linux-f2fs-devel, linux-fsdevel, bpf, linuxppc-dev,
	linux-btrfs
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>

From: Ira Weiny <ira.weiny@intel.com>

Most kmap() callers use the map within a single thread and have no need
for the protection domain to be enabled globally.

To differentiate these kmap users, new k[un]map_thread() calls were
introduced which are thread local.

To aid in debugging the new use of kmap_thread(), add a reference count,
a check on that count, and tracing to ID where mapping errors occur.

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/highmem.h            |  5 +++
 include/linux/sched.h              |  5 +++
 include/trace/events/kmap_thread.h | 56 ++++++++++++++++++++++++++++++
 init/init_task.c                   |  3 ++
 kernel/fork.c                      | 15 ++++++++
 lib/Kconfig.debug                  |  8 +++++
 mm/debug.c                         | 23 ++++++++++++
 7 files changed, 115 insertions(+)
 create mode 100644 include/trace/events/kmap_thread.h

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ef7813544719..22d1c000802e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -247,6 +247,10 @@ static inline void kunmap(struct page *page)
 	__kunmap(page, true);
 }
 
+#ifdef CONFIG_DEBUG_KMAP_THREAD
+void *kmap_thread(struct page *page);
+void kunmap_thread(struct page *page);
+#else
 static inline void *kmap_thread(struct page *page)
 {
 	return __kmap(page, false);
@@ -255,6 +259,7 @@ static inline void kunmap_thread(struct page *page)
 {
 	__kunmap(page, false);
 }
+#endif
 
 /*
  * Prevent people trying to call kunmap_atomic() as if it were kunmap()
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25d97ab6c757..4627ea4a49e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1318,6 +1318,11 @@ struct task_struct {
 #ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
 	unsigned int			dev_page_access_ref;
 #endif
+
+#ifdef CONFIG_DEBUG_KMAP_THREAD
+	unsigned int			kmap_thread_cnt;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/trace/events/kmap_thread.h b/include/trace/events/kmap_thread.h
new file mode 100644
index 000000000000..e7143cfe0daf
--- /dev/null
+++ b/include/trace/events/kmap_thread.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+
+/*
+ * Copyright (c) 2020 Intel Corporation.  All rights reserved.
+ *
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kmap_thread
+
+#if !defined(_TRACE_KMAP_THREAD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KMAP_THREAD_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(kmap_thread_template,
+	TP_PROTO(struct task_struct *tsk, struct page *page,
+		 void *caller_addr, int cnt),
+	TP_ARGS(tsk, page, caller_addr, cnt),
+
+	TP_STRUCT__entry(
+		__field(int, pid)
+		__field(struct page *, page)
+		__field(void *, caller_addr)
+		__field(int, cnt)
+	),
+
+	TP_fast_assign(
+		__entry->pid = tsk->pid;
+		__entry->page = page;
+		__entry->caller_addr = caller_addr;
+		__entry->cnt = cnt;
+	),
+
+	TP_printk("PID %d; (%d) %pS %p",
+		__entry->pid,
+		__entry->cnt,
+		__entry->caller_addr,
+		__entry->page
+	)
+);
+
+DEFINE_EVENT(kmap_thread_template, kmap_thread,
+	TP_PROTO(struct task_struct *tsk, struct page *page,
+		 void *caller_addr, int cnt),
+	TP_ARGS(tsk, page, caller_addr, cnt));
+
+DEFINE_EVENT(kmap_thread_template, kunmap_thread,
+	TP_PROTO(struct task_struct *tsk, struct page *page,
+		 void *caller_addr, int cnt),
+	TP_ARGS(tsk, page, caller_addr, cnt));
+
+
+#endif /* _TRACE_KMAP_THREAD_H */
+
+#include <trace/define_trace.h>
diff --git a/init/init_task.c b/init/init_task.c
index 9b39f25de59b..19f09965eb34 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -212,6 +212,9 @@ struct task_struct init_task
 #ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
 	.dev_page_access_ref = 0,
 #endif
+#ifdef CONFIG_DEBUG_KMAP_THREAD
+	.kmap_thread_cnt = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b6a3ee328a89..2c66e49b7614 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -722,6 +722,17 @@ static inline void put_signal_struct(struct signal_struct *sig)
 		free_signal_struct(sig);
 }
 
+#ifdef CONFIG_DEBUG_KMAP_THREAD
+static void check_outstanding_kmap_thread(struct task_struct *tsk)
+{
+	if (tsk->kmap_thread_cnt)
+		pr_warn(KERN_ERR "WARNING: PID %d; Failed to kunmap_thread() [cnt %d]\n",
+			tsk->pid, tsk->kmap_thread_cnt);
+}
+#else
+static void check_outstanding_kmap_thread(struct task_struct *tsk) { }
+#endif
+
 void __put_task_struct(struct task_struct *tsk)
 {
 	WARN_ON(!tsk->exit_state);
@@ -734,6 +745,7 @@ void __put_task_struct(struct task_struct *tsk)
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	check_outstanding_kmap_thread(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
@@ -943,6 +955,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #endif
 #ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
 	tsk->dev_page_access_ref = 0;
+#endif
+#ifdef CONFIG_DEBUG_KMAP_THREAD
+	tsk->kmap_thread_cnt = 0;
 #endif
 	return tsk;
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index f015c09ba5a1..6507b43d5b0c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -858,6 +858,14 @@ config DEBUG_HIGHMEM
 	  This option enables additional error checking for high memory
 	  systems.  Disable for production systems.
 
+config DEBUG_KMAP_THREAD
+	bool "Kmap debugging"
+	depends on DEBUG_KERNEL
+	help
+	  This option enables additional error checking for kernel mapping code
+	  specifically the k[un]map_thread() calls.  Disable for production
+	  systems.
+
 config HAVE_DEBUG_STACKOVERFLOW
 	bool
 
diff --git a/mm/debug.c b/mm/debug.c
index ca8d1cacdecc..68d186f3570e 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -320,3 +320,26 @@ void page_init_poison(struct page *page, size_t size)
 }
 EXPORT_SYMBOL_GPL(page_init_poison);
 #endif		/* CONFIG_DEBUG_VM */
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/kmap_thread.h>
+
+#ifdef CONFIG_DEBUG_KMAP_THREAD
+void *kmap_thread(struct page *page)
+{
+	trace_kmap_thread(current, page, __builtin_return_address(0),
+			  current->kmap_thread_cnt);
+	current->kmap_thread_cnt++;
+	return __kmap(page, false);
+}
+EXPORT_SYMBOL_GPL(kmap_thread);
+
+void kunmap_thread(struct page *page)
+{
+	__kunmap(page, false);
+	current->kmap_thread_cnt--;
+	trace_kunmap_thread(current, page, __builtin_return_address(0),
+			    current->kmap_thread_cnt);
+}
+EXPORT_SYMBOL_GPL(kunmap_thread);
+#endif
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related

* [PATCH RFC PKS/PMEM 03/58] memremap: Add zone device access protection
From: ira.weiny @ 2020-10-09 19:49 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Peter Zijlstra
  Cc: Juri Lelli, linux-aio, linux-efi, kvm, linux-doc, linux-mmc,
	Dave Hansen, dri-devel, Ben Segall, linux-mm, target-devel,
	linux-mtd, linux-kselftest, samba-technical, Ira Weiny,
	ceph-devel, drbd-dev, devel, linux-cifs, linux-nilfs,
	Vincent Guittot, linux-scsi, linux-nvdimm, linux-rdma, x86,
	amd-gfx, io-uring, cluster-devel, linux-cachefs, intel-wired-lan,
	Mel Gorman, xen-devel, linux-ext4, Fenghua Yu, linux-afs,
	linux-um, intel-gfx, ecryptfs, linux-erofs, reiserfs-devel,
	Steven Rostedt, linux-block, linux-bcache, Dan Williams,
	Dietmar Eggemann, linux-nfs, linux-ntfs-dev, netdev, kexec,
	linux-kernel, linux-f2fs-devel, linux-fsdevel, bpf, linuxppc-dev,
	linux-btrfs
In-Reply-To: <20201009195033.3208459-1-ira.weiny@intel.com>

From: Ira Weiny <ira.weiny@intel.com>

Device managed memory exposes itself to the kernel direct map which
allows stray pointers to access these device memories.

Stray pointers to normal memory may result in a crash or other
undesirable behavior which, while unfortunate, are usually recoverable
with a reboot.  Stray access, specifically stray writes, to areas such
as non-volatile memory are permanent in nature and thus are more likely
to result in permanent user data loss vs stray access to other memory
areas.

Furthermore, we protect against reads which can help with speculative
reads to poison areas as well.  But this is a secondary reason.

Set up an infrastructure for extra device access protection. Then
implement the new protection using the new Protection Keys Supervisor
(PKS) on architectures which support it.

To enable this extra protection devices specify a flag in the pgmap to
indicate that these areas wish to use additional protection.

Kernel code which intends to access this memory can do so automatically
through the use of the kmap infrastructure calling into
dev_access_[enable|disable]() described here.  The kmap infrastructure
is implemented in a follow on patch.

In addition, users can directly enable/disable the access through
dev_access_[enable|disable]() if they have a priori knowledge of the
type of pages they are accessing.

All calls to enable/disable protection flow through
dev_access_[enable|disable]() and are nestable by the use of a per task
reference count.  This reference count does 2 things.

1) Allows a thread to nest calls to disable protection such that the
   first call to re-enable protection does not 'break' the last access of
   the pmem device memory.

2) Provides faster performance by avoiding lots of MSR writes.  For
   example, looping over a sequence of pmem pages.

In addition, we must ensure the reference count is preserved through an
exception so we add the count to irqentry_state_t and save/restore the
reference count while giving exceptions their own count should they use
a kmap call.

The following shows how this works through an exception:

    ...
            // ref == 0
            dev_access_enable()  // ref += 1 ==> disable protection
                    irq()
                            // enable protection
                            // ref = 0
                            _handler()
                                    dev_access_enable()   // ref += 1 ==> disable protection
                                    dev_access_disable()  // ref -= 1 ==> enable protection
                            // WARN_ON(ref != 0)
                            // disable protection
            do_pmem_thing()  // all good here
            dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
    ...

Nested exceptions operate the same way with each exception storing the
interrupted exception state all the way down.

The pkey value is never free'ed as this optimizes the implementation to
be either on or off using a static branch conditional in the fast paths.

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/entry/common.c      | 21 +++++++++
 include/linux/entry-common.h |  3 ++
 include/linux/memremap.h     |  1 +
 include/linux/mm.h           | 43 +++++++++++++++++
 include/linux/sched.h        |  3 ++
 init/init_task.c             |  3 ++
 kernel/fork.c                |  3 ++
 mm/Kconfig                   | 13 ++++++
 mm/memremap.c                | 90 ++++++++++++++++++++++++++++++++++++
 9 files changed, 180 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 86ad32e0095e..3680724c1a4d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,12 +264,27 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, irqentry_state_t *irq_state
  *
  * NOTE That the thread saved PKRS must be preserved separately to ensure
  * global overrides do not 'stick' on a thread.
+ *
+ * Furthermore, Zone Device Access Protection maintains access in a re-entrant
+ * manner through a reference count which also needs to be maintained should
+ * exception handlers use those interfaces for memory access.  Here we start
+ * off the exception handler ref count to 0 and ensure it is 0 when the
+ * exception is done.  Then restore it for the interrupted task.
  */
 noinstr void irq_save_pkrs(irqentry_state_t *state)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	/*
+	 * Save the ref count of the current running process and set it to 0
+	 * for any irq users to properly track re-entrance
+	 */
+	state->pkrs_ref = current->dev_page_access_ref;
+	current->dev_page_access_ref = 0;
+#endif
+
 	/*
 	 * The thread_pkrs must be maintained separately to prevent global
 	 * overrides from 'sticking' on a thread.
@@ -286,6 +301,12 @@ noinstr void irq_restore_pkrs(irqentry_state_t *state)
 
 	write_pkrs(state->pkrs);
 	current->thread.saved_pkrs = state->thread_pkrs;
+
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	WARN_ON_ONCE(current->dev_page_access_ref != 0);
+	/* Restore the interrupted process reference */
+	current->dev_page_access_ref = state->pkrs_ref;
+#endif
 }
 #endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c3b361ffa059..06743cce2dbf 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
 #ifndef irqentry_state
 typedef struct irqentry_state {
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int pkrs_ref;
+#endif
 	u32 pkrs;
 	u32 thread_pkrs;
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..b6713ee7b218 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROT_ENABLED	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..9e845515ff15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,49 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+DECLARE_STATIC_KEY_FALSE(dev_protection_static_key);
+
+/*
+ * We make page_is_access_protected() as quick as possible.
+ *    1) If no mappings have been enabled with extra protection we skip this
+ *       entirely
+ *    2) Skip pages which are not ZONE_DEVICE
+ *    3) Only then check if this particular page was mapped with extra
+ *       protections.
+ */
+static inline bool page_is_access_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROT_ENABLED)
+		return true;
+	return false;
+}
+
+void __dev_access_enable(bool global);
+void __dev_access_disable(bool global);
+static __always_inline void dev_access_enable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_enable(global);
+}
+static __always_inline void dev_access_disable(bool global)
+{
+	if (static_branch_unlikely(&dev_protection_static_key))
+		__dev_access_disable(global);
+}
+#else
+static inline bool page_is_access_protected(struct page *page)
+{
+	return false;
+}
+static inline void dev_access_enable(bool global) { }
+static inline void dev_access_disable(bool global) { }
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..25d97ab6c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,9 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	unsigned int			dev_page_access_ref;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index f6889fce64af..9b39f25de59b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	.dev_page_access_ref = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..b6a3ee328a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+	tsk->dev_page_access_ref = 0;
 #endif
 	return tsk;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b9bc004d9bc..01dd75720ae6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_DEVICE_ACCESS_PROTECTION
+	bool "Device memory access protection"
+	depends on ZONE_DEVICE
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+
+	help
+	  Enable the option of having access protections on device memory
+	  areas.  This protects against access to device memory which is not
+	  intended such as stray writes.  This feature is particularly useful
+	  to protect against corruption of persistent memory.
+
+	  If in doubt, say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
diff --git a/mm/memremap.c b/mm/memremap.c
index fbfc79fd9c24..edad2aa0bd24 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,12 +6,16 @@
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
 #include <linux/pfn_t.h>
+#include <linux/pkeys.h>
 #include <linux/swap.h>
 #include <linux/mmzone.h>
 #include <linux/swapops.h>
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include <uapi/asm-generic/mman-common.h>
+
+#define PKEY_INVALID (INT_MIN)
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -67,6 +71,89 @@ static void devmap_managed_enable_put(void)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_ZONE_DEVICE_ACCESS_PROTECTION
+/*
+ * Note; all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.  If not, protection
+ * will remain disabled.  The key acquisition is attempted at init time and
+ * never again.  So we don't have to worry about dev_page_pkey changing.
+ */
+static int dev_page_pkey = PKEY_INVALID;
+DEFINE_STATIC_KEY_FALSE(dev_protection_static_key);
+EXPORT_SYMBOL(dev_protection_static_key);
+
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID) {
+		pgprotval_t val = pgprot_val(prot);
+
+		static_branch_inc(&dev_protection_static_key);
+		prot = __pgprot(val | _PAGE_PKEY(dev_page_pkey));
+	}
+	return prot;
+}
+
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+	if (pgmap->flags & PGMAP_PROT_ENABLED && dev_page_pkey != PKEY_INVALID)
+		static_branch_dec(&dev_protection_static_key);
+}
+
+void __dev_access_disable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (!--current->dev_page_access_ref)
+		pks_mknoaccess(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_disable);
+
+void __dev_access_enable(bool global)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* 0 clears the PKEY_DISABLE_ACCESS bit, allowing access */
+	if (!current->dev_page_access_ref++)
+		pks_mkrdwr(dev_page_pkey, global);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(__dev_access_enable);
+
+/**
+ * dev_access_protection_init: Configure a PKS key domain for device pages
+ *
+ * The domain defaults to the protected state.  Device page mappings should set
+ * the PGMAP_PROT_ENABLED flag when mapping pages.
+ *
+ * Note the pkey is never free'ed.  This is run at init time and we either get
+ * the key or we do not.  We need to do this to maintian a constant key (or
+ * not) as device memory is added or removed.
+ */
+static int __init __dev_access_protection_init(void)
+{
+	int pkey = pks_key_alloc("Device Memory");
+
+	if (pkey < 0)
+		return 0;
+
+	dev_page_pkey = pkey;
+
+	return 0;
+}
+subsys_initcall(__dev_access_protection_init);
+#else
+static pgprot_t dev_pgprot_get(struct dev_pagemap *pgmap, pgprot_t prot)
+{
+	return prot;
+}
+static void dev_pgprot_put(struct dev_pagemap *pgmap)
+{
+}
+#endif /* CONFIG_ZONE_DEVICE_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct resource *res)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
@@ -156,6 +243,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
+	dev_pgprot_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -191,6 +279,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	int error, is_ram;
 	bool need_devmap_managed = true;
 
+	params.pgprot = dev_pgprot_get(pgmap, params.pgprot);
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related

* Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()
From: Eric Biggers @ 2020-10-10  1:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-aio, linux-efi, kvm, linux-doc, Peter Zijlstra, linux-mmc,
	Dave Hansen, dri-devel, linux-mm, target-devel, linux-mtd,
	linux-kselftest, ira.weiny, Thomas Gleixner, drbd-dev, devel,
	linux-cifs, linux-nilfs, linux-scsi, linux-nvdimm, linux-rdma,
	x86, amd-gfx, linux-afs, cluster-devel, Ingo Molnar,
	intel-wired-lan, kexec, xen-devel, linux-ext4, bpf, Dan Williams,
	Fenghua Yu, intel-gfx, ecryptfs, linux-um, reiserfs-devel,
	linux-block, linux-bcache, Borislav Petkov, Andy Lutomirski,
	Jaegeuk Kim, ceph-devel, io-uring, linux-cachefs, linux-nfs,
	linux-ntfs-dev, netdev, linuxppc-dev, samba-technical,
	linux-kernel, linux-f2fs-devel, linux-fsdevel, Andrew Morton,
	linux-erofs, linux-btrfs
In-Reply-To: <20201010003954.GW20115@casper.infradead.org>

On Sat, Oct 10, 2020 at 01:39:54AM +0100, Matthew Wilcox wrote:
> On Fri, Oct 09, 2020 at 02:34:34PM -0700, Eric Biggers wrote:
> > On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.weiny@intel.com wrote:
> > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > the over head of global PKRS updates use the new kmap_thread() call.
> > >
> > > @@ -2410,12 +2410,12 @@ static inline struct page *f2fs_pagecache_get_page(
> > >  
> > >  static inline void f2fs_copy_page(struct page *src, struct page *dst)
> > >  {
> > > -	char *src_kaddr = kmap(src);
> > > -	char *dst_kaddr = kmap(dst);
> > > +	char *src_kaddr = kmap_thread(src);
> > > +	char *dst_kaddr = kmap_thread(dst);
> > >  
> > >  	memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > > -	kunmap(dst);
> > > -	kunmap(src);
> > > +	kunmap_thread(dst);
> > > +	kunmap_thread(src);
> > >  }
> > 
> > Wouldn't it make more sense to switch cases like this to kmap_atomic()?
> > The pages are only mapped to do a memcpy(), then they're immediately unmapped.
> 
> Maybe you missed the earlier thread from Thomas trying to do something
> similar for rather different reasons ...
> 
> https://lore.kernel.org/lkml/20200919091751.011116649@linutronix.de/

I did miss it.  I'm not subscribed to any of the mailing lists it was sent to.

Anyway, it shouldn't matter.  Patchsets should be standalone, and not require
reading random prior threads on linux-kernel to understand.

And I still don't really understand.  After this patchset, there is still code
nearly identical to the above (doing a temporary mapping just for a memcpy) that
would still be using kmap_atomic().  Is the idea that later, such code will be
converted to use kmap_thread() instead?  If not, why use one over the other?

- Eric

^ permalink raw reply

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
From: Linus Torvalds @ 2020-10-10  1:29 UTC (permalink / raw)
  To: Eric Biggers, Alexander Viro
  Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
	Linux Kernel Mailing List, Christoph Hellwig, Luis Chamberlain,
	Al Viro, linux-fsdevel, linuxppc-dev, Alexey Dobriyan
In-Reply-To: <20201010011919.GC1122@sol.localdomain>

On Fri, Oct 9, 2020 at 6:19 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> Okay, that makes more sense.  So the patchset from Matthew
> https://lkml.kernel.org/linux-fsdevel/20201003025534.21045-1-willy@infradead.org/T/#u
> isn't what you had in mind.

No.

That first patch makes sense - it's just the "ppos can be NULL" patch.

But as mentioned, NULL isn't "shorthand for zero". It's just "pipes
don't _have_ a pos, trying to pass in some explicit position is
crazy".

So no, the other patches in that set are a bit odd, I think.

SOME of them look potentially fine - the bpfilter one seems to be
valid, for example, because it's literally about reading/writing a
pipe. And maybe the sysctl one is similarly sensible - I didn't check
the context of that one.

But no, NULL shouldn't mean "start at position zero, and we don't care
about the result".

              Linus

^ permalink raw reply

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
From: Eric Biggers @ 2020-10-10  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
	Linux Kernel Mailing List, Christoph Hellwig, Luis Chamberlain,
	Al Viro, linux-fsdevel, linuxppc-dev, Alexey Dobriyan
In-Reply-To: <CAHk-=whcEzYjkqdpZciHh+iAdUttvfWZYoiHiF67XuTXB1YJLw@mail.gmail.com>

On Fri, Oct 09, 2020 at 06:03:31PM -0700, Linus Torvalds wrote:
> On Fri, Oct 9, 2020 at 3:06 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > It's a bit unintuitive that ppos=NULL means "use pos 0", not "use file->f_pos".
> 
> That's not at all what it means.
> 
> A NULL ppos means "this has no position at all", and is what we use
> for FMODE_STREAM file descriptors (ie sockets, pipes, etc).
> 
> It also means that we don't do the locking for position updates.
> 
> The fact that "ki_pos" gets set to zero is just because it needs to be
> _something_. It shouldn't actually ever be used for stream devices.
> 

Okay, that makes more sense.  So the patchset from Matthew
https://lkml.kernel.org/linux-fsdevel/20201003025534.21045-1-willy@infradead.org/T/#u
isn't what you had in mind.

- Eric

^ permalink raw reply

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
From: Linus Torvalds @ 2020-10-10  1:03 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
	Linux Kernel Mailing List, Christoph Hellwig, Luis Chamberlain,
	Al Viro, linux-fsdevel, linuxppc-dev, Alexey Dobriyan
In-Reply-To: <20201009220633.GA1122@sol.localdomain>

On Fri, Oct 9, 2020 at 3:06 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> It's a bit unintuitive that ppos=NULL means "use pos 0", not "use file->f_pos".

That's not at all what it means.

A NULL ppos means "this has no position at all", and is what we use
for FMODE_STREAM file descriptors (ie sockets, pipes, etc).

It also means that we don't do the locking for position updates.

The fact that "ki_pos" gets set to zero is just because it needs to be
_something_. It shouldn't actually ever be used for stream devices.

                  Linus

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox