From: William Lee Irwin III <wli@holomorphy.com>
To: "David S. Miller" <davem@redhat.com>
Cc: torvalds@osdl.org, linux-arch@vger.kernel.org
Subject: Re: copy_page_range()
Date: Sat, 7 Aug 2004 01:07:51 -0700	[thread overview]
Message-ID: <20040807080751.GX17188@holomorphy.com> (raw)
In-Reply-To: <20040807000529.5ca6e8fe.davem@redhat.com>

On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> Every couple months I look at this thing.
> The main issue is that it's very cache unfriendly,
> especially with how sparsely populated the page tables
> are for 64-bit processes.
> As a simple example, it's at the top of the kernel
> profile for 64-bit lat_proc {fork,exec,shell} on
> sparc64.
> And it's in fact the pmd array scans that take all
> of the cache misses, and thus most of the run time.
> An idea I've always been entertaining is to associate
> a bitmask with each pmd table.  For example, a possible
> current implementation could be to abuse page_struct->index
> for this bitmask, and use virt_to_page(pmdp)->index to get
> at it.

Sounds generally reasonable.


On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> This divides the pmd table into BITS_PER_LONG sections.
> If the bit is set in ->index then we populated at least
> one of the pmd entries in that section.  We never clear
> bits, except at pmd table allocation time.
> Then the pmd scan iterates over ->index, and only actually
> dereferences the pmd entries iff it finds a set bit, and
> it only dereferences the section of pmd entries represented
> by that bit.
> Another idea I've also considered is to implement the
> pgd/pmd levels as a more compact tree, based upon virtual
> address, such as a radix tree.
> I think all of this could be experimented with if we
> abstracted out the pmd/pgd/pte iteration.  So much stuff
> in the kernel mm code is of the form:
> 	for_each_pgd(pgdp)
> 		for_each_pmd(pgdp, pmdp)
> 			for_each_pte(pmdp, ptep)
> 				do_something(ptep)
> At 2-levels, as on most of the 32-bit platforms, things
> aren't so bad.
> Comments?
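
The section-bitmask idea quoted above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct pmd_table`, `pmd_set()`, `pmd_scan()`, `PMD_ENTRIES` and `SECTION_SIZE` are hypothetical names invented for the illustration, the `populated` field stands in for the proposed use of `->index`, and none of it is kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace model only: every name and size here is an assumption. */
#define PMD_ENTRIES   1024
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SECTION_SIZE  (PMD_ENTRIES / BITS_PER_LONG)

struct pmd_table {
	unsigned long populated;          /* stands in for ->index */
	unsigned long entry[PMD_ENTRIES]; /* 0 means "nothing mapped" */
};

/* Install an entry and mark its section; bits are never cleared. */
static void pmd_set(struct pmd_table *t, size_t idx, unsigned long val)
{
	assert(idx < PMD_ENTRIES);
	t->entry[idx] = val;
	t->populated |= 1UL << (idx / SECTION_SIZE);
}

/*
 * Count non-empty entries, reading only sections whose bit is set.
 * *touched reports how many entries were actually dereferenced.
 */
static size_t pmd_scan(const struct pmd_table *t, size_t *touched)
{
	size_t found = 0;

	*touched = 0;
	for (size_t bit = 0; bit < BITS_PER_LONG; bit++) {
		size_t first = bit * SECTION_SIZE;

		if (!(t->populated & (1UL << bit)))
			continue; /* whole section skipped, no entry read */
		for (size_t i = first; i < first + SECTION_SIZE; i++) {
			(*touched)++;
			if (t->entry[i])
				found++;
		}
	}
	return found;
}
```

With entries set only in the first and last sections, `pmd_scan()` reads `2 * SECTION_SIZE` entries instead of all `PMD_ENTRIES`, which is the cache-miss saving the proposal is after.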

The number of levels can be abstracted easily. A rough idea of how it
might look:

struct pte_walk_state {
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long vaddr;
};

int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
						struct vm_area_struct *vma)
{
	int cow, ret = 0;
	struct pte_walk_state walk_parent, walk_child;

	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	spin_lock(&dst->page_table_lock);
	pte_walk_descend_and_create(dst, &walk_child, vma->vm_start);
	for_each_inuse_pte(src, &walk_parent, vma->vm_start, vma->vm_end) {
		if (pte_walk_move_and_create(&walk_child, walk_parent.vaddr)) {
			ret = -ENOMEM;
			break;
		}
		/*
		 * do stuff to child and parent ptes
		 */
		 ...
	}
	spin_unlock(&dst->page_table_lock);
	return ret;
}

void zap_page_range(struct vm_area_struct *vma, unsigned long start,
				unsigned long len, struct zap_details *details)
{
	struct pte_walk_state walk;

	spin_lock(&vma->vm_mm->page_table_lock);
	for_each_inuse_pte(vma->vm_mm, &walk, start, start + len) {
		/*
		 * wipe pte and do stuff
		 */
		...
	}
	spin_unlock(&vma->vm_mm->page_table_lock);
}

where

#define for_each_inuse_pte(mm, walk, start, end) \
	for (pte_walk_descend(mm, walk, start); (walk)->vaddr < (end); \
						next_inuse_pte(walk))

etc.
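
For illustration, here is one way the iterator above could behave, as a self-contained userspace model. The names `pte_walk_descend()`, `next_inuse_pte()` and `for_each_inuse_pte()` come from the sketch above, but their bodies here are assumptions, as are the toy table layout (`struct mm`, `LEVEL_BITS`, `ADDR_LIMIT`), the reduced `struct pte_walk_state`, and the helpers `pte_walk_seek()`, `map_page()` and `count_inuse()`. Locking and the `_and_create` variants are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Toy 3-level layout, 4 entries per level; all of it is assumed. */
#define LEVEL_BITS 2
#define PTRS       (1UL << LEVEL_BITS)
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PMD_SHIFT  (PAGE_SHIFT + LEVEL_BITS)
#define PGD_SHIFT  (PMD_SHIFT + LEVEL_BITS)
#define ADDR_LIMIT (1UL << (PGD_SHIFT + LEVEL_BITS))

struct pte_table { unsigned long pte[PTRS]; };      /* 0 = not in use */
struct pmd_table { struct pte_table *pmd[PTRS]; };
struct mm        { struct pmd_table *pgd[PTRS]; };

struct pte_walk_state {
	struct mm *mm;
	unsigned long *pte;  /* current in-use pte, NULL when exhausted */
	unsigned long vaddr;
};

/* Move to the first in-use pte at or above walk->vaddr, skipping
 * absent pmd and pte tables in one step each. */
static void pte_walk_seek(struct pte_walk_state *walk)
{
	unsigned long va = walk->vaddr;

	while (va < ADDR_LIMIT) {
		struct pmd_table *pmd = walk->mm->pgd[va >> PGD_SHIFT];
		struct pte_table *pt;

		if (!pmd) {
			va = ((va >> PGD_SHIFT) + 1) << PGD_SHIFT;
			continue;
		}
		pt = pmd->pmd[(va >> PMD_SHIFT) & (PTRS - 1)];
		if (!pt) {
			va = ((va >> PMD_SHIFT) + 1) << PMD_SHIFT;
			continue;
		}
		walk->pte = &pt->pte[(va >> PAGE_SHIFT) & (PTRS - 1)];
		if (*walk->pte) {
			walk->vaddr = va;
			return;
		}
		va += PAGE_SIZE;
	}
	walk->pte = NULL;
	walk->vaddr = ULONG_MAX; /* ends any loop with end <= ADDR_LIMIT */
}

static void pte_walk_descend(struct mm *mm, struct pte_walk_state *walk,
						unsigned long start)
{
	walk->mm = mm;
	walk->vaddr = start & ~(PAGE_SIZE - 1);
	pte_walk_seek(walk);
}

static void next_inuse_pte(struct pte_walk_state *walk)
{
	walk->vaddr += PAGE_SIZE;
	pte_walk_seek(walk);
}

#define for_each_inuse_pte(mm, walk, start, end) \
	for (pte_walk_descend(mm, walk, start); (walk)->vaddr < (end); \
						next_inuse_pte(walk))

/* Hypothetical helper: install a pte value, allocating tables on demand. */
static int map_page(struct mm *mm, unsigned long vaddr, unsigned long val)
{
	struct pmd_table **pmd = &mm->pgd[vaddr >> PGD_SHIFT];
	struct pte_table **pt;

	assert(vaddr < ADDR_LIMIT);
	if (!*pmd && !(*pmd = calloc(1, sizeof(**pmd))))
		return -1;
	pt = &(*pmd)->pmd[(vaddr >> PMD_SHIFT) & (PTRS - 1)];
	if (!*pt && !(*pt = calloc(1, sizeof(**pt))))
		return -1;
	(*pt)->pte[(vaddr >> PAGE_SHIFT) & (PTRS - 1)] = val;
	return 0;
}

/* Hypothetical caller: the loop body never sees pgd or pmd levels. */
static unsigned long count_inuse(struct mm *mm, unsigned long start,
						unsigned long end)
{
	struct pte_walk_state walk;
	unsigned long n = 0;

	for_each_inuse_pte(mm, &walk, start, end)
		n++;
	return n;
}
```

The point of the design is that a caller such as `count_inuse()` is written once against the walk state, so the number of levels, or a bitmask or radix-tree layout underneath, could change without touching it.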


-- wli