public inbox for linux-arch@vger.kernel.org
From: William Lee Irwin III <wli@holomorphy.com>
To: "David S. Miller" <davem@redhat.com>
Cc: torvalds@osdl.org, linux-arch@vger.kernel.org
Subject: Re: copy_page_range()
Date: Sat, 7 Aug 2004 01:07:51 -0700
Message-ID: <20040807080751.GX17188@holomorphy.com>
In-Reply-To: <20040807000529.5ca6e8fe.davem@redhat.com>

On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> Every couple months I look at this thing.
> The main issue is that it's very cache unfriendly,
> especially with how sparsely populated the page tables
> are for 64-bit processes.
> As a simple example, it's at the top of the kernel
> profile for 64-bit lat_proc {fork,exec,shell} on
> sparc64.
> And it's in fact the pmd array scans that take all
> of the cache misses, and thus most of the run time.
> An idea I've always been entertaining is to associate
> a bitmask with each pmd table.  For example, a possible
> current implementation could be to abuse page_struct->index
> for this bitmask, and use virt_to_page(pmdp)->index to get
> at it.

Sounds generally reasonable.


On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> This divides the pmd table into BITS_PER_LONG sections.
> If the bit is set in ->index then we populated at least
> one of the pmd entries in that section.  We never clear
> bits, except at pmd table allocation time.
> Then the pmd scan iterates over ->index, and only actually
> dereferences the pmd entries iff it finds a set bit, and
> it only dereferences the section of pmd entries represented
> by that bit.
> Another idea I've also considered is to implement the
> pgd/pmd levels as a more compact tree, based upon virtual
> address, such as a radix tree.
> I think all of this could be experimented with if we
> abstracted out the pmd/pgd/pte iteration.  So much stuff
> in the kernel mm code is of the form:
> 	for_each_pgd(pgdp)
> 		for_each_pmd(pgdp, pmdp)
> 			for_each_pte(pmdp, ptep)
> 				do_something(ptep)
> At 2-levels, as on most of the 32-bit platforms, things
> aren't so bad.
> Comments?
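
A minimal user-space sketch of the section-bitmask scan described in the
quoted text, under stated assumptions: the table size, the struct pmd_table
layout and the pmd_table_set()/pmd_table_scan() helpers are hypothetical,
and a plain "populated" word stands in for virt_to_page(pmdp)->index. It is
meant only to show how the scan skips sections whose bit is clear, not how
the kernel would implement it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Toy model, not kernel code: sizes and names below are assumptions. */
#define PTRS_PER_PMD		1024
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define PMDS_PER_SECTION	(PTRS_PER_PMD / BITS_PER_LONG)

struct pmd_table {
	unsigned long populated;		/* one bit per section of the table */
	unsigned long pmd[PTRS_PER_PMD];	/* 0 means "no entry" */
};

/* Install an entry and mark its section; bits are never cleared here. */
static void pmd_table_set(struct pmd_table *t, size_t idx, unsigned long val)
{
	assert(idx < PTRS_PER_PMD);
	t->pmd[idx] = val;
	if (val)
		t->populated |= 1UL << (idx / PMDS_PER_SECTION);
}

/*
 * Count populated entries, dereferencing only the sections whose bit is
 * set. *touched receives the number of pmd slots actually read.
 */
static size_t pmd_table_scan(const struct pmd_table *t, size_t *touched)
{
	size_t found = 0;

	*touched = 0;
	for (size_t bit = 0; bit < BITS_PER_LONG; bit++) {
		size_t base = bit * PMDS_PER_SECTION;

		if (!(t->populated & (1UL << bit)))
			continue;	/* whole section skipped, no cache misses */
		for (size_t i = base; i < base + PMDS_PER_SECTION; i++) {
			(*touched)++;
			if (t->pmd[i])
				found++;
		}
	}
	return found;
}
```

With two entries in different sections, the scan reads two sections' worth
of slots rather than all PTRS_PER_PMD of them.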

The number of levels can be abstracted easily. To give an idea of how,
something like this:

struct pte_walk_state {
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long vaddr;
};

int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
						struct vm_area_struct *vma)
{
	int cow, ret = 0;
	struct pte_walk_state walk_parent, walk_child;

	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	spin_lock(&dst->page_table_lock);
	pte_walk_descend_and_create(dst, &walk_child, vma->vm_start);
	for_each_inuse_pte(src, &walk_parent, vma->vm_start, vma->vm_end) {
		if (pte_walk_move_and_create(&walk_child, walk_parent.vaddr)) {
			ret = -ENOMEM;
			break;
		}
		/*
		 * do stuff to child and parent ptes
		 */
		...
	}
	spin_unlock(&dst->page_table_lock);
	return ret;
}

void zap_page_range(struct vm_area_struct *vma, unsigned long start,
				unsigned long len, struct zap_details *details)
{
	struct pte_walk_state walk;

	spin_lock(&vma->vm_mm->page_table_lock);
	for_each_inuse_pte(vma->vm_mm, &walk, start, start + len) {
		/*
		 * wipe pte and do stuff
		 */
		...
	}
	spin_unlock(&vma->vm_mm->page_table_lock);
}

where

#define for_each_inuse_pte(mm, walk, start, end) \
	for (pte_walk_descend(mm, walk, start); (walk)->vaddr < (end); \
						next_inuse_pte(walk))

etc.
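
To make the iterator shape concrete, here is a user-space sketch over a toy
three-level table. It is an illustration under assumptions, not the kernel
implementation: every toy_* name, the 4-bit level width, the extra mm field
in the walk state, and the convention that the helpers park vaddr at an
all-ones sentinel when nothing is left are hypothetical. Only the
for_each_inuse_pte() macro is taken as written above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model; all toy_* names and the mm field are assumptions. */
#define TOY_BITS	4	/* index bits per level */
#define TOY_PTRS	(1UL << TOY_BITS)
#define TOY_PAGE_SHIFT	12
#define TOY_PMD_SHIFT	(TOY_PAGE_SHIFT + TOY_BITS)
#define TOY_PGD_SHIFT	(TOY_PMD_SHIFT + TOY_BITS)
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)
#define TOY_PMD_SIZE	(1UL << TOY_PMD_SHIFT)
#define TOY_PGD_SIZE	(1UL << TOY_PGD_SHIFT)
#define TOY_ADDR_LIMIT	(1UL << (TOY_PGD_SHIFT + TOY_BITS))
#define TOY_WALK_DONE	(~0UL)
#define toy_index(va, shift)	(((va) >> (shift)) & (TOY_PTRS - 1))

typedef unsigned long toy_pte_t;	/* 0 = not in use */
typedef toy_pte_t *toy_pmd_t;		/* NULL = no pte table */
typedef toy_pmd_t *toy_pgd_t;		/* NULL = no pmd table */

struct toy_mm { toy_pgd_t pgd[TOY_PTRS]; };

struct pte_walk_state {
	struct toy_mm *mm;
	toy_pgd_t *pgd;
	toy_pmd_t *pmd;
	toy_pte_t *pte;
	unsigned long vaddr;
};

/* Advance to the first in-use pte at or after walk->vaddr. */
static void toy_seek_inuse(struct pte_walk_state *walk)
{
	unsigned long va = walk->vaddr;

	while (va < TOY_ADDR_LIMIT) {
		toy_pgd_t *pgd = &walk->mm->pgd[toy_index(va, TOY_PGD_SHIFT)];
		toy_pmd_t *pmd;
		toy_pte_t *pte;

		if (!*pgd) {		/* skip the whole pgd-sized range */
			va = (va | (TOY_PGD_SIZE - 1)) + 1;
			continue;
		}
		pmd = &(*pgd)[toy_index(va, TOY_PMD_SHIFT)];
		if (!*pmd) {		/* skip the whole pmd-sized range */
			va = (va | (TOY_PMD_SIZE - 1)) + 1;
			continue;
		}
		pte = &(*pmd)[toy_index(va, TOY_PAGE_SHIFT)];
		if (*pte) {
			walk->pgd = pgd;
			walk->pmd = pmd;
			walk->pte = pte;
			walk->vaddr = va;
			return;
		}
		va += TOY_PAGE_SIZE;
	}
	walk->pgd = NULL;
	walk->pmd = NULL;
	walk->pte = NULL;
	walk->vaddr = TOY_WALK_DONE;	/* ends any (walk)->vaddr < (end) loop */
}

static void pte_walk_descend(struct toy_mm *mm, struct pte_walk_state *walk,
			     unsigned long start)
{
	walk->mm = mm;
	walk->vaddr = start & ~(TOY_PAGE_SIZE - 1);
	toy_seek_inuse(walk);
}

static void next_inuse_pte(struct pte_walk_state *walk)
{
	assert(walk->vaddr != TOY_WALK_DONE);
	walk->vaddr += TOY_PAGE_SIZE;
	toy_seek_inuse(walk);
}

#define for_each_inuse_pte(mm, walk, start, end) \
	for (pte_walk_descend(mm, walk, start); (walk)->vaddr < (end); \
						next_inuse_pte(walk))

/* Install a pte, allocating intermediate tables on demand. */
static int toy_map(struct toy_mm *mm, unsigned long va, toy_pte_t val)
{
	toy_pgd_t *pgd;
	toy_pmd_t *pmd;

	if (va >= TOY_ADDR_LIMIT)
		return -1;
	pgd = &mm->pgd[toy_index(va, TOY_PGD_SHIFT)];
	if (!*pgd && !(*pgd = calloc(TOY_PTRS, sizeof(**pgd))))
		return -1;
	pmd = &(*pgd)[toy_index(va, TOY_PMD_SHIFT)];
	if (!*pmd && !(*pmd = calloc(TOY_PTRS, sizeof(**pmd))))
		return -1;
	(*pmd)[toy_index(va, TOY_PAGE_SHIFT)] = val;
	return 0;
}

/* Example caller: count in-use ptes in [start, end). */
static size_t toy_count_inuse(struct toy_mm *mm, unsigned long start,
			      unsigned long end)
{
	struct pte_walk_state walk;
	size_t n = 0;

	for_each_inuse_pte(mm, &walk, start, end)
		n++;
	return n;
}
```

The caller never mentions pgd or pmd, so the number of levels, or a
different upper-level structure entirely, stays hidden behind the two
helpers.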


-- wli

Thread overview: 16+ messages
2004-08-07  7:05 copy_page_range() David S. Miller
2004-08-07  8:07 ` William Lee Irwin III [this message]
2004-08-11  7:07   ` copy_page_range() David S. Miller
2004-08-11  7:35     ` copy_page_range() William Lee Irwin III
2004-08-11 16:13     ` copy_page_range() Linus Torvalds
2004-08-11 20:45       ` copy_page_range() David S. Miller
2004-08-12  3:53       ` copy_page_range() David S. Miller
2004-08-09  9:01 ` copy_page_range() David Mosberger
2004-08-09  9:04   ` copy_page_range() William Lee Irwin III
2004-08-09  9:27     ` copy_page_range() David Mosberger
2004-08-09  9:29       ` copy_page_range() William Lee Irwin III
2004-08-09 10:01         ` copy_page_range() David Mosberger
2004-08-09 17:46       ` copy_page_range() David S. Miller
2004-08-09 17:08     ` copy_page_range() Linus Torvalds
2004-08-09 18:49       ` copy_page_range() William Lee Irwin III
2004-08-09 17:45   ` copy_page_range() David S. Miller
