From: Mel Gorman <mgorman@suse.de>
To: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-mm@kvack.org, linux-s390@vger.kernel.org,
Hugh Dickins <hughd@google.com>, Jan Kara <jack@suse.cz>,
Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Subject: Re: [PATCH] s390/mm: implement software dirty bits
Date: Wed, 6 Feb 2013 11:21:11 +0000 [thread overview]
Message-ID: <20130206112111.GP21389@suse.de> (raw)
In-Reply-To: <1360087925-8456-3-git-send-email-schwidefsky@de.ibm.com>
On Tue, Feb 05, 2013 at 10:12:05AM -0800, Martin Schwidefsky wrote:
> The s390 architecture is unique in respect to dirty page detection,
> it uses the change bit in the per-page storage key to track page
> modifications. All other architectures track dirty bits by means
> of page table entries. This property of s390 has caused numerous
> problems in the past, e.g. see git commit ef5d437f71afdf4a
> "mm: fix XFS oops due to dirty pages without buffers on s390".
>
> To avoid future issues in regard to per-page dirty bits convert
> s390 to a fault based software dirty bit detection mechanism. All
> user page table entries which are marked as clean will be hardware
> read-only, even if the pte is supposed to be writable. A write by
> the user process will trigger a protection fault which will cause
> the user pte to be marked as dirty and the hardware read-only bit
> is removed.
>
> With this change the dirty bit in the storage key is irrelevant
> for Linux as a host, but the storage key is still required for
> KVM guests. The effect is that page_test_and_clear_dirty and the
> related code can be removed. The referenced bit in the storage
> key is still used by the page_test_and_clear_young primitive to
> provide page age information.
>
> For page cache pages of mappings with mapping_cap_account_dirty
> there will not be any change in behavior as the dirty bit tracking
> already uses read-only ptes to control the amount of dirty pages.
> Only for swap cache pages and pages of mappings without
> mapping_cap_account_dirty there can be additional protection faults.
> To avoid an excessive number of additional faults the mk_pte
> primitive checks for PageDirty if the pgprot value allows for writes
> and pre-dirties the pte. That avoids all additional faults for
> tmpfs and shmem pages until these pages are added to the swap cache.
>
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
I have a few clarifications below just to make sure I'm reading it right
but I think it looks fine and the change to mm/rmap.c is welcome.
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 098adbb..2b3d3b6 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -29,6 +29,7 @@
> #ifndef __ASSEMBLY__
> #include <linux/sched.h>
> #include <linux/mm_types.h>
> +#include <linux/page-flags.h>
> #include <asm/bug.h>
> #include <asm/page.h>
>
> @@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
> /* Software bits in the page table entry */
> #define _PAGE_SWT 0x001 /* SW pte type bit t */
> #define _PAGE_SWX 0x002 /* SW pte type bit x */
> -#define _PAGE_SWC 0x004 /* SW pte changed bit (for KVM) */
> -#define _PAGE_SWR 0x008 /* SW pte referenced bit (for KVM) */
> -#define _PAGE_SPECIAL 0x010 /* SW associated with special page */
> +#define _PAGE_SWC 0x004 /* SW pte changed bit */
> +#define _PAGE_SWR 0x008 /* SW pte referenced bit */
> +#define _PAGE_SWW 0x010 /* SW pte write bit */
> +#define _PAGE_SPECIAL 0x020 /* SW associated with special page */
> #define __HAVE_ARCH_PTE_SPECIAL
>
> /* Set of bits not changed in pte_modify */
> -#define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
> +#define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
> + _PAGE_SWC | _PAGE_SWR)
>
If I'm reading it right, the _PAGE_CO is bit is what allows you to force
the hardware to trap even if the PTE says the page is writable. This is
what necessitates the shuffling of _PAGE_SPECIAL so you have a software
write bit and a hardware write bit.
For existing distributions they might not be able to use this patch for
bugs like ""mm: fix XFS oops due to dirty pages without buffers on s390"
because this shuffling of bits will break KABIi but that's not your problem.
> /* Six different types of pages. */
> #define _PAGE_TYPE_EMPTY 0x400
> @@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
>
> /* Bits in the region table entry */
> #define _REGION_ENTRY_ORIGIN ~0xfffUL/* region/segment table origin */
> +#define _REGION_ENTRY_RO 0x200 /* region protection bit */
> #define _REGION_ENTRY_INV 0x20 /* invalid region table entry */
> #define _REGION_ENTRY_TYPE_MASK 0x0c /* region/segment table type mask */
> #define _REGION_ENTRY_TYPE_R1 0x0c /* region first table type */
> @@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
> */
> #define PAGE_NONE __pgprot(_PAGE_TYPE_NONE)
> #define PAGE_RO __pgprot(_PAGE_TYPE_RO)
> -#define PAGE_RW __pgprot(_PAGE_TYPE_RW)
> +#define PAGE_RW __pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
> +#define PAGE_RWC __pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
>
> -#define PAGE_KERNEL PAGE_RW
> +#define PAGE_KERNEL PAGE_RWC
> #define PAGE_COPY PAGE_RO
>
> /*
And this combination of page bits looks consistent. The details are
heavily buried in the arch code so I hope however deals with this area
in the future spots the changelog. Of course, that guy is likely to be
you anyway :)
> @@ -631,23 +636,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
> bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
> /* Clear page changed & referenced bit in the storage key */
> if (bits & _PAGE_CHANGED)
> - page_set_storage_key(address, skey ^ bits, 1);
> + page_set_storage_key(address, skey ^ bits, 0);
> else if (bits)
> page_reset_referenced(address);
> /* Transfer page changed & referenced bit to guest bits in pgste */
> pgste_val(pgste) |= bits << 48; /* RCP_GR_BIT & RCP_GC_BIT */
> /* Get host changed & referenced bits from pgste */
> bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
> - /* Clear host bits in pgste. */
> + /* Transfer page changed & referenced bit to kvm user bits */
> + pgste_val(pgste) |= bits << 45; /* KVM_UR_BIT & KVM_UC_BIT */
> + /* Clear relevant host bits in pgste. */
> pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
> pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
> /* Copy page access key and fetch protection bit to pgste */
> pgste_val(pgste) |=
> (unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
> - /* Transfer changed and referenced to kvm user bits */
> - pgste_val(pgste) |= bits << 45; /* KVM_UR_BIT & KVM_UC_BIT */
> - /* Transfer changed & referenced to pte sofware bits */
> - pte_val(*ptep) |= bits << 1; /* _PAGE_SWR & _PAGE_SWC */
> + /* Transfer referenced bit to pte */
> + pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
> #endif
> return pgste;
>
> @@ -660,20 +665,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
>
> if (!pte_present(*ptep))
> return pgste;
> + /* Get referenced bit from storage key */
> young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
> - /* Transfer page referenced bit to pte software bit (host view) */
> - if (young || (pgste_val(pgste) & RCP_HR_BIT))
> + if (young)
> + pgste_val(pgste) |= RCP_GR_BIT;
> + /* Get host referenced bit from pgste */
> + if (pgste_val(pgste) & RCP_HR_BIT) {
> + pgste_val(pgste) &= ~RCP_HR_BIT;
> + young = 1;
> + }
> + /* Transfer referenced bit to kvm user bits and pte */
> + if (young) {
> + pgste_val(pgste) |= KVM_UR_BIT;
> pte_val(*ptep) |= _PAGE_SWR;
> - /* Clear host referenced bit in pgste. */
> - pgste_val(pgste) &= ~RCP_HR_BIT;
> - /* Transfer page referenced bit to guest bit in pgste */
> - pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
> + }
> #endif
> return pgste;
> -
> }
>
> -static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
> +static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
> {
> #ifdef CONFIG_PGSTE
> unsigned long address;
> @@ -687,10 +697,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
> /* Set page access key and fetch protection bit from pgste */
> nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
> if (okey != nkey)
> - page_set_storage_key(address, nkey, 1);
> + page_set_storage_key(address, nkey, 0);
> #endif
> }
>
> +static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
> +{
> + if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
> + /*
> + * Without enhanced suppression-on-protection force
> + * the dirty bit on for all writable ptes.
> + */
> + pte_val(entry) |= _PAGE_SWC;
> + pte_val(entry) &= ~_PAGE_RO;
> + }
> + *ptep = entry;
> +}
> +
> /**
> * struct gmap_struct - guest address space
> * @mm: pointer to the parent mm_struct
So establishing a writable PTE clears the hardware RO override as
well and the changed bit is set so it'll be considered dirty.
Reading down through the other page tables updates, I couldn't spot a place
where you were inconsistent with the handling of the hardware _PAGE_CO
and _PAGE_RO. Writes were allowed when the change bit was already set.
I couldn't follow it all, particularly around the KVM bits so take it with
a grain of salt but for me anyway;
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-02-06 11:21 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-02-05 18:12 [PATCH] software dirty bits for s390 Martin Schwidefsky
2013-02-05 18:12 ` Martin Schwidefsky
2013-02-05 19:28 ` Martin Schwidefsky
2013-02-05 18:12 ` [PATCH] s390/mm: implement software dirty bits Martin Schwidefsky
2013-02-06 11:21 ` Mel Gorman [this message]
2013-02-06 18:16 ` Martin Schwidefsky
2013-02-07 0:20 ` Hugh Dickins
2013-02-07 19:18 ` Martin Schwidefsky
2013-02-11 14:27 ` Martin Schwidefsky
2013-02-11 22:08 ` Hugh Dickins
2013-02-12 8:46 ` Martin Schwidefsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130206112111.GP21389@suse.de \
--to=mgorman@suse.de \
--cc=ehrhardt@linux.vnet.ibm.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=linux-mm@kvack.org \
--cc=linux-s390@vger.kernel.org \
--cc=schwidefsky@de.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.