From mboxrd@z Thu Jan 1 00:00:00 1970
From: jglisse@redhat.com
Subject: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback
Date: Mon, 16 Oct 2017 23:10:01 -0400
Message-ID: <20171017031003.7481-1-jglisse@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Return-path:
Sender: owner-linux-mm@kvack.org
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
 Andrea Arcangeli, Andrew Morton, Joerg Roedel, Suravee Suthikulpanit,
 David Woodhouse, Alistair Popple, Michael Ellerman,
 Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan,
 iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org
List-Id: iommu@lists.linux-foundation.org

From: Jérôme Glisse

(Andrew, you already have v1 of patch 1 in your queue; patch 2 is new.
I think you can drop v1 of patch 1 in favor of v2: v2 is a bit more
conservative and I fixed typos.)

All of this only affects users of the invalidate_range callback (at this
time CAPI in arch/powerpc/platforms/powernv/npu-dma.c and IOMMU ATS/PASID
in drivers/iommu/amd_iommu_v2.c|intel-svm.c).

This patchset removes useless double calls to the
mmu_notifier->invalidate_range callback wherever it is safe to do so.
The first patch just removes the useless calls and adds documentation
explaining why it is safe to do so. The second patch goes further by
introducing mmu_notifier_invalidate_range_only_end(), which skips the
call to invalidate_range. This can be done when a pte, pmd or pud is
cleared with the notify variant of the helpers, which already calls
invalidate_range right after clearing, under the page table lock.

It should improve performance, but I am lacking hardware and benchmarks
which might show an improvement. Maybe folks in Cc can help here.

Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Joerg Roedel
Cc: Suravee Suthikulpanit
Cc: David Woodhouse
Cc: Alistair Popple
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Stephen Rothwell
Cc: Andrew Donnellan
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org

Jérôme Glisse (2):
  mm/mmu_notifier: avoid double notification when it is useless v2
  mm/mmu_notifier: avoid call to invalidate_range() in range_end()

 Documentation/vm/mmu_notifier.txt | 93 +++++++++++++++++++++++++++++++++++++++
 fs/dax.c                          |  9 +++-
 include/linux/mmu_notifier.h      | 20 +++++++--
 mm/huge_memory.c                  | 66 ++++++++++++++++++++++++---
 mm/hugetlb.c                      | 16 +++++--
 mm/ksm.c                          | 15 ++++++-
 mm/memory.c                       |  6 ++-
 mm/migrate.c                      | 15 +++++--
 mm/mmu_notifier.c                 | 11 ++++-
 mm/rmap.c                         | 59 ++++++++++++++++++++++---
 10 files changed, 281 insertions(+), 29 deletions(-)
 create mode 100644 Documentation/vm/mmu_notifier.txt

--
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org Subject: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Mon, 16 Oct 2017 23:10:02 -0400 Message-ID: <20171017031003.7481-2-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Return-path: In-Reply-To: <20171017031003.7481-1-jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org Errors-To: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: Andrea Arcangeli , Stephen Rothwell , Joerg Roedel , Benjamin Herrenschmidt , Andrew Donnellan , linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= , linux-next-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Michael Ellerman , Alistair Popple , Andrew Morton , Linus Torvalds , David Woodhouse List-Id: iommu@lists.linux-foundation.org RnJvbTogSsOpcsO0bWUgR2xpc3NlIDxqZ2xpc3NlQHJlZGhhdC5jb20+CgpUaGlzIHBhdGNoIG9u bHkgYWZmZWN0cyB1c2VycyBvZiBtbXVfbm90aWZpZXItPmludmFsaWRhdGVfcmFuZ2UgY2FsbGJh Y2sKd2hpY2ggYXJlIGRldmljZSBkcml2ZXJzIHJlbGF0ZWQgdG8gQVRTL1BBU0lELCBDQVBJLCBJ T01NVXYyLCBTVk0gLi4uCmFuZCBpdCBpcyBhbiBvcHRpbWl6YXRpb24gZm9yIHRob3NlIHVzZXJz LiBFdmVyeW9uZSBlbHNlIGlzIHVuYWZmZWN0ZWQKYnkgaXQuCgpXaGVuIGNsZWFyaW5nIGEgcHRl L3BtZCB3ZSBhcmUgZ2l2ZW4gYSBjaG9pY2UgdG8gbm90aWZ5IHRoZSBldmVudCB1bmRlcgp0aGUg cGFnZSB0YWJsZSBsb2NrIChub3RpZnkgdmVyc2lvbiBvZiAqX2NsZWFyX2ZsdXNoIGhlbHBlcnMg ZG8gY2FsbCB0aGUKbW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2UpLiBCdXQgdGhhdCBub3Rp 
ZmljYXRpb24gaXMgbm90IG5lY2Vzc2FyeSBpbgphbGwgY2FzZXMuCgpUaGlzIHBhdGNoZXMgcmVt b3ZlIGFsbW9zdCBhbGwgY2FzZXMgd2hlcmUgaXQgaXMgdXNlbGVzcyB0byBoYXZlIGEgY2FsbAp0 byBtbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZSBiZWZvcmUgbW11X25vdGlmaWVyX2ludmFs aWRhdGVfcmFuZ2VfZW5kLgpJdCBhbHNvIGFkZHMgZG9jdW1lbnRhdGlvbiBpbiBhbGwgdGhvc2Ug Y2FzZXMgZXhwbGFpbmluZyB3aHkuCgpCZWxvdyBpcyBhIG1vcmUgaW4gZGVwdGggYW5hbHlzaXMg b2Ygd2h5IHRoaXMgaXMgZmluZSB0byBkbyB0aGlzOgoKRm9yIHNlY29uZGFyeSBUTEIgKG5vbiBD UFUgVExCKSBsaWtlIElPTU1VIFRMQiBvciBkZXZpY2UgVExCICh3aGVuIGRldmljZQp1c2UgdGhp bmcgbGlrZSBBVFMvUEFTSUQgdG8gZ2V0IHRoZSBJT01NVSB0byB3YWxrIHRoZSBDUFUgcGFnZSB0 YWJsZSB0bwphY2Nlc3MgYSBwcm9jZXNzIHZpcnR1YWwgYWRkcmVzcyBzcGFjZSkuIFRoZXJlIGlz IG9ubHkgMiBjYXNlcyB3aGVuIHlvdQpuZWVkIHRvIG5vdGlmeSB0aG9zZSBzZWNvbmRhcnkgVExC IHdoaWxlIGhvbGRpbmcgcGFnZSB0YWJsZSBsb2NrIHdoZW4KY2xlYXJpbmcgYSBwdGUvcG1kOgoK ICBBKSBwYWdlIGJhY2tpbmcgYWRkcmVzcyBpcyBmcmVlIGJlZm9yZSBtbXVfbm90aWZpZXJfaW52 YWxpZGF0ZV9yYW5nZV9lbmQKICBCKSBhIHBhZ2UgdGFibGUgZW50cnkgaXMgdXBkYXRlZCB0byBw b2ludCB0byBhIG5ldyBwYWdlIChDT1csIHdyaXRlIGZhdWx0CiAgICAgb24gemVybyBwYWdlLCBf X3JlcGxhY2VfcGFnZSgpLCAuLi4pCgpDYXNlIEEgaXMgb2J2aW91cyB5b3UgZG8gbm90IHdhbnQg dG8gdGFrZSB0aGUgcmlzayBmb3IgdGhlIGRldmljZSB0byB3cml0ZQp0byBhIHBhZ2UgdGhhdCBt aWdodCBub3cgYmUgdXNlZCBieSBzb21ldGhpbmcgY29tcGxldGVseSBkaWZmZXJlbnQuCgpDYXNl IEIgaXMgbW9yZSBzdWJ0bGUuIEZvciBjb3JyZWN0bmVzcyBpdCByZXF1aXJlcyB0aGUgZm9sbG93 aW5nIHNlcXVlbmNlCnRvIGhhcHBlbjoKICAtIHRha2UgcGFnZSB0YWJsZSBsb2NrCiAgLSBjbGVh ciBwYWdlIHRhYmxlIGVudHJ5IGFuZCBub3RpZnkgKHBtZC9wdGVfaHVnZV9jbGVhcl9mbHVzaF9u b3RpZnkoKSkKICAtIHNldCBwYWdlIHRhYmxlIGVudHJ5IHRvIHBvaW50IHRvIG5ldyBwYWdlCgpJ ZiBjbGVhcmluZyB0aGUgcGFnZSB0YWJsZSBlbnRyeSBpcyBub3QgZm9sbG93ZWQgYnkgYSBub3Rp ZnkgYmVmb3JlIHNldHRpbmcKdGhlIG5ldyBwdGUvcG1kIHZhbHVlIHRoZW4geW91IGNhbiBicmVh ayBtZW1vcnkgbW9kZWwgbGlrZSBDMTEgb3IgQysrMTEgZm9yCnRoZSBkZXZpY2UuCgpDb25zaWRl ciB0aGUgZm9sbG93aW5nIHNjZW5hcmlvIChkZXZpY2UgdXNlIGEgZmVhdHVyZSBzaW1pbGFyIHRv 
IEFUUy8KUEFTSUQpOgoKVHdvIGFkZHJlc3MgYWRkckEgYW5kIGFkZHJCIHN1Y2ggdGhhdCB8YWRk ckEgLSBhZGRyQnwgPj0gUEFHRV9TSVpFIHdlCmFzc3VtZSB0aGV5IGFyZSB3cml0ZSBwcm90ZWN0 ZWQgZm9yIENPVyAob3RoZXIgY2FzZSBvZiBCIGFwcGx5IHRvbykuCgpbVGltZSBOXSAtLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LQpDUFUtdGhyZWFkLTAgIHt0cnkgdG8gd3JpdGUgdG8gYWRkckF9CkNQVS10aHJlYWQtMSAge3Ry eSB0byB3cml0ZSB0byBhZGRyQn0KQ1BVLXRocmVhZC0yICB7fQpDUFUtdGhyZWFkLTMgIHt9CkRF Vi10aHJlYWQtMCAge3JlYWQgYWRkckEgYW5kIHBvcHVsYXRlIGRldmljZSBUTEJ9CkRFVi10aHJl YWQtMiAge3JlYWQgYWRkckIgYW5kIHBvcHVsYXRlIGRldmljZSBUTEJ9CltUaW1lIE4rMV0gLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tCkNQVS10aHJlYWQtMCAge0NPV19zdGVwMDoge21tdV9ub3RpZmllcl9pbnZhbGlkYXRlX3Jh bmdlX3N0YXJ0KGFkZHJBKX19CkNQVS10aHJlYWQtMSAge0NPV19zdGVwMDoge21tdV9ub3RpZmll cl9pbnZhbGlkYXRlX3JhbmdlX3N0YXJ0KGFkZHJCKX19CkNQVS10aHJlYWQtMiAge30KQ1BVLXRo cmVhZC0zICB7fQpERVYtdGhyZWFkLTAgIHt9CkRFVi10aHJlYWQtMiAge30KW1RpbWUgTisyXSAt LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0KQ1BVLXRocmVhZC0wICB7Q09XX3N0ZXAxOiB7dXBkYXRlIHBhZ2UgdGFibGUgcG9pbnQg dG8gbmV3IHBhZ2UgZm9yIGFkZHJBfX0KQ1BVLXRocmVhZC0xICB7Q09XX3N0ZXAxOiB7dXBkYXRl IHBhZ2UgdGFibGUgcG9pbnQgdG8gbmV3IHBhZ2UgZm9yIGFkZHJCfX0KQ1BVLXRocmVhZC0yICB7 fQpDUFUtdGhyZWFkLTMgIHt9CkRFVi10aHJlYWQtMCAge30KREVWLXRocmVhZC0yICB7fQpbVGlt ZSBOKzNdIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLQpDUFUtdGhyZWFkLTAgIHtwcmVlbXB0ZWR9CkNQVS10aHJlYWQtMSAge3By ZWVtcHRlZH0KQ1BVLXRocmVhZC0yICB7d3JpdGUgdG8gYWRkckEgd2hpY2ggaXMgYSB3cml0ZSB0 byBuZXcgcGFnZX0KQ1BVLXRocmVhZC0zICB7fQpERVYtdGhyZWFkLTAgIHt9CkRFVi10aHJlYWQt MiAge30KW1RpbWUgTiszXSAtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KQ1BVLXRocmVhZC0wICB7cHJlZW1wdGVkfQpDUFUtdGhy ZWFkLTEgIHtwcmVlbXB0ZWR9CkNQVS10aHJlYWQtMiAge30KQ1BVLXRocmVhZC0zICB7d3JpdGUg 
dG8gYWRkckIgd2hpY2ggaXMgYSB3cml0ZSB0byBuZXcgcGFnZX0KREVWLXRocmVhZC0wICB7fQpE RVYtdGhyZWFkLTIgIHt9CltUaW1lIE4rNF0gLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCkNQVS10aHJlYWQtMCAge3ByZWVtcHRl ZH0KQ1BVLXRocmVhZC0xICB7Q09XX3N0ZXAzOiB7bW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFu Z2VfZW5kKGFkZHJCKX19CkNQVS10aHJlYWQtMiAge30KQ1BVLXRocmVhZC0zICB7fQpERVYtdGhy ZWFkLTAgIHt9CkRFVi10aHJlYWQtMiAge30KW1RpbWUgTis1XSAtLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KQ1BVLXRocmVhZC0w ICB7cHJlZW1wdGVkfQpDUFUtdGhyZWFkLTEgIHt9CkNQVS10aHJlYWQtMiAge30KQ1BVLXRocmVh ZC0zICB7fQpERVYtdGhyZWFkLTAgIHtyZWFkIGFkZHJBIGZyb20gb2xkIHBhZ2V9CkRFVi10aHJl YWQtMiAge3JlYWQgYWRkckIgZnJvbSBuZXcgcGFnZX0KClNvIGhlcmUgYmVjYXVzZSBhdCB0aW1l IE4rMiB0aGUgY2xlYXIgcGFnZSB0YWJsZSBlbnRyeSB3YXMgbm90IHBhaXIgd2l0aCBhCm5vdGlm aWNhdGlvbiB0byBpbnZhbGlkYXRlIHRoZSBzZWNvbmRhcnkgVExCLCB0aGUgZGV2aWNlIHNlZSB0 aGUgbmV3IHZhbHVlCmZvciBhZGRyQiBiZWZvcmUgc2VpbmcgdGhlIG5ldyB2YWx1ZSBmb3IgYWRk ckEuIFRoaXMgYnJlYWsgdG90YWwgbWVtb3J5Cm9yZGVyaW5nIGZvciB0aGUgZGV2aWNlLgoKV2hl biBjaGFuZ2luZyBhIHB0ZSB0byB3cml0ZSBwcm90ZWN0IG9yIHRvIHBvaW50IHRvIGEgbmV3IHdy aXRlIHByb3RlY3RlZApwYWdlIHdpdGggc2FtZSBjb250ZW50IChLU00pIGl0IGlzIG9rIHRvIGRl bGF5IGludmFsaWRhdGVfcmFuZ2UgY2FsbGJhY2sgdG8KbW11X25vdGlmaWVyX2ludmFsaWRhdGVf cmFuZ2VfZW5kKCkgb3V0c2lkZSB0aGUgcGFnZSB0YWJsZSBsb2NrLiBUaGlzIGlzCnRydWUgZXZl biBpZiB0aGUgdGhyZWFkIGRvaW5nIHBhZ2UgdGFibGUgdXBkYXRlIGlzIHByZWVtcHRlZCByaWdo dCBhZnRlcgpyZWxlYXNpbmcgcGFnZSB0YWJsZSBsb2NrIGJlZm9yZSBjYWxsaW5nIG1tdV9ub3Rp Zmllcl9pbnZhbGlkYXRlX3JhbmdlX2VuZAoKQ2hhbmdlZCBzaW5jZSB2MToKICAtIHR5cG9zICh0 aGFua3MgdG8gQW5kcmVhKQogIC0gQXZvaWQgdW5uZWNlc3NhcnkgcHJlY2F1dGlvbiBpbiB0cnlf dG9fdW5tYXAoKSAoQW5kcmVhKQogIC0gQmUgbW9yZSBjb25zZXJ2YXRpdmUgaW4gdHJ5X3RvX3Vu bWFwX29uZSgpCgpTaWduZWQtb2ZmLWJ5OiBKw6lyw7RtZSBHbGlzc2UgPGpnbGlzc2VAcmVkaGF0 LmNvbT4KQ2M6IEFuZHJlYSBBcmNhbmdlbGkgPGFhcmNhbmdlQHJlZGhhdC5jb20+CkNjOiBOYWRh 
diBBbWl0IDxuYWRhdi5hbWl0QGdtYWlsLmNvbT4KQ2M6IExpbnVzIFRvcnZhbGRzIDx0b3J2YWxk c0BsaW51eC1mb3VuZGF0aW9uLm9yZz4KQ2M6IEFuZHJldyBNb3J0b24gPGFrcG1AbGludXgtZm91 bmRhdGlvbi5vcmc+CkNjOiBKb2VyZyBSb2VkZWwgPGpyb2VkZWxAc3VzZS5kZT4KQ2M6IFN1cmF2 ZWUgU3V0aGlrdWxwYW5pdCA8c3VyYXZlZS5zdXRoaWt1bHBhbml0QGFtZC5jb20+CkNjOiBEYXZp ZCBXb29kaG91c2UgPGR3bXcyQGluZnJhZGVhZC5vcmc+CkNjOiBBbGlzdGFpciBQb3BwbGUgPGFs aXN0YWlyQHBvcHBsZS5pZC5hdT4KQ2M6IE1pY2hhZWwgRWxsZXJtYW4gPG1wZUBlbGxlcm1hbi5p ZC5hdT4KQ2M6IEJlbmphbWluIEhlcnJlbnNjaG1pZHQgPGJlbmhAa2VybmVsLmNyYXNoaW5nLm9y Zz4KQ2M6IFN0ZXBoZW4gUm90aHdlbGwgPHNmckBjYW5iLmF1dWcub3JnLmF1PgpDYzogQW5kcmV3 IERvbm5lbGxhbiA8YW5kcmV3LmRvbm5lbGxhbkBhdTEuaWJtLmNvbT4KCkNjOiBpb21tdUBsaXN0 cy5saW51eC1mb3VuZGF0aW9uLm9yZwpDYzogbGludXhwcGMtZGV2QGxpc3RzLm96bGFicy5vcmcK Q2M6IGxpbnV4LW5leHRAdmdlci5rZXJuZWwub3JnCi0tLQogRG9jdW1lbnRhdGlvbi92bS9tbXVf bm90aWZpZXIudHh0IHwgOTMgKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysr CiBmcy9kYXguYyAgICAgICAgICAgICAgICAgICAgICAgICAgfCAgOSArKystCiBpbmNsdWRlL2xp bnV4L21tdV9ub3RpZmllci5oICAgICAgfCAgMyArLQogbW0vaHVnZV9tZW1vcnkuYyAgICAgICAg ICAgICAgICAgIHwgMjAgKysrKysrKy0tCiBtbS9odWdldGxiLmMgICAgICAgICAgICAgICAgICAg ICAgfCAxNiArKysrKy0tCiBtbS9rc20uYyAgICAgICAgICAgICAgICAgICAgICAgICAgfCAxNSAr KysrKystCiBtbS9ybWFwLmMgICAgICAgICAgICAgICAgICAgICAgICAgfCA1OSArKysrKysrKysr KysrKysrKysrKysrLS0tCiA3IGZpbGVzIGNoYW5nZWQsIDE5OCBpbnNlcnRpb25zKCspLCAxNyBk ZWxldGlvbnMoLSkKIGNyZWF0ZSBtb2RlIDEwMDY0NCBEb2N1bWVudGF0aW9uL3ZtL21tdV9ub3Rp Zmllci50eHQKCmRpZmYgLS1naXQgYS9Eb2N1bWVudGF0aW9uL3ZtL21tdV9ub3RpZmllci50eHQg Yi9Eb2N1bWVudGF0aW9uL3ZtL21tdV9ub3RpZmllci50eHQKbmV3IGZpbGUgbW9kZSAxMDA2NDQK aW5kZXggMDAwMDAwMDAwMDAwLi4yM2I0NjI1NjZiYjcKLS0tIC9kZXYvbnVsbAorKysgYi9Eb2N1 bWVudGF0aW9uL3ZtL21tdV9ub3RpZmllci50eHQKQEAgLTAsMCArMSw5MyBAQAorV2hlbiBkbyB5 b3UgbmVlZCB0byBub3RpZnkgaW5zaWRlIHBhZ2UgdGFibGUgbG9jayA/CisKK1doZW4gY2xlYXJp bmcgYSBwdGUvcG1kIHdlIGFyZSBnaXZlbiBhIGNob2ljZSB0byBub3RpZnkgdGhlIGV2ZW50IHRo 
cm91Z2gKKyhub3RpZnkgdmVyc2lvbiBvZiAqX2NsZWFyX2ZsdXNoIGNhbGwgbW11X25vdGlmaWVy X2ludmFsaWRhdGVfcmFuZ2UpIHVuZGVyCit0aGUgcGFnZSB0YWJsZSBsb2NrLiBCdXQgdGhhdCBu b3RpZmljYXRpb24gaXMgbm90IG5lY2Vzc2FyeSBpbiBhbGwgY2FzZXMuCisKK0ZvciBzZWNvbmRh cnkgVExCIChub24gQ1BVIFRMQikgbGlrZSBJT01NVSBUTEIgb3IgZGV2aWNlIFRMQiAod2hlbiBk ZXZpY2UgdXNlCit0aGluZyBsaWtlIEFUUy9QQVNJRCB0byBnZXQgdGhlIElPTU1VIHRvIHdhbGsg dGhlIENQVSBwYWdlIHRhYmxlIHRvIGFjY2VzcyBhCitwcm9jZXNzIHZpcnR1YWwgYWRkcmVzcyBz cGFjZSkuIFRoZXJlIGlzIG9ubHkgMiBjYXNlcyB3aGVuIHlvdSBuZWVkIHRvIG5vdGlmeQordGhv c2Ugc2Vjb25kYXJ5IFRMQiB3aGlsZSBob2xkaW5nIHBhZ2UgdGFibGUgbG9jayB3aGVuIGNsZWFy aW5nIGEgcHRlL3BtZDoKKworICBBKSBwYWdlIGJhY2tpbmcgYWRkcmVzcyBpcyBmcmVlIGJlZm9y ZSBtbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZV9lbmQoKQorICBCKSBhIHBhZ2UgdGFibGUg ZW50cnkgaXMgdXBkYXRlZCB0byBwb2ludCB0byBhIG5ldyBwYWdlIChDT1csIHdyaXRlIGZhdWx0 CisgICAgIG9uIHplcm8gcGFnZSwgX19yZXBsYWNlX3BhZ2UoKSwgLi4uKQorCitDYXNlIEEgaXMg b2J2aW91cyB5b3UgZG8gbm90IHdhbnQgdG8gdGFrZSB0aGUgcmlzayBmb3IgdGhlIGRldmljZSB0 byB3cml0ZSB0bworYSBwYWdlIHRoYXQgbWlnaHQgbm93IGJlIHVzZWQgYnkgc29tZSBjb21wbGV0 ZWx5IGRpZmZlcmVudCB0YXNrLgorCitDYXNlIEIgaXMgbW9yZSBzdWJ0bGUuIEZvciBjb3JyZWN0 bmVzcyBpdCByZXF1aXJlcyB0aGUgZm9sbG93aW5nIHNlcXVlbmNlIHRvCitoYXBwZW46CisgIC0g dGFrZSBwYWdlIHRhYmxlIGxvY2sKKyAgLSBjbGVhciBwYWdlIHRhYmxlIGVudHJ5IGFuZCBub3Rp ZnkgKFtwbWQvcHRlXXBfaHVnZV9jbGVhcl9mbHVzaF9ub3RpZnkoKSkKKyAgLSBzZXQgcGFnZSB0 YWJsZSBlbnRyeSB0byBwb2ludCB0byBuZXcgcGFnZQorCitJZiBjbGVhcmluZyB0aGUgcGFnZSB0 YWJsZSBlbnRyeSBpcyBub3QgZm9sbG93ZWQgYnkgYSBub3RpZnkgYmVmb3JlIHNldHRpbmcKK3Ro ZSBuZXcgcHRlL3BtZCB2YWx1ZSB0aGVuIHlvdSBjYW4gYnJlYWsgbWVtb3J5IG1vZGVsIGxpa2Ug QzExIG9yIEMrKzExIGZvcgordGhlIGRldmljZS4KKworQ29uc2lkZXIgdGhlIGZvbGxvd2luZyBz Y2VuYXJpbyAoZGV2aWNlIHVzZSBhIGZlYXR1cmUgc2ltaWxhciB0byBBVFMvUEFTSUQpOgorCitU d28gYWRkcmVzcyBhZGRyQSBhbmQgYWRkckIgc3VjaCB0aGF0IHxhZGRyQSAtIGFkZHJCfCA+PSBQ QUdFX1NJWkUgd2UgYXNzdW1lCit0aGV5IGFyZSB3cml0ZSBwcm90ZWN0ZWQgZm9yIENPVyAob3Ro 
ZXIgY2FzZSBvZiBCIGFwcGx5IHRvbykuCisKK1tUaW1lIE5dIC0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCitDUFUtdGhy ZWFkLTAgIHt0cnkgdG8gd3JpdGUgdG8gYWRkckF9CitDUFUtdGhyZWFkLTEgIHt0cnkgdG8gd3Jp dGUgdG8gYWRkckJ9CitDUFUtdGhyZWFkLTIgIHt9CitDUFUtdGhyZWFkLTMgIHt9CitERVYtdGhy ZWFkLTAgIHtyZWFkIGFkZHJBIGFuZCBwb3B1bGF0ZSBkZXZpY2UgVExCfQorREVWLXRocmVhZC0y ICB7cmVhZCBhZGRyQiBhbmQgcG9wdWxhdGUgZGV2aWNlIFRMQn0KK1tUaW1lIE4rMV0gLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tCitDUFUtdGhyZWFkLTAgIHtDT1dfc3RlcDA6IHttbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9y YW5nZV9zdGFydChhZGRyQSl9fQorQ1BVLXRocmVhZC0xICB7Q09XX3N0ZXAwOiB7bW11X25vdGlm aWVyX2ludmFsaWRhdGVfcmFuZ2Vfc3RhcnQoYWRkckIpfX0KK0NQVS10aHJlYWQtMiAge30KK0NQ VS10aHJlYWQtMyAge30KK0RFVi10aHJlYWQtMCAge30KK0RFVi10aHJlYWQtMiAge30KK1tUaW1l IE4rMl0gLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tCitDUFUtdGhyZWFkLTAgIHtDT1dfc3RlcDE6IHt1cGRhdGUgcGFnZSB0 YWJsZSB0byBwb2ludCB0byBuZXcgcGFnZSBmb3IgYWRkckF9fQorQ1BVLXRocmVhZC0xICB7Q09X X3N0ZXAxOiB7dXBkYXRlIHBhZ2UgdGFibGUgdG8gcG9pbnQgdG8gbmV3IHBhZ2UgZm9yIGFkZHJC fX0KK0NQVS10aHJlYWQtMiAge30KK0NQVS10aHJlYWQtMyAge30KK0RFVi10aHJlYWQtMCAge30K K0RFVi10aHJlYWQtMiAge30KK1tUaW1lIE4rM10gLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCitDUFUtdGhyZWFkLTAgIHtw cmVlbXB0ZWR9CitDUFUtdGhyZWFkLTEgIHtwcmVlbXB0ZWR9CitDUFUtdGhyZWFkLTIgIHt3cml0 ZSB0byBhZGRyQSB3aGljaCBpcyBhIHdyaXRlIHRvIG5ldyBwYWdlfQorQ1BVLXRocmVhZC0zICB7 fQorREVWLXRocmVhZC0wICB7fQorREVWLXRocmVhZC0yICB7fQorW1RpbWUgTiszXSAtLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0KK0NQVS10aHJlYWQtMCAge3ByZWVtcHRlZH0KK0NQVS10aHJlYWQtMSAge3ByZWVtcHRlZH0K K0NQVS10aHJlYWQtMiAge30KK0NQVS10aHJlYWQtMyAge3dyaXRlIHRvIGFkZHJCIHdoaWNoIGlz IGEgd3JpdGUgdG8gbmV3IHBhZ2V9CitERVYtdGhyZWFkLTAgIHt9CitERVYtdGhyZWFkLTIgIHt9 
CitbVGltZSBOKzRdIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLQorQ1BVLXRocmVhZC0wICB7cHJlZW1wdGVkfQorQ1BVLXRo cmVhZC0xICB7Q09XX3N0ZXAzOiB7bW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2VfZW5kKGFk ZHJCKX19CitDUFUtdGhyZWFkLTIgIHt9CitDUFUtdGhyZWFkLTMgIHt9CitERVYtdGhyZWFkLTAg IHt9CitERVYtdGhyZWFkLTIgIHt9CitbVGltZSBOKzVdIC0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQorQ1BVLXRocmVhZC0w ICB7cHJlZW1wdGVkfQorQ1BVLXRocmVhZC0xICB7fQorQ1BVLXRocmVhZC0yICB7fQorQ1BVLXRo cmVhZC0zICB7fQorREVWLXRocmVhZC0wICB7cmVhZCBhZGRyQSBmcm9tIG9sZCBwYWdlfQorREVW LXRocmVhZC0yICB7cmVhZCBhZGRyQiBmcm9tIG5ldyBwYWdlfQorCitTbyBoZXJlIGJlY2F1c2Ug YXQgdGltZSBOKzIgdGhlIGNsZWFyIHBhZ2UgdGFibGUgZW50cnkgd2FzIG5vdCBwYWlyIHdpdGgg YQorbm90aWZpY2F0aW9uIHRvIGludmFsaWRhdGUgdGhlIHNlY29uZGFyeSBUTEIsIHRoZSBkZXZp Y2Ugc2VlIHRoZSBuZXcgdmFsdWUgZm9yCithZGRyQiBiZWZvcmUgc2VpbmcgdGhlIG5ldyB2YWx1 ZSBmb3IgYWRkckEuIFRoaXMgYnJlYWsgdG90YWwgbWVtb3J5IG9yZGVyaW5nCitmb3IgdGhlIGRl dmljZS4KKworV2hlbiBjaGFuZ2luZyBhIHB0ZSB0byB3cml0ZSBwcm90ZWN0IG9yIHRvIHBvaW50 IHRvIGEgbmV3IHdyaXRlIHByb3RlY3RlZCBwYWdlCit3aXRoIHNhbWUgY29udGVudCAoS1NNKSBp dCBpcyBmaW5lIHRvIGRlbGF5IHRoZSBtbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZQorY2Fs bCB0byBtbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZV9lbmQoKSBvdXRzaWRlIHRoZSBwYWdl IHRhYmxlIGxvY2suIFRoaXMKK2lzIHRydWUgZXZlbiBpZiB0aGUgdGhyZWFkIGRvaW5nIHRoZSBw YWdlIHRhYmxlIHVwZGF0ZSBpcyBwcmVlbXB0ZWQgcmlnaHQgYWZ0ZXIKK3JlbGVhc2luZyBwYWdl IHRhYmxlIGxvY2sgYnV0IGJlZm9yZSBjYWxsIG1tdV9ub3RpZmllcl9pbnZhbGlkYXRlX3Jhbmdl X2VuZCgpLgpkaWZmIC0tZ2l0IGEvZnMvZGF4LmMgYi9mcy9kYXguYwppbmRleCBmM2E0NGE3YzE0 YjMuLjllYzc5NzQyNGU0ZiAxMDA2NDQKLS0tIGEvZnMvZGF4LmMKKysrIGIvZnMvZGF4LmMKQEAg LTYxNCw2ICs2MTQsMTMgQEAgc3RhdGljIHZvaWQgZGF4X21hcHBpbmdfZW50cnlfbWtjbGVhbihz dHJ1Y3QgYWRkcmVzc19zcGFjZSAqbWFwcGluZywKIAkJaWYgKGZvbGxvd19wdGVfcG1kKHZtYS0+ dm1fbW0sIGFkZHJlc3MsICZzdGFydCwgJmVuZCwgJnB0ZXAsICZwbWRwLCAmcHRsKSkKIAkJCWNv 
bnRpbnVlOwogCisJCS8qCisJCSAqIE5vIG5lZWQgdG8gY2FsbCBtbXVfbm90aWZpZXJfaW52YWxp ZGF0ZV9yYW5nZSgpIGFzIHdlIGFyZQorCQkgKiBkb3duZ3JhZGluZyBwYWdlIHRhYmxlIHByb3Rl Y3Rpb24gbm90IGNoYW5naW5nIGl0IHRvIHBvaW50CisJCSAqIHRvIGEgbmV3IHBhZ2UuCisJCSAq CisJCSAqIFNlZSBEb2N1bWVudGF0aW9uL3ZtL21tdV9ub3RpZmllci50eHQKKwkJICovCiAJCWlm IChwbWRwKSB7CiAjaWZkZWYgQ09ORklHX0ZTX0RBWF9QTUQKIAkJCXBtZF90IHBtZDsKQEAgLTYy OCw3ICs2MzUsNiBAQCBzdGF0aWMgdm9pZCBkYXhfbWFwcGluZ19lbnRyeV9ta2NsZWFuKHN0cnVj dCBhZGRyZXNzX3NwYWNlICptYXBwaW5nLAogCQkJcG1kID0gcG1kX3dycHJvdGVjdChwbWQpOwog CQkJcG1kID0gcG1kX21rY2xlYW4ocG1kKTsKIAkJCXNldF9wbWRfYXQodm1hLT52bV9tbSwgYWRk cmVzcywgcG1kcCwgcG1kKTsKLQkJCW1tdV9ub3RpZmllcl9pbnZhbGlkYXRlX3JhbmdlKHZtYS0+ dm1fbW0sIHN0YXJ0LCBlbmQpOwogdW5sb2NrX3BtZDoKIAkJCXNwaW5fdW5sb2NrKHB0bCk7CiAj ZW5kaWYKQEAgLTY0Myw3ICs2NDksNiBAQCBzdGF0aWMgdm9pZCBkYXhfbWFwcGluZ19lbnRyeV9t a2NsZWFuKHN0cnVjdCBhZGRyZXNzX3NwYWNlICptYXBwaW5nLAogCQkJcHRlID0gcHRlX3dycHJv dGVjdChwdGUpOwogCQkJcHRlID0gcHRlX21rY2xlYW4ocHRlKTsKIAkJCXNldF9wdGVfYXQodm1h LT52bV9tbSwgYWRkcmVzcywgcHRlcCwgcHRlKTsKLQkJCW1tdV9ub3RpZmllcl9pbnZhbGlkYXRl X3JhbmdlKHZtYS0+dm1fbW0sIHN0YXJ0LCBlbmQpOwogdW5sb2NrX3B0ZToKIAkJCXB0ZV91bm1h cF91bmxvY2socHRlcCwgcHRsKTsKIAkJfQpkaWZmIC0tZ2l0IGEvaW5jbHVkZS9saW51eC9tbXVf bm90aWZpZXIuaCBiL2luY2x1ZGUvbGludXgvbW11X25vdGlmaWVyLmgKaW5kZXggNjg2NmU4MTI2 OTgyLi40OWM5MjVjOTZiOGEgMTAwNjQ0Ci0tLSBhL2luY2x1ZGUvbGludXgvbW11X25vdGlmaWVy LmgKKysrIGIvaW5jbHVkZS9saW51eC9tbXVfbm90aWZpZXIuaApAQCAtMTU1LDcgKzE1NSw4IEBA IHN0cnVjdCBtbXVfbm90aWZpZXJfb3BzIHsKIAkgKiBzaGFyZWQgcGFnZS10YWJsZXMsIGl0IG5v dCBuZWNlc3NhcnkgdG8gaW1wbGVtZW50IHRoZQogCSAqIGludmFsaWRhdGVfcmFuZ2Vfc3RhcnQo KS9lbmQoKSBub3RpZmllcnMsIGFzCiAJICogaW52YWxpZGF0ZV9yYW5nZSgpIGFscmVhZCBjYXRj aGVzIHRoZSBwb2ludHMgaW4gdGltZSB3aGVuIGFuCi0JICogZXh0ZXJuYWwgVExCIHJhbmdlIG5l ZWRzIHRvIGJlIGZsdXNoZWQuCisJICogZXh0ZXJuYWwgVExCIHJhbmdlIG5lZWRzIHRvIGJlIGZs dXNoZWQuIEZvciBtb3JlIGluIGRlcHRoCisJICogZGlzY3Vzc2lvbiBvbiB0aGlzIHNlZSBEb2N1 
bWVudGF0aW9uL3ZtL21tdV9ub3RpZmllci50eHQKIAkgKgogCSAqIFRoZSBpbnZhbGlkYXRlX3Jh bmdlKCkgZnVuY3Rpb24gaXMgY2FsbGVkIHVuZGVyIHRoZSBwdGwKIAkgKiBzcGluLWxvY2sgYW5k IG5vdCBhbGxvd2VkIHRvIHNsZWVwLgpkaWZmIC0tZ2l0IGEvbW0vaHVnZV9tZW1vcnkuYyBiL21t L2h1Z2VfbWVtb3J5LmMKaW5kZXggYzAzN2QzZDM0OTUwLi5mZjViYzY0N2I1MWQgMTAwNjQ0Ci0t LSBhL21tL2h1Z2VfbWVtb3J5LmMKKysrIGIvbW0vaHVnZV9tZW1vcnkuYwpAQCAtMTE4Niw4ICsx MTg2LDE1IEBAIHN0YXRpYyBpbnQgZG9faHVnZV9wbWRfd3BfcGFnZV9mYWxsYmFjayhzdHJ1Y3Qg dm1fZmF1bHQgKnZtZiwgcG1kX3Qgb3JpZ19wbWQsCiAJCWdvdG8gb3V0X2ZyZWVfcGFnZXM7CiAJ Vk1fQlVHX09OX1BBR0UoIVBhZ2VIZWFkKHBhZ2UpLCBwYWdlKTsKIAorCS8qCisJICogTGVhdmUg cG1kIGVtcHR5IHVudGlsIHB0ZSBpcyBmaWxsZWQgbm90ZSB3ZSBtdXN0IG5vdGlmeSBoZXJlIGFz CisJICogY29uY3VycmVudCBDUFUgdGhyZWFkIG1pZ2h0IHdyaXRlIHRvIG5ldyBwYWdlIGJlZm9y ZSB0aGUgY2FsbCB0bworCSAqIG1tdV9ub3RpZmllcl9pbnZhbGlkYXRlX3JhbmdlX2VuZCgpIGhh cHBlbnMgd2hpY2ggY2FuIGxlYWQgdG8gYQorCSAqIGRldmljZSBzZWVpbmcgbWVtb3J5IHdyaXRl IGluIGRpZmZlcmVudCBvcmRlciB0aGFuIENQVS4KKwkgKgorCSAqIFNlZSBEb2N1bWVudGF0aW9u L3ZtL21tdV9ub3RpZmllci50eHQKKwkgKi8KIAlwbWRwX2h1Z2VfY2xlYXJfZmx1c2hfbm90aWZ5 KHZtYSwgaGFkZHIsIHZtZi0+cG1kKTsKLQkvKiBsZWF2ZSBwbWQgZW1wdHkgdW50aWwgcHRlIGlz IGZpbGxlZCAqLwogCiAJcGd0YWJsZSA9IHBndGFibGVfdHJhbnNfaHVnZV93aXRoZHJhdyh2bWEt PnZtX21tLCB2bWYtPnBtZCk7CiAJcG1kX3BvcHVsYXRlKHZtYS0+dm1fbW0sICZfcG1kLCBwZ3Rh YmxlKTsKQEAgLTIwMjYsOCArMjAzMywxNSBAQCBzdGF0aWMgdm9pZCBfX3NwbGl0X2h1Z2VfemVy b19wYWdlX3BtZChzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSwKIAlwbWRfdCBfcG1kOwogCWlu dCBpOwogCi0JLyogbGVhdmUgcG1kIGVtcHR5IHVudGlsIHB0ZSBpcyBmaWxsZWQgKi8KLQlwbWRw X2h1Z2VfY2xlYXJfZmx1c2hfbm90aWZ5KHZtYSwgaGFkZHIsIHBtZCk7CisJLyoKKwkgKiBMZWF2 ZSBwbWQgZW1wdHkgdW50aWwgcHRlIGlzIGZpbGxlZCBub3RlIHRoYXQgaXQgaXMgZmluZSB0byBk ZWxheQorCSAqIG5vdGlmaWNhdGlvbiB1bnRpbCBtbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5n ZV9lbmQoKSBhcyB3ZSBhcmUKKwkgKiByZXBsYWNpbmcgYSB6ZXJvIHBtZCB3cml0ZSBwcm90ZWN0 ZWQgcGFnZSB3aXRoIGEgemVybyBwdGUgd3JpdGUKKwkgKiBwcm90ZWN0ZWQgcGFnZS4KKwkgKgor 
CSAqIFNlZSBEb2N1bWVudGF0aW9uL3ZtL21tdV9ub3RpZmllci50eHQKKwkgKi8KKwlwbWRwX2h1 Z2VfY2xlYXJfZmx1c2godm1hLCBoYWRkciwgcG1kKTsKIAogCXBndGFibGUgPSBwZ3RhYmxlX3Ry YW5zX2h1Z2Vfd2l0aGRyYXcobW0sIHBtZCk7CiAJcG1kX3BvcHVsYXRlKG1tLCAmX3BtZCwgcGd0 YWJsZSk7CmRpZmYgLS1naXQgYS9tbS9odWdldGxiLmMgYi9tbS9odWdldGxiLmMKaW5kZXggMTc2 OGVmYTRjNTAxLi42M2E2M2YxYjUzNmMgMTAwNjQ0Ci0tLSBhL21tL2h1Z2V0bGIuYworKysgYi9t bS9odWdldGxiLmMKQEAgLTMyNTQsOSArMzI1NCwxNCBAQCBpbnQgY29weV9odWdldGxiX3BhZ2Vf cmFuZ2Uoc3RydWN0IG1tX3N0cnVjdCAqZHN0LCBzdHJ1Y3QgbW1fc3RydWN0ICpzcmMsCiAJCQlz ZXRfaHVnZV9zd2FwX3B0ZV9hdChkc3QsIGFkZHIsIGRzdF9wdGUsIGVudHJ5LCBzeik7CiAJCX0g ZWxzZSB7CiAJCQlpZiAoY293KSB7CisJCQkJLyoKKwkJCQkgKiBObyBuZWVkIHRvIG5vdGlmeSBh cyB3ZSBhcmUgZG93bmdyYWRpbmcgcGFnZQorCQkJCSAqIHRhYmxlIHByb3RlY3Rpb24gbm90IGNo YW5naW5nIGl0IHRvIHBvaW50CisJCQkJICogdG8gYSBuZXcgcGFnZS4KKwkJCQkgKgorCQkJCSAq IFNlZSBEb2N1bWVudGF0aW9uL3ZtL21tdV9ub3RpZmllci50eHQKKwkJCQkgKi8KIAkJCQlodWdl X3B0ZXBfc2V0X3dycHJvdGVjdChzcmMsIGFkZHIsIHNyY19wdGUpOwotCQkJCW1tdV9ub3RpZmll cl9pbnZhbGlkYXRlX3JhbmdlKHNyYywgbW11bl9zdGFydCwKLQkJCQkJCQkJICAgbW11bl9lbmQp OwogCQkJfQogCQkJZW50cnkgPSBodWdlX3B0ZXBfZ2V0KHNyY19wdGUpOwogCQkJcHRlcGFnZSA9 IHB0ZV9wYWdlKGVudHJ5KTsKQEAgLTQyODgsNyArNDI5MywxMiBAQCB1bnNpZ25lZCBsb25nIGh1 Z2V0bGJfY2hhbmdlX3Byb3RlY3Rpb24oc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEsCiAJICog YW5kIHRoYXQgcGFnZSB0YWJsZSBiZSByZXVzZWQgYW5kIGZpbGxlZCB3aXRoIGp1bmsuCiAJICov CiAJZmx1c2hfaHVnZXRsYl90bGJfcmFuZ2Uodm1hLCBzdGFydCwgZW5kKTsKLQltbXVfbm90aWZp ZXJfaW52YWxpZGF0ZV9yYW5nZShtbSwgc3RhcnQsIGVuZCk7CisJLyoKKwkgKiBObyBuZWVkIHRv IGNhbGwgbW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2UoKSB3ZSBhcmUgZG93bmdyYWRpbmcK KwkgKiBwYWdlIHRhYmxlIHByb3RlY3Rpb24gbm90IGNoYW5naW5nIGl0IHRvIHBvaW50IHRvIGEg bmV3IHBhZ2UuCisJICoKKwkgKiBTZWUgRG9jdW1lbnRhdGlvbi92bS9tbXVfbm90aWZpZXIudHh0 CisJICovCiAJaV9tbWFwX3VubG9ja193cml0ZSh2bWEtPnZtX2ZpbGUtPmZfbWFwcGluZyk7CiAJ bW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2VfZW5kKG1tLCBzdGFydCwgZW5kKTsKIApkaWZm 
IC0tZ2l0IGEvbW0va3NtLmMgYi9tbS9rc20uYwppbmRleCA2Y2I2MGY0NmNjZTUuLmJlOGY0NTc2 Zjg0MiAxMDA2NDQKLS0tIGEvbW0va3NtLmMKKysrIGIvbW0va3NtLmMKQEAgLTEwNTIsOCArMTA1 MiwxMyBAQCBzdGF0aWMgaW50IHdyaXRlX3Byb3RlY3RfcGFnZShzdHJ1Y3Qgdm1fYXJlYV9zdHJ1 Y3QgKnZtYSwgc3RydWN0IHBhZ2UgKnBhZ2UsCiAJCSAqIFNvIHdlIGNsZWFyIHRoZSBwdGUgYW5k IGZsdXNoIHRoZSB0bGIgYmVmb3JlIHRoZSBjaGVjawogCQkgKiB0aGlzIGFzc3VyZSB1cyB0aGF0 IG5vIE9fRElSRUNUIGNhbiBoYXBwZW4gYWZ0ZXIgdGhlIGNoZWNrCiAJCSAqIG9yIGluIHRoZSBt aWRkbGUgb2YgdGhlIGNoZWNrLgorCQkgKgorCQkgKiBObyBuZWVkIHRvIG5vdGlmeSBhcyB3ZSBh cmUgZG93bmdyYWRpbmcgcGFnZSB0YWJsZSB0byByZWFkCisJCSAqIG9ubHkgbm90IGNoYW5naW5n IGl0IHRvIHBvaW50IHRvIGEgbmV3IHBhZ2UuCisJCSAqCisJCSAqIFNlZSBEb2N1bWVudGF0aW9u L3ZtL21tdV9ub3RpZmllci50eHQKIAkJICovCi0JCWVudHJ5ID0gcHRlcF9jbGVhcl9mbHVzaF9u b3RpZnkodm1hLCBwdm13LmFkZHJlc3MsIHB2bXcucHRlKTsKKwkJZW50cnkgPSBwdGVwX2NsZWFy X2ZsdXNoKHZtYSwgcHZtdy5hZGRyZXNzLCBwdm13LnB0ZSk7CiAJCS8qCiAJCSAqIENoZWNrIHRo YXQgbm8gT19ESVJFQ1Qgb3Igc2ltaWxhciBJL08gaXMgaW4gcHJvZ3Jlc3Mgb24gdGhlCiAJCSAq IHBhZ2UKQEAgLTExMzYsNyArMTE0MSwxMyBAQCBzdGF0aWMgaW50IHJlcGxhY2VfcGFnZShzdHJ1 Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSwgc3RydWN0IHBhZ2UgKnBhZ2UsCiAJfQogCiAJZmx1c2hf Y2FjaGVfcGFnZSh2bWEsIGFkZHIsIHB0ZV9wZm4oKnB0ZXApKTsKLQlwdGVwX2NsZWFyX2ZsdXNo X25vdGlmeSh2bWEsIGFkZHIsIHB0ZXApOworCS8qCisJICogTm8gbmVlZCB0byBub3RpZnkgYXMg d2UgYXJlIHJlcGxhY2luZyBhIHJlYWQgb25seSBwYWdlIHdpdGggYW5vdGhlcgorCSAqIHJlYWQg b25seSBwYWdlIHdpdGggdGhlIHNhbWUgY29udGVudC4KKwkgKgorCSAqIFNlZSBEb2N1bWVudGF0 aW9uL3ZtL21tdV9ub3RpZmllci50eHQKKwkgKi8KKwlwdGVwX2NsZWFyX2ZsdXNoKHZtYSwgYWRk ciwgcHRlcCk7CiAJc2V0X3B0ZV9hdF9ub3RpZnkobW0sIGFkZHIsIHB0ZXAsIG5ld3B0ZSk7CiAK IAlwYWdlX3JlbW92ZV9ybWFwKHBhZ2UsIGZhbHNlKTsKZGlmZiAtLWdpdCBhL21tL3JtYXAuYyBi L21tL3JtYXAuYwppbmRleCAwNjE4MjYyNzg1MjAuLjZiNWEwZjIxOWFjMCAxMDA2NDQKLS0tIGEv bW0vcm1hcC5jCisrKyBiL21tL3JtYXAuYwpAQCAtOTM3LDEwICs5MzcsMTUgQEAgc3RhdGljIGJv b2wgcGFnZV9ta2NsZWFuX29uZShzdHJ1Y3QgcGFnZSAqcGFnZSwgc3RydWN0IHZtX2FyZWFfc3Ry 
dWN0ICp2bWEsCiAjZW5kaWYKIAkJfQogCi0JCWlmIChyZXQpIHsKLQkJCW1tdV9ub3RpZmllcl9p bnZhbGlkYXRlX3JhbmdlKHZtYS0+dm1fbW0sIGNzdGFydCwgY2VuZCk7CisJCS8qCisJCSAqIE5v IG5lZWQgdG8gY2FsbCBtbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZSgpIGFzIHdlIGFyZQor CQkgKiBkb3duZ3JhZGluZyBwYWdlIHRhYmxlIHByb3RlY3Rpb24gbm90IGNoYW5naW5nIGl0IHRv IHBvaW50CisJCSAqIHRvIGEgbmV3IHBhZ2UuCisJCSAqCisJCSAqIFNlZSBEb2N1bWVudGF0aW9u L3ZtL21tdV9ub3RpZmllci50eHQKKwkJICovCisJCWlmIChyZXQpCiAJCQkoKmNsZWFuZWQpKys7 Ci0JCX0KIAl9CiAKIAltbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZV9lbmQodm1hLT52bV9t bSwgc3RhcnQsIGVuZCk7CkBAIC0xNDI0LDYgKzE0MjksMTAgQEAgc3RhdGljIGJvb2wgdHJ5X3Rv X3VubWFwX29uZShzdHJ1Y3QgcGFnZSAqcGFnZSwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEs CiAJCQlpZiAocHRlX3NvZnRfZGlydHkocHRldmFsKSkKIAkJCQlzd3BfcHRlID0gcHRlX3N3cF9t a3NvZnRfZGlydHkoc3dwX3B0ZSk7CiAJCQlzZXRfcHRlX2F0KG1tLCBwdm13LmFkZHJlc3MsIHB2 bXcucHRlLCBzd3BfcHRlKTsKKwkJCS8qCisJCQkgKiBObyBuZWVkIHRvIGludmFsaWRhdGUgaGVy ZSBpdCB3aWxsIHN5bmNocm9uaXplIG9uCisJCQkgKiBhZ2FpbnN0IHRoZSBzcGVjaWFsIHN3YXAg bWlncmF0aW9uIHB0ZS4KKwkJCSAqLwogCQkJZ290byBkaXNjYXJkOwogCQl9CiAKQEAgLTE0ODEs NiArMTQ5MCw5IEBAIHN0YXRpYyBib29sIHRyeV90b191bm1hcF9vbmUoc3RydWN0IHBhZ2UgKnBh Z2UsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hLAogCQkJICogd2lsbCB0YWtlIGNhcmUgb2Yg dGhlIHJlc3QuCiAJCQkgKi8KIAkJCWRlY19tbV9jb3VudGVyKG1tLCBtbV9jb3VudGVyKHBhZ2Up KTsKKwkJCS8qIFdlIGhhdmUgdG8gaW52YWxpZGF0ZSBhcyB3ZSBjbGVhcmVkIHRoZSBwdGUgKi8K KwkJCW1tdV9ub3RpZmllcl9pbnZhbGlkYXRlX3JhbmdlKG1tLCBhZGRyZXNzLAorCQkJCQkJICAg ICAgYWRkcmVzcyArIFBBR0VfU0laRSk7CiAJCX0gZWxzZSBpZiAoSVNfRU5BQkxFRChDT05GSUdf TUlHUkFUSU9OKSAmJgogCQkJCShmbGFncyAmIChUVFVfTUlHUkFUSU9OfFRUVV9TUExJVF9GUkVF WkUpKSkgewogCQkJc3dwX2VudHJ5X3QgZW50cnk7CkBAIC0xNDk2LDYgKzE1MDgsMTAgQEAgc3Rh dGljIGJvb2wgdHJ5X3RvX3VubWFwX29uZShzdHJ1Y3QgcGFnZSAqcGFnZSwgc3RydWN0IHZtX2Fy ZWFfc3RydWN0ICp2bWEsCiAJCQlpZiAocHRlX3NvZnRfZGlydHkocHRldmFsKSkKIAkJCQlzd3Bf cHRlID0gcHRlX3N3cF9ta3NvZnRfZGlydHkoc3dwX3B0ZSk7CiAJCQlzZXRfcHRlX2F0KG1tLCBh 
ZGRyZXNzLCBwdm13LnB0ZSwgc3dwX3B0ZSk7CisJCQkvKgorCQkJICogTm8gbmVlZCB0byBpbnZh bGlkYXRlIGhlcmUgaXQgd2lsbCBzeW5jaHJvbml6ZSBvbgorCQkJICogYWdhaW5zdCB0aGUgc3Bl Y2lhbCBzd2FwIG1pZ3JhdGlvbiBwdGUuCisJCQkgKi8KIAkJfSBlbHNlIGlmIChQYWdlQW5vbihw YWdlKSkgewogCQkJc3dwX2VudHJ5X3QgZW50cnkgPSB7IC52YWwgPSBwYWdlX3ByaXZhdGUoc3Vi cGFnZSkgfTsKIAkJCXB0ZV90IHN3cF9wdGU7CkBAIC0xNTA3LDYgKzE1MjMsOCBAQCBzdGF0aWMg Ym9vbCB0cnlfdG9fdW5tYXBfb25lKHN0cnVjdCBwYWdlICpwYWdlLCBzdHJ1Y3Qgdm1fYXJlYV9z dHJ1Y3QgKnZtYSwKIAkJCQlXQVJOX09OX09OQ0UoMSk7CiAJCQkJcmV0ID0gZmFsc2U7CiAJCQkJ LyogV2UgaGF2ZSB0byBpbnZhbGlkYXRlIGFzIHdlIGNsZWFyZWQgdGhlIHB0ZSAqLworCQkJCW1t dV9ub3RpZmllcl9pbnZhbGlkYXRlX3JhbmdlKG1tLCBhZGRyZXNzLAorCQkJCQkJCWFkZHJlc3Mg KyBQQUdFX1NJWkUpOwogCQkJCXBhZ2Vfdm1hX21hcHBlZF93YWxrX2RvbmUoJnB2bXcpOwogCQkJ CWJyZWFrOwogCQkJfQpAQCAtMTUxNCw2ICsxNTMyLDkgQEAgc3RhdGljIGJvb2wgdHJ5X3RvX3Vu bWFwX29uZShzdHJ1Y3QgcGFnZSAqcGFnZSwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEsCiAJ CQkvKiBNQURWX0ZSRUUgcGFnZSBjaGVjayAqLwogCQkJaWYgKCFQYWdlU3dhcEJhY2tlZChwYWdl KSkgewogCQkJCWlmICghUGFnZURpcnR5KHBhZ2UpKSB7CisJCQkJCS8qIEludmFsaWRhdGUgYXMg d2UgY2xlYXJlZCB0aGUgcHRlICovCisJCQkJCW1tdV9ub3RpZmllcl9pbnZhbGlkYXRlX3Jhbmdl KG1tLAorCQkJCQkJYWRkcmVzcywgYWRkcmVzcyArIFBBR0VfU0laRSk7CiAJCQkJCWRlY19tbV9j b3VudGVyKG1tLCBNTV9BTk9OUEFHRVMpOwogCQkJCQlnb3RvIGRpc2NhcmQ7CiAJCQkJfQpAQCAt MTU0NywxMyArMTU2OCwzOSBAQCBzdGF0aWMgYm9vbCB0cnlfdG9fdW5tYXBfb25lKHN0cnVjdCBw YWdlICpwYWdlLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSwKIAkJCWlmIChwdGVfc29mdF9k aXJ0eShwdGV2YWwpKQogCQkJCXN3cF9wdGUgPSBwdGVfc3dwX21rc29mdF9kaXJ0eShzd3BfcHRl KTsKIAkJCXNldF9wdGVfYXQobW0sIGFkZHJlc3MsIHB2bXcucHRlLCBzd3BfcHRlKTsKLQkJfSBl bHNlCisJCQkvKiBJbnZhbGlkYXRlIGFzIHdlIGNsZWFyZWQgdGhlIHB0ZSAqLworCQkJbW11X25v dGlmaWVyX2ludmFsaWRhdGVfcmFuZ2UobW0sIGFkZHJlc3MsCisJCQkJCQkgICAgICBhZGRyZXNz ICsgUEFHRV9TSVpFKTsKKwkJfSBlbHNlIHsKKwkJCS8qCisJCQkgKiBXZSBzaG91bGQgbm90IG5l ZWQgdG8gbm90aWZ5IGhlcmUgYXMgd2UgcmVhY2ggdGhpcworCQkJICogY2FzZSBvbmx5IGZyb20g 
ZnJlZXplX3BhZ2UoKSBpdHNlbGYgb25seSBjYWxsIGZyb20KKwkJCSAqIHNwbGl0X2h1Z2VfcGFn ZV90b19saXN0KCkgc28gZXZlcnl0aGluZyBiZWxvdyBtdXN0CisJCQkgKiBiZSB0cnVlOgorCQkJ ICogICAtIHBhZ2UgaXMgbm90IGFub255bW91cworCQkJICogICAtIHBhZ2UgaXMgbG9ja2VkCisJ CQkgKgorCQkJICogU28gYXMgaXQgaXMgYSBsb2NrZWQgZmlsZSBiYWNrIHBhZ2UgdGh1cyBpdCBj YW4gbm90CisJCQkgKiBiZSByZW1vdmUgZnJvbSB0aGUgcGFnZSBjYWNoZSBhbmQgcmVwbGFjZSBi eSBhIG5ldworCQkJICogcGFnZSBiZWZvcmUgbW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2Vf ZW5kIHNvIG5vCisJCQkgKiBjb25jdXJyZW50IHRocmVhZCBtaWdodCB1cGRhdGUgaXRzIHBhZ2Ug dGFibGUgdG8KKwkJCSAqIHBvaW50IGF0IG5ldyBwYWdlIHdoaWxlIGEgZGV2aWNlIHN0aWxsIGlz IHVzaW5nIHRoaXMKKwkJCSAqIHBhZ2UuCisJCQkgKgorCQkJICogU2VlIERvY3VtZW50YXRpb24v dm0vbW11X25vdGlmaWVyLnR4dAorCQkJICovCiAJCQlkZWNfbW1fY291bnRlcihtbSwgbW1fY291 bnRlcl9maWxlKHBhZ2UpKTsKKwkJfQogZGlzY2FyZDoKKwkJLyoKKwkJICogTm8gbmVlZCB0byBj YWxsIG1tdV9ub3RpZmllcl9pbnZhbGlkYXRlX3JhbmdlKCkgaXQgaGFzIGJlCisJCSAqIGRvbmUg YWJvdmUgZm9yIGFsbCBjYXNlcyByZXF1aXJpbmcgaXQgdG8gaGFwcGVuIHVuZGVyIHBhZ2UKKwkJ ICogdGFibGUgbG9jayBiZWZvcmUgbW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2VfZW5kKCkK KwkJICoKKwkJICogU2VlIERvY3VtZW50YXRpb24vdm0vbW11X25vdGlmaWVyLnR4dAorCQkgKi8K IAkJcGFnZV9yZW1vdmVfcm1hcChzdWJwYWdlLCBQYWdlSHVnZShwYWdlKSk7CiAJCXB1dF9wYWdl KHBhZ2UpOwotCQltbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZShtbSwgYWRkcmVzcywKLQkJ CQkJICAgICAgYWRkcmVzcyArIFBBR0VfU0laRSk7CiAJfQogCiAJbW11X25vdGlmaWVyX2ludmFs aWRhdGVfcmFuZ2VfZW5kKHZtYS0+dm1fbW0sIHN0YXJ0LCBlbmQpOwotLSAKMi4xMy42CgpfX19f X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fXwppb21tdSBtYWlsaW5n IGxpc3QKaW9tbXVAbGlzdHMubGludXgtZm91bmRhdGlvbi5vcmcKaHR0cHM6Ly9saXN0cy5saW51 eGZvdW5kYXRpb24ub3JnL21haWxtYW4vbGlzdGluZm8vaW9tbXU= From mboxrd@z Thu Jan 1 00:00:00 1970 From: jglisse@redhat.com Subject: [PATCH 2/2] mm/mmu_notifier: avoid call to invalidate_range() in range_end() Date: Mon, 16 Oct 2017 23:10:03 -0400 Message-ID: <20171017031003.7481-3-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> Mime-Version: 1.0 Content-Type: 
text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Return-path:
In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com>
Sender: owner-linux-mm@kvack.org
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
 Andrea Arcangeli, Andrew Morton, Joerg Roedel, Suravee Suthikulpanit,
 David Woodhouse, Alistair Popple, Michael Ellerman,
 Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan,
 iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org
List-Id: iommu@lists.linux-foundation.org

From: Jérôme Glisse

This is an optimization patch that only affects mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end().

Existing pattern (before this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_end()
        mmu_notifier_invalidate_range()

New pattern (after this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_only_end()

We call the invalidate_range callback right after clearing the page table
entry, under the page table lock, and we skip the call to invalidate_range
inside the __mmu_notifier_invalidate_range_end() function.

Idea from Andrea Arcangeli Signed-off-by: Jérôme Glisse Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Joerg Roedel Cc: Suravee Suthikulpanit Cc: David Woodhouse Cc: Alistair Popple Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Stephen Rothwell Cc: Andrew Donnellan Cc: iommu@lists.linux-foundation.org Cc: linuxppc-dev@lists.ozlabs.org --- include/linux/mmu_notifier.h | 17 ++++++++++++++-- mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++++++++---- mm/memory.c | 6 +++++- mm/migrate.c | 15 ++++++++++++--- mm/mmu_notifier.c | 11 +++++++++-- 5 files changed, 83 insertions(+), 12 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 49c925c96b8a..6665c4624287 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -213,7 +213,8 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm, extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end); extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end); + unsigned long start, unsigned long end, + bool only_end); extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); @@ -267,7 +268,14 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, unsigned long start, unsigned long end) { if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end); + __mmu_notifier_invalidate_range_end(mm, start, end, false); +} + +static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end, true); } static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, @@ -438,6 +446,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, { } +static inline void mmu_notifier_invalidate_range_only_end(struct 
mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ff5bc647b51d..b2912305994f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1220,7 +1220,12 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, page_remove_rmap(page, true); spin_unlock(vmf->ptl); - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, + mmun_end); ret |= VM_FAULT_WRITE; put_page(page); @@ -1369,7 +1374,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) } spin_unlock(vmf->ptl); out_mn: - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, + mmun_end); out: return ret; out_unlock: @@ -2021,7 +2031,12 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, out: spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PUD_SIZE); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pudp_huge_clear_flush_notify() did already call it. 
+ */ + mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + + HPAGE_PUD_SIZE); } #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ @@ -2096,6 +2111,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR); return; } else if (is_huge_zero_pmd(*pmd)) { + /* + * FIXME: Do we want to invalidate secondary mmu by calling + * mmu_notifier_invalidate_range() see comments below inside + * __split_huge_pmd() ? + * + * We are going from a zero huge page write protected to zero + * small page also write protected so it does not seems useful + * to invalidate secondary mmu at this time. + */ return __split_huge_zero_page_pmd(vma, haddr, pmd); } @@ -2231,7 +2255,21 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, __split_huge_pmd_locked(vma, pmd, haddr, freeze); out: spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE); + /* + * No need to double call mmu_notifier->invalidate_range() callback. + * They are 3 cases to consider inside __split_huge_pmd_locked(): + * 1) pmdp_huge_clear_flush_notify() call invalidate_range() obvious + * 2) __split_huge_zero_page_pmd() read only zero page and any write + * fault will trigger a flush_notify before pointing to a new page + * (it is fine if the secondary mmu keeps pointing to the old zero + * page in the meantime) + * 3) Split a huge pmd into pte pointing to the same page. No need + * to invalidate secondary tlb entry they are all still valid. + * any further changes to individual pte will notify. 
So no need + * to call mmu_notifier->invalidate_range() + */ + mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + + HPAGE_PMD_SIZE); } void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index 47cdf4e85c2d..8a0c410037d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2555,7 +2555,11 @@ static int wp_page_copy(struct vm_fault *vmf) put_page(new_page); pte_unmap_unlock(vmf->pte, vmf->ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above ptep_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); if (old_page) { /* * Don't let another task, with possibly unlocked vma, diff --git a/mm/migrate.c b/mm/migrate.c index e00814ca390e..2f0f8190cb6f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2088,7 +2088,11 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED); spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); /* Take an "isolate" reference and put new page on the LRU. */ get_page(new_page); @@ -2804,9 +2808,14 @@ static void migrate_vma_pages(struct migrate_vma *migrate) migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; } + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above ptep_clear_flush_notify() inside migrate_vma_insert_page() + * did already call it. 
+ */ if (notified) - mmu_notifier_invalidate_range_end(mm, mmu_start, - migrate->end); + mmu_notifier_invalidate_range_only_end(mm, mmu_start, + migrate->end); } /* diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 314285284e6e..96edb33fd09a 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -190,7 +190,9 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start); void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, + unsigned long end, + bool only_end) { struct mmu_notifier *mn; int id; @@ -204,8 +206,13 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, * subsystem registers either invalidate_range_start()/end() or * invalidate_range(), so this will be no additional overhead * (besides the pointer check). + * + * We skip call to invalidate_range() if we know it is safe ie + * call site use mmu_notifier_invalidate_range_only_end() which + * is safe to do when we know that a call to invalidate_range() + * already happen under page table lock. */ - if (mn->ops->invalidate_range) + if (!only_end && mn->ops->invalidate_range) mn->ops->invalidate_range(mn, mm, start, end); if (mn->ops->invalidate_range_end) mn->ops->invalidate_range_end(mn, mm, start, end); -- 2.13.6
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Balbir Singh Subject: Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback Date: Thu, 19 Oct 2017 13:43:19 +1100 Message-ID: <20171019134319.1b856091@MiWiFi-R3-srv> References: <20171017031003.7481-1-jglisse@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com> Sender: owner-linux-mm@kvack.org To: jglisse@redhat.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org List-Id: iommu@lists.linux-foundation.org

On Mon, 16 Oct 2017 23:10:01 -0400 jglisse@redhat.com wrote:

> From: Jérôme Glisse > > (Andrew you already have v1 in your queue of patch 1, patch 2 is new, > i think you can drop it patch 1 v1 for v2, v2 is bit more conservative > and i fixed typos) > > All this only affect user of invalidate_range callback (at this time > CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in > drivers/iommu/amd_iommu_v2.c|intel-svm.c) > > This patchset remove useless double call to mmu_notifier->invalidate_range > callback wherever it is safe to do so. The first patch just remove useless > call

As in an extra call? Where does that come from?

> and add documentation explaining why it is safe to do so. The second > patch go further by introducing mmu_notifier_invalidate_range_only_end() > which skip callback to invalidate_range this can be done when clearing a > pte, pmd or pud with notification which call invalidate_range right after > clearing under the page table lock. >

Balbir Singh.

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Balbir Singh Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Thu, 19 Oct 2017 14:04:26 +1100 Message-ID: <20171019140426.21f51957@MiWiFi-R3-srv> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20171017031003.7481-2-jglisse@redhat.com> Sender: linux-next-owner@vger.kernel.org To: jglisse@redhat.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org List-Id: iommu@lists.linux-foundation.org

On Mon, 16 Oct 2017 23:10:02 -0400 jglisse@redhat.com wrote:

> From: Jérôme Glisse > > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > if (pmdp) { > #ifdef CONFIG_FS_DAX_PMD > pmd_t pmd; > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pmd = pmd_wrprotect(pmd); > pmd = pmd_mkclean(pmd); > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);

Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> unlock_pmd: > spin_unlock(ptl); > #endif > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pte = pte_wrprotect(pte); > pte = pte_mkclean(pte); > set_pte_at(vma->vm_mm, address, ptep, pte); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);

Ditto

> unlock_pte: > pte_unmap_unlock(ptep, ptl); > } > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 6866e8126982..49c925c96b8a 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > * shared page-tables, it not necessary to implement the > * invalidate_range_start()/end() notifiers, as > * invalidate_range() alread catches the points in time when an > - * external TLB range needs to be flushed. > + * external TLB range needs to be flushed. For more in depth > + * discussion on this see Documentation/vm/mmu_notifier.txt > * > * The invalidate_range() function is called under the ptl > * spin-lock and not allowed to sleep. > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index c037d3d34950..ff5bc647b51d 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > goto out_free_pages; > VM_BUG_ON_PAGE(!PageHead(page), page); > > + /* > + * Leave pmd empty until pte is filled note we must notify here as > + * concurrent CPU thread might write to new page before the call to > + * mmu_notifier_invalidate_range_end() happens which can lead to a > + * device seeing memory write in different order than CPU.
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > - /* leave pmd empty until pte is filled */ > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > pmd_populate(vma->vm_mm, &_pmd, pgtable); > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > pmd_t _pmd; > int i; > > - /* leave pmd empty until pte is filled */ > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > + /* > + * Leave pmd empty until pte is filled note that it is fine to delay > + * notification until mmu_notifier_invalidate_range_end() as we are > + * replacing a zero pmd write protected page with a zero pte write > + * protected page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + pmdp_huge_clear_flush(vma, haddr, pmd);

Shouldn't the secondary TLB know if the page size changed?

> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > pmd_populate(mm, &_pmd, pgtable); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 1768efa4c501..63a63f1b536c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > } else { > if (cow) { > + /* > + * No need to notify as we are downgrading page > + * table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > huge_ptep_set_wrprotect(src, addr, src_pte);

OK.. so we could get write faults on write accesses from the device.

> - mmu_notifier_invalidate_range(src, mmun_start, > - mmun_end); > } > entry = huge_ptep_get(src_pte); > ptepage = pte_page(entry); > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > * and that page table be reused and filled with junk.
> */ > flush_hugetlb_tlb_range(vma, start, end); > - mmu_notifier_invalidate_range(mm, start, end); > + /* > + * No need to call mmu_notifier_invalidate_range() we are downgrading > + * page table protection not changing it to point to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > i_mmap_unlock_write(vma->vm_file->f_mapping); > mmu_notifier_invalidate_range_end(mm, start, end); > > diff --git a/mm/ksm.c b/mm/ksm.c > index 6cb60f46cce5..be8f4576f842 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > * So we clear the pte and flush the tlb before the check > * this assure us that no O_DIRECT can happen after the check > * or in the middle of the check. > + * > + * No need to notify as we are downgrading page table to read > + * only not changing it to point to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > */ > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > /* > * Check that no O_DIRECT or similar I/O is in progress on the > * page > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > } > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > - ptep_clear_flush_notify(vma, addr, ptep); > + /* > + * No need to notify as we are replacing a read only page with another > + * read only page with the same content.
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + ptep_clear_flush(vma, addr, ptep); > set_pte_at_notify(mm, addr, ptep, newpte); > > page_remove_rmap(page, false); > diff --git a/mm/rmap.c b/mm/rmap.c > index 061826278520..6b5a0f219ac0 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > #endif > } > > - if (ret) { > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + if (ret) > (*cleaned)++; > - } > } > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte. > + */ > goto discard; > } > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > * will take care of the rest. > */ > dec_mm_counter(mm, mm_counter(page)); > + /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > } else if (IS_ENABLED(CONFIG_MIGRATION) && > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > swp_entry_t entry; > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte.
> + */ > } else if (PageAnon(page)) { > swp_entry_t entry = { .val = page_private(subpage) }; > pte_t swp_pte; > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > WARN_ON_ONCE(1); > ret = false; > /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > page_vma_mapped_walk_done(&pvmw); > break; > } > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > if (!PageDirty(page)) { > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, > + address, address + PAGE_SIZE); > dec_mm_counter(mm, MM_ANONPAGES); > goto discard; > } > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > - } else > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > + } else { > + /* > + * We should not need to notify here as we reach this > + * case only from freeze_page() itself only call from > + * split_huge_page_to_list() so everything below must > + * be true: > + * - page is not anonymous > + * - page is locked > + * > + * So as it is a locked file back page thus it can not > + * be remove from the page cache and replace by a new > + * page before mmu_notifier_invalidate_range_end so no > + * concurrent thread might update its page table to > + * point at new page while a device still is using this > + * page.
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > dec_mm_counter(mm, mm_counter_file(page)); > + } > discard: > + /* > + * No need to call mmu_notifier_invalidate_range() it has be > + * done above for all cases requiring it to happen under page > + * table lock before mmu_notifier_invalidate_range_end() > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > page_remove_rmap(subpage, PageHuge(page)); > put_page(page); > - mmu_notifier_invalidate_range(mm, address, > - address + PAGE_SIZE); > } > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);

Looking at the patchset, I understand the efficiency, but I am concerned with correctness.

Balbir Singh.

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jerome Glisse Subject: Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback Date: Wed, 18 Oct 2017 23:08:12 -0400 Message-ID: <20171019030812.GB5246@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171019134319.1b856091@MiWiFi-R3-srv> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Return-path: Content-Disposition: inline In-Reply-To: <20171019134319.1b856091@MiWiFi-R3-srv> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org Errors-To: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org To: Balbir Singh Cc: Andrea Arcangeli , Stephen Rothwell , Joerg Roedel , Benjamin Herrenschmidt , Andrew Donnellan , Alistair Popple , linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Michael Ellerman , Andrew Morton , David Woodhouse List-Id: iommu@lists.linux-foundation.org

On Thu, Oct 19, 2017 at 01:43:19PM +1100, Balbir Singh wrote:

> On Mon, 16 Oct 2017 23:10:01 -0400 >
jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > > > From: Jérôme Glisse > > > > (Andrew you already have v1 in your queue of patch 1, patch 2 is new, > > i think you can drop it patch 1 v1 for v2, v2 is bit more conservative > > and i fixed typos) > > > > All this only affect user of invalidate_range callback (at this time > > CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in > > drivers/iommu/amd_iommu_v2.c|intel-svm.c) > > > > This patchset remove useless double call to mmu_notifier->invalidate_range > > callback wherever it is safe to do so. The first patch just remove useless > > call > > As in an extra call? Where does that come from?

Before this patch you had the following pattern:

    mmu_notifier_invalidate_range_start();
    take_page_table_lock()
    ...
    update_page_table()
    mmu_notifier_invalidate_range()
    ...
    drop_page_table_lock()
    mmu_notifier_invalidate_range_end();

It happens that mmu_notifier_invalidate_range_end() also makes an unconditional call to mmu_notifier_invalidate_range(), so in the above scenario you had 2 calls to mmu_notifier_invalidate_range().

Obviously one of the 2 calls is useless. In some cases you can drop the first call (under the page table lock); this is what patch 1 does. In other cases you can drop the second call, the one that happens inside mmu_notifier_invalidate_range_end(); that is what patch 2 does.

Hence why I am referring to a useless double call. I have added more documentation to explain all this in the code and also under Documentation/vm/mmu_notifier.txt

> > > and add documentation explaining why it is safe to do so. The second > > patch go further by introducing mmu_notifier_invalidate_range_only_end() > > which skip callback to invalidate_range this can be done when clearing a > > pte, pmd or pud with notification which call invalidate_range right after > > clearing under the page table lock. > > > > Balbir Singh.
>

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Wed, 18 Oct 2017 23:28:12 -0400 Message-ID: <20171019032811.GC5246@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Return-path: Content-Disposition: inline In-Reply-To: <20171019140426.21f51957@MiWiFi-R3-srv> Sender: owner-linux-mm@kvack.org To: Balbir Singh Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org List-Id: iommu@lists.linux-foundation.org

On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:

> On Mon, 16 Oct 2017 23:10:02 -0400 > jglisse@redhat.com wrote: > > > From: Jérôme Glisse > > > > + /* > > + * No need to call mmu_notifier_invalidate_range() as we are > > + * downgrading page table protection not changing it to point > > + * to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > if (pmdp) { > > #ifdef CONFIG_FS_DAX_PMD > > pmd_t pmd; > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > pmd = pmd_wrprotect(pmd); > > pmd = pmd_mkclean(pmd); > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
I am assuming the hardware does the sane thing and sets the dirty bit only when walking the CPU page table on a device write fault, i.e. once the device gets a write TLB entry, the dirty bit has already been set by the IOMMU while walking the page table, before the lookup result was returned to the device, and it won't be set again later (i.e. propagated back later).

I should probably have spelled that out, and maybe some of the ATS/PASID implementers did not do that.

> > > unlock_pmd: > > spin_unlock(ptl); > > #endif > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > pte = pte_wrprotect(pte); > > pte = pte_mkclean(pte); > > set_pte_at(vma->vm_mm, address, ptep, pte); > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > Ditto > > > unlock_pte: > > pte_unmap_unlock(ptep, ptl); > > } > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index 6866e8126982..49c925c96b8a 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > * shared page-tables, it not necessary to implement the > > * invalidate_range_start()/end() notifiers, as > > * invalidate_range() alread catches the points in time when an > > - * external TLB range needs to be flushed. > > + * external TLB range needs to be flushed. For more in depth > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > * > > * The invalidate_range() function is called under the ptl > > * spin-lock and not allowed to sleep.
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index c037d3d34950..ff5bc647b51d 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > goto out_free_pages; > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > + /* > > + * Leave pmd empty until pte is filled note we must notify here as > > + * concurrent CPU thread might write to new page before the call to > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > + * device seeing memory write in different order than CPU. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > - /* leave pmd empty until pte is filled */ > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > pmd_t _pmd; > > int i; > > > > - /* leave pmd empty until pte is filled */ > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > + /* > > + * Leave pmd empty until pte is filled note that it is fine to delay > > + * notification until mmu_notifier_invalidate_range_end() as we are > > + * replacing a zero pmd write protected page with a zero pte write > > + * protected page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > Shouldn't the secondary TLB know if the page size changed?

It should not matter: we are talking about virtual-to-physical translation on behalf of a device against a process address space, so the hardware should not care about the page size.

Moreover, if any of the 512 new zero 4K pages (assuming 2MB huge pages and 4K pages) is replaced by something new, a device TLB shootdown will happen before the new page is set.
The only issue I can think of is if the IOMMU TLB (if there is one) or the device TLB (you do expect that there is one) does not invalidate a TLB entry when the TLB shootdown is smaller than that entry. That would be idiotic, but yes, I know, hardware bugs happen.

> > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > pmd_populate(mm, &_pmd, pgtable); > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 1768efa4c501..63a63f1b536c 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > } else { > > if (cow) { > > + /* > > + * No need to notify as we are downgrading page > > + * table protection not changing it to point > > + * to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > huge_ptep_set_wrprotect(src, addr, src_pte); > > OK.. so we could get write faults on write accesses from the device. > > > - mmu_notifier_invalidate_range(src, mmun_start, > > - mmun_end); > > } > > entry = huge_ptep_get(src_pte); > > ptepage = pte_page(entry); > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > * and that page table be reused and filled with junk. > > */ > > flush_hugetlb_tlb_range(vma, start, end); > > - mmu_notifier_invalidate_range(mm, start, end); > > + /* > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > + * page table protection not changing it to point to a new page.
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > index 6cb60f46cce5..be8f4576f842 100644 > > --- a/mm/ksm.c > > +++ b/mm/ksm.c > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > * So we clear the pte and flush the tlb before the check > > * this assure us that no O_DIRECT can happen after the check > > * or in the middle of the check. > > + * > > + * No need to notify as we are downgrading page table to read > > + * only not changing it to point to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > */ > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > /* > > * Check that no O_DIRECT or similar I/O is in progress on the > > * page > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > } > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > - ptep_clear_flush_notify(vma, addr, ptep); > > + /* > > + * No need to notify as we are replacing a read only page with another > > + * read only page with the same content. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + ptep_clear_flush(vma, addr, ptep); > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > page_remove_rmap(page, false); > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 061826278520..6b5a0f219ac0 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > #endif > > } > > > > - if (ret) { > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > + /* > > + * No need to call mmu_notifier_invalidate_range() as we are > > + * downgrading page table protection not changing it to point > > + * to a new page. 
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + if (ret) > > (*cleaned)++; > > - } > > } > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > + /* > > + * No need to invalidate here it will synchronize on > > + * against the special swap migration pte. > > + */ > > goto discard; > > } > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > * will take care of the rest. > > */ > > dec_mm_counter(mm, mm_counter(page)); > > + /* We have to invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > swp_entry_t entry; > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > + /* > > + * No need to invalidate here it will synchronize on > > + * against the special swap migration pte. 
> > + */ > > } else if (PageAnon(page)) { > > swp_entry_t entry = { .val = page_private(subpage) }; > > pte_t swp_pte; > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > WARN_ON_ONCE(1); > > ret = false; > > /* We have to invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > page_vma_mapped_walk_done(&pvmw); > > break; > > } > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > /* MADV_FREE page check */ > > if (!PageSwapBacked(page)) { > > if (!PageDirty(page)) { > > + /* Invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, > > + address, address + PAGE_SIZE); > > dec_mm_counter(mm, MM_ANONPAGES); > > goto discard; > > } > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > - } else > > + /* Invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > + } else { > > + /* > > + * We should not need to notify here as we reach this > > + * case only from freeze_page() itself only call from > > + * split_huge_page_to_list() so everything below must > > + * be true: > > + * - page is not anonymous > > + * - page is locked > > + * > > + * So as it is a locked file back page thus it can not > > + * be remove from the page cache and replace by a new > > + * page before mmu_notifier_invalidate_range_end so no > > + * concurrent thread might update its page table to > > + * point at new page while a device still is using this > > + * page. 
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > dec_mm_counter(mm, mm_counter_file(page)); > > + } > > discard: > > + /* > > + * No need to call mmu_notifier_invalidate_range() it has be > > + * done above for all cases requiring it to happen under page > > + * table lock before mmu_notifier_invalidate_range_end() > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > page_remove_rmap(subpage, PageHuge(page)); > > put_page(page); > > - mmu_notifier_invalidate_range(mm, address, > > - address + PAGE_SIZE); > > } > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > Looking at the patchset, I understand the efficiency, but I am concerned > with correctness.

I am fine with holding this off from reaching Linus, but the only way to flush these issues out, if there are any, is to have this patch in linux-next or somewhere where it gets a chance of being tested.

Note that the second patch is always safe. I agree that this one might not be if the hardware implementation is idiotic (well, that would be my opinion, and any opinion/point of view can be challenged :))

> > Balbir Singh.
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Balbir Singh Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Thu, 19 Oct 2017 21:53:11 +1100 Message-ID: References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20171019032811.GC5246@redhat.com> Sender: owner-linux-mm@kvack.org To: Jerome Glisse Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next List-Id: iommu@lists.linux-foundation.org
On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: >> On Mon, 16 Oct 2017 23:10:02 -0400 >> jglisse@redhat.com wrote: >> >> > From: Jérôme Glisse >> > >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() as we are >> > + * downgrading page table protection not changing it to point >> > + * to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > if (pmdp) { >> > #ifdef CONFIG_FS_DAX_PMD >> > pmd_t pmd; >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, >> > pmd = pmd_wrprotect(pmd); >> > pmd = pmd_mkclean(pmd); >> > set_pmd_at(vma->vm_mm, address, pmdp, pmd); >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); >> >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> > I am assuming hardware does sane thing of setting the dirty bit only > when walking the CPU page table when device does a write fault ie > once the device get a write TLB entry the dirty is set by the IOMMU > when walking the page table before returning the lookup result to the > device and that it won't be set again latter (ie propagated back > latter). >
The other possibility is that the hardware thinks the page is writable and already marked dirty. It allows writes and does not set the dirty bit?
> I should probably have spell that out and maybe some of the ATS/PASID > implementer did not do that. > >> >> > unlock_pmd: >> > spin_unlock(ptl); >> > #endif >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, >> > pte = pte_wrprotect(pte); >> > pte = pte_mkclean(pte); >> > set_pte_at(vma->vm_mm, address, ptep, pte); >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); >> >> Ditto >> >> > unlock_pte: >> > pte_unmap_unlock(ptep, ptl); >> > } >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h >> > index 6866e8126982..49c925c96b8a 100644 >> > --- a/include/linux/mmu_notifier.h >> > +++ b/include/linux/mmu_notifier.h >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { >> > * shared page-tables, it not necessary to implement the >> > * invalidate_range_start()/end() notifiers, as >> > * invalidate_range() alread catches the points in time when an >> > - * external TLB range needs to be flushed. >> > + * external TLB range needs to be flushed. For more in depth >> > + * discussion on this see Documentation/vm/mmu_notifier.txt >> > * >> > * The invalidate_range() function is called under the ptl >> > * spin-lock and not allowed to sleep.
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> > index c037d3d34950..ff5bc647b51d 100644 >> > --- a/mm/huge_memory.c >> > +++ b/mm/huge_memory.c >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, >> > goto out_free_pages; >> > VM_BUG_ON_PAGE(!PageHead(page), page); >> > >> > + /* >> > + * Leave pmd empty until pte is filled note we must notify here as >> > + * concurrent CPU thread might write to new page before the call to >> > + * mmu_notifier_invalidate_range_end() happens which can lead to a >> > + * device seeing memory write in different order than CPU. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); >> > - /* leave pmd empty until pte is filled */ >> > >> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); >> > pmd_populate(vma->vm_mm, &_pmd, pgtable); >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, >> > pmd_t _pmd; >> > int i; >> > >> > - /* leave pmd empty until pte is filled */ >> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); >> > + /* >> > + * Leave pmd empty until pte is filled note that it is fine to delay >> > + * notification until mmu_notifier_invalidate_range_end() as we are >> > + * replacing a zero pmd write protected page with a zero pte write >> > + * protected page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > + pmdp_huge_clear_flush(vma, haddr, pmd); >> >> Shouldn't the secondary TLB know if the page size changed? > > It should not matter, we are talking virtual to physical on behalf > of a device against a process address space. So the hardware should > not care about the page size. >
Does that not indicate how much the device can access? Could it try to access more than what is mapped?
> Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > 4K pages is replace by something new then a device TLB shootdown will > happen before the new page is set. > > Only issue i can think of is if the IOMMU TLB (if there is one) or > the device TLB (you do expect that there is one) does not invalidate > TLB entry if the TLB shootdown is smaller than the TLB entry. That > would be idiotic but yes i know hardware bug. > > >> >> > >> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); >> > pmd_populate(mm, &_pmd, pgtable); >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> > index 1768efa4c501..63a63f1b536c 100644 >> > --- a/mm/hugetlb.c >> > +++ b/mm/hugetlb.c >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, >> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); >> > } else { >> > if (cow) { >> > + /* >> > + * No need to notify as we are downgrading page >> > + * table protection not changing it to point >> > + * to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > huge_ptep_set_wrprotect(src, addr, src_pte); >> >> OK.. so we could get write faults on write accesses from the device. >> >> > - mmu_notifier_invalidate_range(src, mmun_start, >> > - mmun_end); >> > } >> > entry = huge_ptep_get(src_pte); >> > ptepage = pte_page(entry); >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, >> > * and that page table be reused and filled with junk. >> > */ >> > flush_hugetlb_tlb_range(vma, start, end); >> > - mmu_notifier_invalidate_range(mm, start, end); >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() we are downgrading >> > + * page table protection not changing it to point to a new page.
>> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > i_mmap_unlock_write(vma->vm_file->f_mapping); >> > mmu_notifier_invalidate_range_end(mm, start, end); >> > >> > diff --git a/mm/ksm.c b/mm/ksm.c >> > index 6cb60f46cce5..be8f4576f842 100644 >> > --- a/mm/ksm.c >> > +++ b/mm/ksm.c >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, >> > * So we clear the pte and flush the tlb before the check >> > * this assure us that no O_DIRECT can happen after the check >> > * or in the middle of the check. >> > + * >> > + * No need to notify as we are downgrading page table to read >> > + * only not changing it to point to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > */ >> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); >> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); >> > /* >> > * Check that no O_DIRECT or similar I/O is in progress on the >> > * page >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, >> > } >> > >> > flush_cache_page(vma, addr, pte_pfn(*ptep)); >> > - ptep_clear_flush_notify(vma, addr, ptep); >> > + /* >> > + * No need to notify as we are replacing a read only page with another >> > + * read only page with the same content.
>> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > + ptep_clear_flush(vma, addr, ptep); >> > set_pte_at_notify(mm, addr, ptep, newpte); >> > >> > page_remove_rmap(page, false); >> > diff --git a/mm/rmap.c b/mm/rmap.c >> > index 061826278520..6b5a0f219ac0 100644 >> > --- a/mm/rmap.c >> > +++ b/mm/rmap.c >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, >> > #endif >> > } >> > >> > - if (ret) { >> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() as we are >> > + * downgrading page table protection not changing it to point >> > + * to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > + if (ret) >> > (*cleaned)++; >> > - } >> > } >> > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > if (pte_soft_dirty(pteval)) >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); >> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); >> > + /* >> > + * No need to invalidate here it will synchronize on >> > + * against the special swap migration pte. >> > + */ >> > goto discard; >> > } >> > >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > * will take care of the rest.
>> > */ >> > dec_mm_counter(mm, mm_counter(page)); >> > + /* We have to invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, address, >> > + address + PAGE_SIZE); >> > } else if (IS_ENABLED(CONFIG_MIGRATION) && >> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { >> > swp_entry_t entry; >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > if (pte_soft_dirty(pteval)) >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); >> > set_pte_at(mm, address, pvmw.pte, swp_pte); >> > + /* >> > + * No need to invalidate here it will synchronize on >> > + * against the special swap migration pte. >> > + */ >> > } else if (PageAnon(page)) { >> > swp_entry_t entry = { .val = page_private(subpage) }; >> > pte_t swp_pte; >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > WARN_ON_ONCE(1); >> > ret = false; >> > /* We have to invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, address, >> > + address + PAGE_SIZE); >> > page_vma_mapped_walk_done(&pvmw); >> > break; >> > } >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > /* MADV_FREE page check */ >> > if (!PageSwapBacked(page)) { >> > if (!PageDirty(page)) { >> > + /* Invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, >> > + address, address + PAGE_SIZE); >> > dec_mm_counter(mm, MM_ANONPAGES); >> > goto discard; >> > } >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > if (pte_soft_dirty(pteval)) >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); >> > set_pte_at(mm, address, pvmw.pte, swp_pte); >> > - } else >> > + /* Invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, address, >> > + address + PAGE_SIZE); >> > + } else { >> > + /* >> > + * We should not need to notify here as we reach this
>> > + * case only from freeze_page() itself only call from >> > + * split_huge_page_to_list() so everything below must >> > + * be true: >> > + * - page is not anonymous >> > + * - page is locked >> > + * >> > + * So as it is a locked file back page thus it can not >> > + * be remove from the page cache and replace by a new >> > + * page before mmu_notifier_invalidate_range_end so no >> > + * concurrent thread might update its page table to >> > + * point at new page while a device still is using this >> > + * page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > dec_mm_counter(mm, mm_counter_file(page)); >> > + } >> > discard: >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() it has be >> > + * done above for all cases requiring it to happen under page >> > + * table lock before mmu_notifier_invalidate_range_end() >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > page_remove_rmap(subpage, PageHuge(page)); >> > put_page(page); >> > - mmu_notifier_invalidate_range(mm, address, >> > - address + PAGE_SIZE); >> > } >> > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); >> >> Looking at the patchset, I understand the efficiency, but I am concerned >> with correctness. > > I am fine in holding this off from reaching Linus but only way to flush this > issues out if any is to have this patch in linux-next or somewhere were they > get a chance of being tested. >
Yep, I would like to see some additional testing around npu and get Alistair Popple to comment as well
> Note that the second patch is always safe. I agree that this one might > not be if hardware implementation is idiotic (well that would be my > opinion and any opinion/point of view can be challenge :))
You mean the only_end variant that avoids shootdown after pmd/pte changes that avoid the _start/_end and have just the only_end variant?
That seemed reasonable to me, but I've not tested it or evaluated it in depth
Balbir Singh.
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Thu, 19 Oct 2017 12:58:23 -0400 Message-ID: <20171019165823.GA3044@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next List-Id: iommu@lists.linux-foundation.org
On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > >> On Mon, 16 Oct 2017 23:10:02 -0400 > >> jglisse@redhat.com wrote: > >> > >> > From: Jérôme Glisse > >> > > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page.
> >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > if (pmdp) { > >> > #ifdef CONFIG_FS_DAX_PMD > >> > pmd_t pmd; > >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pmd = pmd_wrprotect(pmd); > >> > pmd = pmd_mkclean(pmd); > >> > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > I am assuming hardware does sane thing of setting the dirty bit only > > when walking the CPU page table when device does a write fault ie > > once the device get a write TLB entry the dirty is set by the IOMMU > > when walking the page table before returning the lookup result to the > > device and that it won't be set again latter (ie propagated back > > latter). > > > > The other possibility is that the hardware things the page is writable > and already > marked dirty. It allows writes and does not set the dirty bit?

I thought about this some more, and the patch cannot regress anything that is not already broken today. If we assume that a device can propagate the dirty bit because it can cache the write permission, then all current code is broken for two reasons:

First, current code clears the pte entry, builds a new pte value with write protection, and updates the pte entry with the new value. So any PASID/ATS platform that allows a device to cache the write bit and set the dirty bit any time after that can race during that window, and you would lose the dirty bit from the device. That is not that bad, as you are going to propagate the dirty bit to the struct page.

Second, the dirty bit may be propagated back to the new write-protected pte. From a quick look at the code, it seems that when we zap a pte or mkclean we do not check that the pte has write permission but only care about the dirty bit. So it should not have any bad consequences.
After this patch, only the second window becomes bigger, and thus that race is more likely to happen. But nothing sinister should come of that.
> > > I should probably have spell that out and maybe some of the ATS/PASID > > implementer did not do that. > > > >> > >> > unlock_pmd: > >> > spin_unlock(ptl); > >> > #endif > >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pte = pte_wrprotect(pte); > >> > pte = pte_mkclean(pte); > >> > set_pte_at(vma->vm_mm, address, ptep, pte); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Ditto > >> > >> > unlock_pte: > >> > pte_unmap_unlock(ptep, ptl); > >> > } > >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > >> > index 6866e8126982..49c925c96b8a 100644 > >> > --- a/include/linux/mmu_notifier.h > >> > +++ b/include/linux/mmu_notifier.h > >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > >> > * shared page-tables, it not necessary to implement the > >> > * invalidate_range_start()/end() notifiers, as > >> > * invalidate_range() alread catches the points in time when an > >> > - * external TLB range needs to be flushed. > >> > + * external TLB range needs to be flushed. For more in depth > >> > + * discussion on this see Documentation/vm/mmu_notifier.txt > >> > * > >> > * The invalidate_range() function is called under the ptl > >> > * spin-lock and not allowed to sleep.
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> > index c037d3d34950..ff5bc647b51d 100644 > >> > --- a/mm/huge_memory.c > >> > +++ b/mm/huge_memory.c > >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > >> > goto out_free_pages; > >> > VM_BUG_ON_PAGE(!PageHead(page), page); > >> > > >> > + /* > >> > + * Leave pmd empty until pte is filled note we must notify here as > >> > + * concurrent CPU thread might write to new page before the call to > >> > + * mmu_notifier_invalidate_range_end() happens which can lead to a > >> > + * device seeing memory write in different order than CPU. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > >> > - /* leave pmd empty until pte is filled */ > >> > > >> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > >> > pmd_populate(vma->vm_mm, &_pmd, pgtable); > >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > >> > pmd_t _pmd; > >> > int i; > >> > > >> > - /* leave pmd empty until pte is filled */ > >> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > >> > + /* > >> > + * Leave pmd empty until pte is filled note that it is fine to delay > >> > + * notification until mmu_notifier_invalidate_range_end() as we are > >> > + * replacing a zero pmd write protected page with a zero pte write > >> > + * protected page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + pmdp_huge_clear_flush(vma, haddr, pmd); > >> > >> Shouldn't the secondary TLB know if the page size changed? > > > > It should not matter, we are talking virtual to physical on behalf > > of a device against a process address space. So the hardware should > > not care about the page size. > > > > Does that not indicate how much the device can access? Could it try > to access more than what is mapped? 
Assume the device has a huge TLB, with 2MB huge pages and 4K small pages. You are going from one TLB entry covering a 2MB zero page to 512 TLB entries each covering 4K. Both cases are read only, and both point to the same data (ie zero).

It is fine to delay the TLB invalidation on the device until the call to mmu_notifier_invalidate_range_end(). The device will keep using the huge TLB entry for a little longer, but both CPU and device are looking at the same data.

Now if a racing thread replaces one of the 512 zero pages after the split but before mmu_notifier_invalidate_range_end(), that code path would call mmu_notifier_invalidate_range() before changing the pte to point to something else, which should shoot down the device TLB (it would be a serious device bug if this did not work).

> > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > 4K pages is replace by something new then a device TLB shootdown will > > happen before the new page is set. > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > the device TLB (you do expect that there is one) does not invalidate > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > would be idiotic but yes i know hardware bug. > > > > > >> > >> > > >> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > >> > pmd_populate(mm, &_pmd, pgtable); > >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >> > index 1768efa4c501..63a63f1b536c 100644 > >> > --- a/mm/hugetlb.c > >> > +++ b/mm/hugetlb.c > >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > >> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > >> > } else { > >> > if (cow) { > >> > + /* > >> > + * No need to notify as we are downgrading page > >> > + * table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > huge_ptep_set_wrprotect(src, addr, src_pte); > >> > >> OK.. 
so we could get write faults on write accesses from the device. > >> > >> > - mmu_notifier_invalidate_range(src, mmun_start, > >> > - mmun_end); > >> > } > >> > entry = huge_ptep_get(src_pte); > >> > ptepage = pte_page(entry); > >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > >> > * and that page table be reused and filled with junk. > >> > */ > >> > flush_hugetlb_tlb_range(vma, start, end); > >> > - mmu_notifier_invalidate_range(mm, start, end); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() we are downgrading > >> > + * page table protection not changing it to point to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > i_mmap_unlock_write(vma->vm_file->f_mapping); > >> > mmu_notifier_invalidate_range_end(mm, start, end); > >> > > >> > diff --git a/mm/ksm.c b/mm/ksm.c > >> > index 6cb60f46cce5..be8f4576f842 100644 > >> > --- a/mm/ksm.c > >> > +++ b/mm/ksm.c > >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > >> > * So we clear the pte and flush the tlb before the check > >> > * this assure us that no O_DIRECT can happen after the check > >> > * or in the middle of the check. > >> > + * > >> > + * No need to notify as we are downgrading page table to read > >> > + * only not changing it to point to a new page. 
> >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > */ > >> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > >> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > >> > /* > >> > * Check that no O_DIRECT or similar I/O is in progress on the > >> > * page > >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > >> > } > >> > > >> > flush_cache_page(vma, addr, pte_pfn(*ptep)); > >> > - ptep_clear_flush_notify(vma, addr, ptep); > >> > + /* > >> > + * No need to notify as we are replacing a read only page with another > >> > + * read only page with the same content. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + ptep_clear_flush(vma, addr, ptep); > >> > set_pte_at_notify(mm, addr, ptep, newpte); > >> > > >> > page_remove_rmap(page, false); > >> > diff --git a/mm/rmap.c b/mm/rmap.c > >> > index 061826278520..6b5a0f219ac0 100644 > >> > --- a/mm/rmap.c > >> > +++ b/mm/rmap.c > >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > >> > #endif > >> > } > >> > > >> > - if (ret) { > >> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + if (ret) > >> > (*cleaned)++; > >> > - } > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. 
> >> > + */ > >> > goto discard; > >> > } > >> > > >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > * will take care of the rest. > >> > */ > >> > dec_mm_counter(mm, mm_counter(page)); > >> > + /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > } else if (IS_ENABLED(CONFIG_MIGRATION) && > >> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > >> > swp_entry_t entry; > >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. > >> > + */ > >> > } else if (PageAnon(page)) { > >> > swp_entry_t entry = { .val = page_private(subpage) }; > >> > pte_t swp_pte; > >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > WARN_ON_ONCE(1); > >> > ret = false; > >> > /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > page_vma_mapped_walk_done(&pvmw); > >> > break; > >> > } > >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > /* MADV_FREE page check */ > >> > if (!PageSwapBacked(page)) { > >> > if (!PageDirty(page)) { > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, > >> > + address, address + PAGE_SIZE); > >> > dec_mm_counter(mm, MM_ANONPAGES); > >> > goto discard; > >> > } > >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, 
pvmw.pte, swp_pte); > >> > - } else > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > + } else { > >> > + /* > >> > + * We should not need to notify here as we reach this > >> > + * case only from freeze_page() itself only call from > >> > + * split_huge_page_to_list() so everything below must > >> > + * be true: > >> > + * - page is not anonymous > >> > + * - page is locked > >> > + * > >> > + * So as it is a locked file back page thus it can not > >> > + * be remove from the page cache and replace by a new > >> > + * page before mmu_notifier_invalidate_range_end so no > >> > + * concurrent thread might update its page table to > >> > + * point at new page while a device still is using this > >> > + * page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > dec_mm_counter(mm, mm_counter_file(page)); > >> > + } > >> > discard: > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() it has be > >> > + * done above for all cases requiring it to happen under page > >> > + * table lock before mmu_notifier_invalidate_range_end() > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > page_remove_rmap(subpage, PageHuge(page)); > >> > put_page(page); > >> > - mmu_notifier_invalidate_range(mm, address, > >> > - address + PAGE_SIZE); > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > >> Looking at the patchset, I understand the efficiency, but I am concerned > >> with correctness. > > > > I am fine in holding this off from reaching Linus but only way to flush this > > issues out if any is to have this patch in linux-next or somewhere were they > > get a chance of being tested. > > > > Yep, I would like to see some additional testing around npu and get Alistair > Popple to comment as well I think this patch is fine. 
The only race window that it might make bigger should have no bad consequences. > > > Note that the second patch is always safe. I agree that this one might > > not be if hardware implementation is idiotic (well that would be my > > opinion and any opinion/point of view can be challenge :)) > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > that avoid the _start/_end and have just the only_end variant? That seemed > reasonable to me, but I've not tested it or evaluated it in depth Yes, patch 2/2 in this series is definitely fine. It invalidates the device TLB right after clearing the pte entry and avoids a later, unnecessary invalidation of the same TLB. Jérôme
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Balbir Singh Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Sat, 21 Oct 2017 16:54:40 +1100 Message-ID: <1508565280.5662.6.camel@gmail.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Return-path: In-Reply-To: <20171019165823.GA3044-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org Errors-To: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org To: Jerome Glisse Cc: Andrea Arcangeli , Stephen Rothwell , Joerg Roedel , Benjamin Herrenschmidt , Andrew Donnellan , "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , "linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org" , linux-mm , 
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, linux-next , Michael Ellerman , Alistair Popple , Andrew Morton , Linus Torvalds , David Woodhouse List-Id: iommu@lists.linux-foundation.org

On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > jglisse@redhat.com wrote:
> > > >
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > >
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             if (pmdp) {
> > > > >  #ifdef CONFIG_FS_DAX_PMD
> > > > >                     pmd_t pmd;
> > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pmd = pmd_wrprotect(pmd);
> > > > >                     pmd = pmd_mkclean(pmd);
> > > > >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > >
> > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> > >
> > > I am assuming hardware does sane thing of setting the dirty bit only
> > > when walking the CPU page table when device does a write fault ie
> > > once the device get a write TLB entry the dirty is set by the IOMMU
> > > when walking the page table before returning the lookup result to the
> > > device and that it won't be set again latter (ie propagated back
> > > latter).
> > >
> >
> > The other possibility is that the hardware things the page is writable
> > and already
> > marked dirty. It allows writes and does not set the dirty bit?
>
> I thought about this some more and the patch can not regress anything
> that is not broken today. So if we assume that device can propagate
> dirty bit because it can cache the write protection than all current
> code is broken for two reasons:
>
> First one is current code clear pte entry, build a new pte value with
> write protection and update pte entry with new pte value. So any PASID/
> ATS platform that allows device to cache the write bit and set dirty
> bit anytime after that can race during that window and you would loose
> the dirty bit of the device. That is not that bad as you are gonna
> propagate the dirty bit to the struct page.

But they stay consistent with the notifiers, so from the OS perspective
it notifies of any PTE changes as they happen. When the ATS platform sees
invalidation, it invalidates it's PTE's as well.

I was speaking of the case where the ATS platform could assume it has
write access and has not seen any invalidation, the OS could return
back to user space or the caller with write bit clear, but the ATS
platform could still do a write since it's not seen the invalidation.

>
> Second one is if the dirty bit is propagated back to the new write
> protected pte. Quick look at code it seems that when we zap pte or
> or mkclean we don't check that the pte has write permission but only
> care about the dirty bit. So it should not have any bad consequence.
>
> After this patch only the second window is bigger and thus more likely
> to happen. But nothing sinister should happen from that.
>
>
> >
> > > I should probably have spell that out and maybe some of the ATS/PASID
> > > implementer did not do that.
> > >
> > > >
> > > > >  unlock_pmd:
> > > > >                     spin_unlock(ptl);
> > > > >  #endif
> > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pte = pte_wrprotect(pte);
> > > > >                     pte = pte_mkclean(pte);
> > > > >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > >
> > > > Ditto
> > > >
> > > > >  unlock_pte:
> > > > >                     pte_unmap_unlock(ptep, ptl);
> > > > >             }
> > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > > > > index 6866e8126982..49c925c96b8a 100644
> > > > > --- a/include/linux/mmu_notifier.h
> > > > > +++ b/include/linux/mmu_notifier.h
> > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> > > > >      * shared page-tables, it not necessary to implement the
> > > > >      * invalidate_range_start()/end() notifiers, as
> > > > >      * invalidate_range() alread catches the points in time when an
> > > > > -    * external TLB range needs to be flushed.
> > > > > +    * external TLB range needs to be flushed. For more in depth
> > > > > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> > > > >      *
> > > > >      * The invalidate_range() function is called under the ptl
> > > > >      * spin-lock and not allowed to sleep.
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > >             goto out_free_pages;
> > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > >
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled note we must notify here as
> > > > > +    * concurrent CPU thread might write to new page before the call to
> > > > > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> > > > > +    * device seeing memory write in different order than CPU.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > -   /* leave pmd empty until pte is filled */
> > > > >
> > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > >     pmd_t _pmd;
> > > > >     int i;
> > > > >
> > > > > -   /* leave pmd empty until pte is filled */
> > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled note that it is fine to delay
> > > > > +    * notification until mmu_notifier_invalidate_range_end() as we are
> > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > +    * protected page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > >
> > > > Shouldn't the secondary TLB know if the page size changed?
> > >
> > > It should not matter, we are talking virtual to physical on behalf
> > > of a device against a process address space. So the hardware should
> > > not care about the page size.
> > >
> >
> > Does that not indicate how much the device can access? Could it try
> > to access more than what is mapped?
>
> Assuming device has huge TLB and 2MB huge page with 4K small page.
> You are going from one 1 TLB covering a 2MB zero page to 512 TLB
> each covering 4K. Both case is read only and both case are pointing
> to same data (ie zero).
>
> It is fine to delay the TLB invalidate on the device to the call of
> mmu_notifier_invalidate_range_end(). The device will keep using the
> huge TLB for a little longer but both CPU and device are looking at
> same data.
>
> Now if there is a racing thread that replace one of the 512 zeor page
> after the split but before mmu_notifier_invalidate_range_end() that
> code path would call mmu_notifier_invalidate_range() before changing
> the pte to point to something else. Which should shoot down the device
> TLB (it would be a serious device bug if this did not work).

OK.. This seems reasonable, but I'd really like to see if it can be
tested

>
>
> >
> > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > > 4K pages is replace by something new then a device TLB shootdown will
> > > happen before the new page is set.
> > >
> > > Only issue i can think of is if the IOMMU TLB (if there is one) or
> > > the device TLB (you do expect that there is one) does not invalidate
> > > TLB entry if the TLB shootdown is smaller than the TLB entry. That
> > > would be idiotic but yes i know hardware bug.
> > >
> > >
> > > >
> > > > >
> > > > >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > > > >     pmd_populate(mm, &_pmd, pgtable);
> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 1768efa4c501..63a63f1b536c 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > > > >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > > > >             } else {
> > > > >                     if (cow) {
> > > > > +                           /*
> > > > > +                            * No need to notify as we are downgrading page
> > > > > +                            * table protection not changing it to point
> > > > > +                            * to a new page.
> > > > > +                            *
> > > > > +                            * See Documentation/vm/mmu_notifier.txt
> > > > > +                            */
> > > > >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> > > >
> > > > OK.. so we could get write faults on write accesses from the device.
> > > >
> > > > > -                           mmu_notifier_invalidate_range(src, mmun_start,
> > > > > -                                                              mmun_end);
> > > > >                     }
> > > > >                     entry = huge_ptep_get(src_pte);
> > > > >                     ptepage = pte_page(entry);
> > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > > > >      * and that page table be reused and filled with junk.
> > > > >      */
> > > > >     flush_hugetlb_tlb_range(vma, start, end);
> > > > > -   mmu_notifier_invalidate_range(mm, start, end);
> > > > > +   /*
> > > > > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> > > > > +    * page table protection not changing it to point to a new page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > > >     mmu_notifier_invalidate_range_end(mm, start, end);
> > > > >
> > > > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > > > index 6cb60f46cce5..be8f4576f842 100644
> > > > > --- a/mm/ksm.c
> > > > > +++ b/mm/ksm.c
> > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> > > > >              * So we clear the pte and flush the tlb before the check
> > > > >              * this assure us that no O_DIRECT can happen after the check
> > > > >              * or in the middle of the check.
> > > > > +            *
> > > > > +            * No need to notify as we are downgrading page table to read
> > > > > +            * only not changing it to point to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > >              */
> > > > > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > > > > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> > > > >             /*
> > > > >              * Check that no O_DIRECT or similar I/O is in progress on the
> > > > >              * page
> > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> > > > >     }
> > > > >
> > > > >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> > > > > -   ptep_clear_flush_notify(vma, addr, ptep);
> > > > > +   /*
> > > > > +    * No need to notify as we are replacing a read only page with another
> > > > > +    * read only page with the same content.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   ptep_clear_flush(vma, addr, ptep);
> > > > >     set_pte_at_notify(mm, addr, ptep, newpte);
> > > > >
> > > > >     page_remove_rmap(page, false);
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 061826278520..6b5a0f219ac0 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > > > >  #endif
> > > > >             }
> > > > >
> > > > > -           if (ret) {
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > > +           if (ret)
> > > > >                     (*cleaned)++;
> > > > > -           }
> > > > >     }
> > > > >
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here it will synchronize on
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >                     goto discard;
> > > > >             }
> > > > >
> > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                      * will take care of the rest.
> > > > >                      */
> > > > >                     dec_mm_counter(mm, mm_counter(page));
> > > > > +                   /* We have to invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> > > > >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> > > > >                     swp_entry_t entry;
> > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here it will synchronize on
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >             } else if (PageAnon(page)) {
> > > > >                     swp_entry_t entry = { .val = page_private(subpage) };
> > > > >                     pte_t swp_pte;
> > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                             WARN_ON_ONCE(1);
> > > > >                             ret = false;
> > > > >                             /* We have to invalidate as we cleared the pte */
> > > > > +                           mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                         address + PAGE_SIZE);
> > > > >                             page_vma_mapped_walk_done(&pvmw);
> > > > >                             break;
> > > > >                     }
> > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     /* MADV_FREE page check */
> > > > >                     if (!PageSwapBacked(page)) {
> > > > >                             if (!PageDirty(page)) {
> > > > > +                                   /* Invalidate as we cleared the pte */
> > > > > +                                   mmu_notifier_invalidate_range(mm,
> > > > > +                                           address, address + PAGE_SIZE);
> > > > >                                     dec_mm_counter(mm, MM_ANONPAGES);
> > > > >                                     goto discard;
> > > > >                             }
> > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > -           } else
> > > > > +                   /* Invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > > +           } else {
> > > > > +                   /*
> > > > > +                    * We should not need to notify here as we reach this
> > > > > +                    * case only from freeze_page() itself only call from
> > > > > +                    * split_huge_page_to_list() so everything below must
> > > > > +                    * be true:
> > > > > +                    *   - page is not anonymous
> > > > > +                    *   - page is locked
> > > > > +                    *
> > > > > +                    * So as it is a locked file back page thus it can not
> > > > > +                    * be remove from the page cache and replace by a new
> > > > > +                    * page before mmu_notifier_invalidate_range_end so no
> > > > > +                    * concurrent thread might update its page table to
> > > > > +                    * point at new page while a device still is using this
> > > > > +                    * page.
> > > > > +                    *
> > > > > +                    * See Documentation/vm/mmu_notifier.txt
> > > > > +                    */
> > > > >                     dec_mm_counter(mm, mm_counter_file(page));
> > > > > +           }
> > > > >  discard:
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() it has be
> > > > > +            * done above for all cases requiring it to happen under page
> > > > > +            * table lock before mmu_notifier_invalidate_range_end()
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             page_remove_rmap(subpage, PageHuge(page));
> > > > >             put_page(page);
> > > > > -           mmu_notifier_invalidate_range(mm, address,
> > > > > -                                         address + PAGE_SIZE);
> > > > >     }
> > > > >
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > >
> > > > Looking at the patchset, I understand the efficiency, but I am concerned
> > > > with correctness.
> > >
> > > I am fine in holding this off from reaching Linus but only way to flush this
> > > issues out if any is to have this patch in linux-next or somewhere were they
> > > get a chance of being tested.
> > >
> >
> > Yep, I would like to see some additional testing around npu and get Alistair
> > Popple to comment as well
>
> I think this patch is fine. The only one race window that it might make
> bigger should have no bad consequences.
>
> >
> > > Note that the second patch is always safe. I agree that this one might
> > > not be if hardware implementation is idiotic (well that would be my
> > > opinion and any opinion/point of view can be challenge :))
> >
> >
> > You mean the only_end variant that avoids shootdown after pmd/pte changes
> > that avoid the _start/_end and have just the only_end variant? That seemed
> > reasonable to me, but I've not tested it or evaluated it in depth
>
> Yes, patch 2/2 in this serie is definitly fine. It invalidate the device
> TLB right after clearing pte entry and avoid latter unecessary invalidation
> of same TLB.
>
> Jérôme

Balbir Singh.

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Sat, 21 Oct 2017 11:47:03 -0400 Message-ID: <20171021154703.GA30458@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> <1508565280.5662.6.camel@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Return-path: Content-Disposition: inline In-Reply-To: <1508565280.5662.6.camel@gmail.com> Sender: linux-next-owner@vger.kernel.org To: Balbir Singh Cc: 
linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next List-Id: iommu@lists.linux-foundation.org On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > jglisse@redhat.com wrote: > > > > > > > > > > > From: Jérôme Glisse > > > > > > > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > if (pmdp) { > > > > > > #ifdef CONFIG_FS_DAX_PMD > > > > > > pmd_t pmd; > > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pmd = pmd_wrprotect(pmd); > > > > > > pmd = pmd_mkclean(pmd); > > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? 
> > > > > > > > I am assuming hardware does sane thing of setting the dirty bit only > > > > when walking the CPU page table when device does a write fault ie > > > > once the device get a write TLB entry the dirty is set by the IOMMU > > > > when walking the page table before returning the lookup result to the > > > > device and that it won't be set again latter (ie propagated back > > > > latter). > > > > > > > > > > The other possibility is that the hardware things the page is writable > > > and already > > > marked dirty. It allows writes and does not set the dirty bit? > > > > I thought about this some more and the patch can not regress anything > > that is not broken today. So if we assume that device can propagate > > dirty bit because it can cache the write protection than all current > > code is broken for two reasons: > > > > First one is current code clear pte entry, build a new pte value with > > write protection and update pte entry with new pte value. So any PASID/ > > ATS platform that allows device to cache the write bit and set dirty > > bit anytime after that can race during that window and you would loose > > the dirty bit of the device. That is not that bad as you are gonna > > propagate the dirty bit to the struct page. > > But they stay consistent with the notifiers, so from the OS perspective > it notifies of any PTE changes as they happen. When the ATS platform sees > invalidation, it invalidates it's PTE's as well. > > I was speaking of the case where the ATS platform could assume it has > write access and has not seen any invalidation, the OS could return > back to user space or the caller with write bit clear, but the ATS > platform could still do a write since it's not seen the invalidation. I understood what you said and what is above apply. I am removing only one of the invalidation not both. So with that patch the invalidation is delayed after the page table lock drop but before dax/page_mkclean returns. 
Hence any further activity will be read only on any device too once we exit those functions.

The only difference is the window during which the device can report a dirty pte. Before that patch the 2 "~bogus~" windows were small:
  First window: between pmd/pte_get_clear_flush and set_pte/pmd
  Second window: between set_pte/pmd and mmu_notifier_invalidate_range

The first window stays the same, the second window is bigger, potentially a lot bigger if the thread is preempted before mmu_notifier_invalidate_range_end. But that is fine, as in that case the page is reported as dirty and thus we are not missing anything, and the kernel code does not care about seeing a read only pte marked as dirty.

> > > > > Second one is if the dirty bit is propagated back to the new write > > protected pte. Quick look at code it seems that when we zap pte or > > or mkclean we don't check that the pte has write permission but only > > care about the dirty bit. So it should not have any bad consequence. > > > > After this patch only the second window is bigger and thus more likely > > to happen. But nothing sinister should happen from that. > > > > > > > > > > > I should probably have spell that out and maybe some of the ATS/PASID > > > > implementer did not do that.
> > > > > > > > > > > > > > > unlock_pmd: > > > > > > spin_unlock(ptl); > > > > > > #endif > > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pte = pte_wrprotect(pte); > > > > > > pte = pte_mkclean(pte); > > > > > > set_pte_at(vma->vm_mm, address, ptep, pte); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Ditto > > > > > > > > > > > unlock_pte: > > > > > > pte_unmap_unlock(ptep, ptl); > > > > > > } > > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > > > > > index 6866e8126982..49c925c96b8a 100644 > > > > > > --- a/include/linux/mmu_notifier.h > > > > > > +++ b/include/linux/mmu_notifier.h > > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > > > > > * shared page-tables, it not necessary to implement the > > > > > > * invalidate_range_start()/end() notifiers, as > > > > > > * invalidate_range() alread catches the points in time when an > > > > > > - * external TLB range needs to be flushed. > > > > > > + * external TLB range needs to be flushed. For more in depth > > > > > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > > > > > * > > > > > > * The invalidate_range() function is called under the ptl > > > > > > * spin-lock and not allowed to sleep. 
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > --- a/mm/huge_memory.c > > > > > > +++ b/mm/huge_memory.c > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > goto out_free_pages; > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > + * device seeing memory write in different order than CPU. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > pmd_t _pmd; > > > > > > int i; > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > + * protected page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > of a device against a process address space. 
So the hardware should > > > > not care about the page size. > > > > > > > > > > Does that not indicate how much the device can access? Could it try > > > to access more than what is mapped? > > > > Assuming device has huge TLB and 2MB huge page with 4K small page. > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > > each covering 4K. Both case is read only and both case are pointing > > to same data (ie zero). > > > > It is fine to delay the TLB invalidate on the device to the call of > > mmu_notifier_invalidate_range_end(). The device will keep using the > > huge TLB for a little longer but both CPU and device are looking at > > same data. > > > > Now if there is a racing thread that replace one of the 512 zeor page > > after the split but before mmu_notifier_invalidate_range_end() that > > code path would call mmu_notifier_invalidate_range() before changing > > the pte to point to something else. Which should shoot down the device > > TLB (it would be a serious device bug if this did not work). > > OK.. This seems reasonable, but I'd really like to see if it can be > tested

Well, it is hard to test because of many factors. First, each device might react differently. Devices that only store TLB entries at 4k granularity are fine. A clever device that can store TLB entries for 4k, 2M, ... could ignore an invalidation that is smaller than its TLB entry, ie getting a 4K invalidation would not invalidate a 2MB TLB entry in the device. I consider this buggy. I will go look at the PCIE ATS specification one more time and see if there is any wording related to that. I might bring up a question to the PCIE standard body if not.

The second factor is that it is a race between the zero page split and a write fault. I can probably do a crappy patch that calls msleep() if a split happens against a given mm to increase the race window.
But i would be testing against one device (right now i can only access AMD IOMMUv2 devices with discret ATS GPU) > > > > > > > > > > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > > > 4K pages is replace by something new then a device TLB shootdown will > > > > happen before the new page is set. > > > > > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > > > the device TLB (you do expect that there is one) does not invalidate > > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > > > would be idiotic but yes i know hardware bug. > > > > > > > > > > > > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > > > > > pmd_populate(mm, &_pmd, pgtable); > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > > index 1768efa4c501..63a63f1b536c 100644 > > > > > > --- a/mm/hugetlb.c > > > > > > +++ b/mm/hugetlb.c > > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > > > > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > > > > > } else { > > > > > > if (cow) { > > > > > > + /* > > > > > > + * No need to notify as we are downgrading page > > > > > > + * table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > huge_ptep_set_wrprotect(src, addr, src_pte); > > > > > > > > > > OK.. so we could get write faults on write accesses from the device. > > > > > > > > > > > - mmu_notifier_invalidate_range(src, mmun_start, > > > > > > - mmun_end); > > > > > > } > > > > > > entry = huge_ptep_get(src_pte); > > > > > > ptepage = pte_page(entry); > > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > > > > > * and that page table be reused and filled with junk. 
> > > > > > */ > > > > > > flush_hugetlb_tlb_range(vma, start, end); > > > > > > - mmu_notifier_invalidate_range(mm, start, end); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > > > > > + * page table protection not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > > > > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > > > > > > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > > > > > index 6cb60f46cce5..be8f4576f842 100644 > > > > > > --- a/mm/ksm.c > > > > > > +++ b/mm/ksm.c > > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > > > > > * So we clear the pte and flush the tlb before the check > > > > > > * this assure us that no O_DIRECT can happen after the check > > > > > > * or in the middle of the check. > > > > > > + * > > > > > > + * No need to notify as we are downgrading page table to read > > > > > > + * only not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > */ > > > > > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > > > > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > > > > > /* > > > > > > * Check that no O_DIRECT or similar I/O is in progress on the > > > > > > * page > > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > > > > > } > > > > > > > > > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > > > > > - ptep_clear_flush_notify(vma, addr, ptep); > > > > > > + /* > > > > > > + * No need to notify as we are replacing a read only page with another > > > > > > + * read only page with the same content. 
> > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + ptep_clear_flush(vma, addr, ptep); > > > > > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > > > > > > > > > page_remove_rmap(page, false); > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > > > > index 061826278520..6b5a0f219ac0 100644 > > > > > > --- a/mm/rmap.c > > > > > > +++ b/mm/rmap.c > > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > > > > > #endif > > > > > > } > > > > > > > > > > > > - if (ret) { > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + if (ret) > > > > > > (*cleaned)++; > > > > > > - } > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. > > > > > > + */ > > > > > > goto discard; > > > > > > } > > > > > > > > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > * will take care of the rest. 
> > > > > > */ > > > > > > dec_mm_counter(mm, mm_counter(page)); > > > > > > + /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > > > > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > > > > > swp_entry_t entry; > > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. > > > > > > + */ > > > > > > } else if (PageAnon(page)) { > > > > > > swp_entry_t entry = { .val = page_private(subpage) }; > > > > > > pte_t swp_pte; > > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > WARN_ON_ONCE(1); > > > > > > ret = false; > > > > > > /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > page_vma_mapped_walk_done(&pvmw); > > > > > > break; > > > > > > } > > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > /* MADV_FREE page check */ > > > > > > if (!PageSwapBacked(page)) { > > > > > > if (!PageDirty(page)) { > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, > > > > > > + address, address + PAGE_SIZE); > > > > > > dec_mm_counter(mm, MM_ANONPAGES); > > > > > > goto discard; > > > > > > } > > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > 
set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > - } else > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > + } else { > > > > > > + /* > > > > > > + * We should not need to notify here as we reach this > > > > > > + * case only from freeze_page() itself only call from > > > > > > + * split_huge_page_to_list() so everything below must > > > > > > + * be true: > > > > > > + * - page is not anonymous > > > > > > + * - page is locked > > > > > > + * > > > > > > + * So as it is a locked file back page thus it can not > > > > > > + * be remove from the page cache and replace by a new > > > > > > + * page before mmu_notifier_invalidate_range_end so no > > > > > > + * concurrent thread might update its page table to > > > > > > + * point at new page while a device still is using this > > > > > > + * page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > dec_mm_counter(mm, mm_counter_file(page)); > > > > > > + } > > > > > > discard: > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() it has be > > > > > > + * done above for all cases requiring it to happen under page > > > > > > + * table lock before mmu_notifier_invalidate_range_end() > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > page_remove_rmap(subpage, PageHuge(page)); > > > > > > put_page(page); > > > > > > - mmu_notifier_invalidate_range(mm, address, > > > > > > - address + PAGE_SIZE); > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > > > > > Looking at the patchset, I understand the efficiency, but I am concerned > > > > > with correctness. 
> > > > > > > > I am fine in holding this off from reaching Linus but only way to flush this > > > > issues out if any is to have this patch in linux-next or somewhere were they > > > > get a chance of being tested. > > > > > > > > > > Yep, I would like to see some additional testing around npu and get Alistair > > > Popple to comment as well > > > > I think this patch is fine. The only one race window that it might make > > bigger should have no bad consequences. > > > > > > > > > Note that the second patch is always safe. I agree that this one might > > > > not be if hardware implementation is idiotic (well that would be my > > > > opinion and any opinion/point of view can be challenge :)) > > > > > > > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > > > that avoid the _start/_end and have just the only_end variant? That seemed > > > reasonable to me, but I've not tested it or evaluated it in depth > > > > Yes, patch 2/2 in this serie is definitly fine. It invalidate the device > > TLB right after clearing pte entry and avoid latter unecessary invalidation > > of same TLB. > > > > Jérôme > > Balbir Singh. 
> From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Mon, 23 Oct 2017 16:35:01 -0400 Message-ID: <20171023203501.GA9371@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> <1508565280.5662.6.camel@gmail.com> <20171021154703.GA30458@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Return-path: Content-Disposition: inline In-Reply-To: <20171021154703.GA30458@redhat.com> Sender: linux-next-owner@vger.kernel.org To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next List-Id: iommu@lists.linux-foundation.org On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote: > On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > > jglisse@redhat.com wrote: > > > > > > > From: Jérôme Glisse [...] 
> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > > --- a/mm/huge_memory.c > > > > > > > +++ b/mm/huge_memory.c > > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > > goto out_free_pages; > > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > > + * device seeing memory write in different order than CPU. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > > pmd_t _pmd; > > > > > > > int i; > > > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > > + * protected page. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? 
> > > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > > of a device against a process address space. So the hardware should > > > > > not care about the page size. > > > > > > > > > > > > > Does that not indicate how much the device can access? Could it try > > > > to access more than what is mapped? > > > > > > Assuming device has huge TLB and 2MB huge page with 4K small page. > > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > > > each covering 4K. Both case is read only and both case are pointing > > > to same data (ie zero). > > > > > > It is fine to delay the TLB invalidate on the device to the call of > > > mmu_notifier_invalidate_range_end(). The device will keep using the > > > huge TLB for a little longer but both CPU and device are looking at > > > same data. > > > > > > Now if there is a racing thread that replace one of the 512 zeor page > > > after the split but before mmu_notifier_invalidate_range_end() that > > > code path would call mmu_notifier_invalidate_range() before changing > > > the pte to point to something else. Which should shoot down the device > > > TLB (it would be a serious device bug if this did not work). > > > > OK.. This seems reasonable, but I'd really like to see if it can be > > tested > > Well hard to test, many factors first each device might react differently. > Device that only store TLB at 4k granularity are fine. Clever device that > can store TLB for 4k, 2M, ... can ignore an invalidation that is smaller > than their TLB entry ie getting a 4K invalidation would not invalidate a > 2MB TLB entry in the device. I consider this as buggy. I will go look at > the PCIE ATS specification one more time and see if there is any wording > related that. I might bring up a question to the PCIE standard body if not. So inside PCIE ATS there is the definition of "minimum translation or invalidate size" which says 4096 bytes. 
So my understanding is that hardware must support 4K invalidation in all cases and thus we should be safe from the possible hazard above.

But nonetheless i will repost without the optimization for huge page to be more conservative, as anyway we want to be correct before we care about the last bit of optimization.

Cheers,
Jérôme

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yGKtx5cdxzDrD6 for ; Tue, 17 Oct 2017 14:10:17 +1100 (AEDT) From: jglisse@redhat.com To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan ,
iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org Subject: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Date: Mon, 16 Oct 2017 23:10:02 -0400 Message-Id: <20171017031003.7481-2-jglisse@redhat.com> In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Jérôme Glisse This patch only affects users of mmu_notifier->invalidate_range callback which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ... and it is an optimization for those users. Everyone else is unaffected by it. When clearing a pte/pmd we are given a choice to notify the event under the page table lock (notify version of *_clear_flush helpers do call the mmu_notifier_invalidate_range). But that notification is not necessary in all cases. This patches remove almost all cases where it is useless to have a call to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end. It also adds documentation in all those cases explaining why. Below is a more in depth analysis of why this is fine to do this: For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a process virtual address space). There is only 2 cases when you need to notify those secondary TLB while holding page table lock when clearing a pte/pmd: A) page backing address is free before mmu_notifier_invalidate_range_end B) a page table entry is updated to point to a new page (COW, write fault on zero page, __replace_page(), ...) Case A is obvious you do not want to take the risk for the device to write to a page that might now be used by something completely different. Case B is more subtle. 
For correctness it requires the following sequence to happen: - take page table lock - clear page table entry and notify (pmd/pte_huge_clear_flush_notify()) - set page table entry to point to new page If clearing the page table entry is not followed by a notify before setting the new pte/pmd value then you can break memory model like C11 or C++11 for the device. Consider the following scenario (device use a feature similar to ATS/ PASID): Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume they are write protected for COW (other case of B apply too). [Time N] ----------------------------------------------------------------- CPU-thread-0 {try to write to addrA} CPU-thread-1 {try to write to addrB} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA and populate device TLB} DEV-thread-2 {read addrB and populate device TLB} [Time N+1] --------------------------------------------------------------- CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+2] --------------------------------------------------------------- CPU-thread-0 {COW_step1: {update page table point to new page for addrA}} CPU-thread-1 {COW_step1: {update page table point to new page for addrB}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {write to addrA which is a write to new page} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {} CPU-thread-3 {write to addrB which is a write to new page} DEV-thread-0 {} DEV-thread-2 {} [Time N+4] --------------------------------------------------------------- CPU-thread-0 {preempted} 
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+5] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA from old page} DEV-thread-2 {read addrB from new page} So here, because at time N+2 the clearing of the page table entry was not paired with a notification to invalidate the secondary TLB, the device sees the new value for addrB before seeing the new value for addrA. This breaks total memory ordering for the device. When changing a pte to write protect it, or to point to a new write protected page with the same content (KSM), it is ok to delay the invalidate_range callback to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing the page table update is preempted right after releasing the page table lock but before calling mmu_notifier_invalidate_range_end(). Changed since v1: - typos (thanks to Andrea) - Avoid unnecessary precaution in try_to_unmap() (Andrea) - Be more conservative in try_to_unmap_one() Signed-off-by: Jérôme Glisse Cc: Andrea Arcangeli Cc: Nadav Amit Cc: Linus Torvalds Cc: Andrew Morton Cc: Joerg Roedel Cc: Suravee Suthikulpanit Cc: David Woodhouse Cc: Alistair Popple Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Stephen Rothwell Cc: Andrew Donnellan Cc: iommu@lists.linux-foundation.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-next@vger.kernel.org --- Documentation/vm/mmu_notifier.txt | 93 +++++++++++++++++++++++++++++++++++++++ fs/dax.c | 9 +++- include/linux/mmu_notifier.h | 3 +- mm/huge_memory.c | 20 +++++++-- mm/hugetlb.c | 16 +++++-- mm/ksm.c | 15 ++++++- mm/rmap.c | 59 ++++++++++++++++++++++--- 7 files changed, 198 insertions(+), 17 deletions(-) create mode 100644 Documentation/vm/mmu_notifier.txt diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt new file mode 100644 index 
000000000000..23b462566bb7 --- /dev/null +++ b/Documentation/vm/mmu_notifier.txt @@ -0,0 +1,93 @@ +When do you need to notify inside page table lock ? + +When clearing a pte/pmd we are given a choice to notify the event through +(notify version of *_clear_flush call mmu_notifier_invalidate_range) under +the page table lock. But that notification is not necessary in all cases. + +For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use +thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a +process virtual address space). There is only 2 cases when you need to notify +those secondary TLB while holding page table lock when clearing a pte/pmd: + + A) page backing address is free before mmu_notifier_invalidate_range_end() + B) a page table entry is updated to point to a new page (COW, write fault + on zero page, __replace_page(), ...) + +Case A is obvious you do not want to take the risk for the device to write to +a page that might now be used by some completely different task. + +Case B is more subtle. For correctness it requires the following sequence to +happen: + - take page table lock + - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) + - set page table entry to point to new page + +If clearing the page table entry is not followed by a notify before setting +the new pte/pmd value then you can break memory model like C11 or C++11 for +the device. + +Consider the following scenario (device use a feature similar to ATS/PASID): + +Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume +they are write protected for COW (other case of B apply too). 
+ +[Time N] -------------------------------------------------------------------- +CPU-thread-0 {try to write to addrA} +CPU-thread-1 {try to write to addrB} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {read addrA and populate device TLB} +DEV-thread-2 {read addrB and populate device TLB} +[Time N+1] ------------------------------------------------------------------ +CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} +CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+2] ------------------------------------------------------------------ +CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} +CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+3] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {preempted} +CPU-thread-2 {write to addrA which is a write to new page} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+3] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {preempted} +CPU-thread-2 {} +CPU-thread-3 {write to addrB which is a write to new page} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+4] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+5] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {read addrA from old page} +DEV-thread-2 {read addrB from new page} + +So here because at time N+2 the clear page table entry was not pair with a +notification to invalidate the secondary TLB, the device see the new 
value for +addrB before seing the new value for addrA. This break total memory ordering +for the device. + +When changing a pte to write protect or to point to a new write protected page +with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range +call to mmu_notifier_invalidate_range_end() outside the page table lock. This +is true even if the thread doing the page table update is preempted right after +releasing page table lock but before call mmu_notifier_invalidate_range_end(). diff --git a/fs/dax.c b/fs/dax.c index f3a44a7c14b3..9ec797424e4f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -614,6 +614,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl)) continue; + /* + * No need to call mmu_notifier_invalidate_range() as we are + * downgrading page table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ if (pmdp) { #ifdef CONFIG_FS_DAX_PMD pmd_t pmd; @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, pmd = pmd_wrprotect(pmd); pmd = pmd_mkclean(pmd); set_pmd_at(vma->vm_mm, address, pmdp, pmd); - mmu_notifier_invalidate_range(vma->vm_mm, start, end); unlock_pmd: spin_unlock(ptl); #endif @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, pte = pte_wrprotect(pte); pte = pte_mkclean(pte); set_pte_at(vma->vm_mm, address, ptep, pte); - mmu_notifier_invalidate_range(vma->vm_mm, start, end); unlock_pte: pte_unmap_unlock(ptep, ptl); } diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 6866e8126982..49c925c96b8a 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -155,7 +155,8 @@ struct mmu_notifier_ops { * shared page-tables, it not necessary to implement the * invalidate_range_start()/end() notifiers, as * invalidate_range() alread catches the points in time when an - * external 
TLB range needs to be flushed. + * external TLB range needs to be flushed. For more in depth + * discussion on this see Documentation/vm/mmu_notifier.txt * * The invalidate_range() function is called under the ptl * spin-lock and not allowed to sleep. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c037d3d34950..ff5bc647b51d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, goto out_free_pages; VM_BUG_ON_PAGE(!PageHead(page), page); + /* + * Leave pmd empty until pte is filled note we must notify here as + * concurrent CPU thread might write to new page before the call to + * mmu_notifier_invalidate_range_end() happens which can lead to a + * device seeing memory write in different order than CPU. + * + * See Documentation/vm/mmu_notifier.txt + */ pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); - /* leave pmd empty until pte is filled */ pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); pmd_populate(vma->vm_mm, &_pmd, pgtable); @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, pmd_t _pmd; int i; - /* leave pmd empty until pte is filled */ - pmdp_huge_clear_flush_notify(vma, haddr, pmd); + /* + * Leave pmd empty until pte is filled note that it is fine to delay + * notification until mmu_notifier_invalidate_range_end() as we are + * replacing a zero pmd write protected page with a zero pte write + * protected page. 
+ * + * See Documentation/vm/mmu_notifier.txt + */ + pmdp_huge_clear_flush(vma, haddr, pmd); pgtable = pgtable_trans_huge_withdraw(mm, pmd); pmd_populate(mm, &_pmd, pgtable); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1768efa4c501..63a63f1b536c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); } else { if (cow) { + /* + * No need to notify as we are downgrading page + * table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ huge_ptep_set_wrprotect(src, addr, src_pte); - mmu_notifier_invalidate_range(src, mmun_start, - mmun_end); } entry = huge_ptep_get(src_pte); ptepage = pte_page(entry); @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * and that page table be reused and filled with junk. */ flush_hugetlb_tlb_range(vma, start, end); - mmu_notifier_invalidate_range(mm, start, end); + /* + * No need to call mmu_notifier_invalidate_range() we are downgrading + * page table protection not changing it to point to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ i_mmap_unlock_write(vma->vm_file->f_mapping); mmu_notifier_invalidate_range_end(mm, start, end); diff --git a/mm/ksm.c b/mm/ksm.c index 6cb60f46cce5..be8f4576f842 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, * So we clear the pte and flush the tlb before the check * this assure us that no O_DIRECT can happen after the check * or in the middle of the check. + * + * No need to notify as we are downgrading page table to read + * only not changing it to point to a new page. 
+ * + * See Documentation/vm/mmu_notifier.txt */ - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* * Check that no O_DIRECT or similar I/O is in progress on the * page @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, } flush_cache_page(vma, addr, pte_pfn(*ptep)); - ptep_clear_flush_notify(vma, addr, ptep); + /* + * No need to notify as we are replacing a read only page with another + * read only page with the same content. + * + * See Documentation/vm/mmu_notifier.txt + */ + ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); page_remove_rmap(page, false); diff --git a/mm/rmap.c b/mm/rmap.c index 061826278520..6b5a0f219ac0 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, #endif } - if (ret) { - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); + /* + * No need to call mmu_notifier_invalidate_range() as we are + * downgrading page table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ + if (ret) (*cleaned)++; - } } mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); + /* + * No need to invalidate here it will synchronize on + * against the special swap migration pte. + */ goto discard; } @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * will take care of the rest. 
*/ dec_mm_counter(mm, mm_counter(page)); + /* We have to invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); } else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { swp_entry_t entry; @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); + /* + * No need to invalidate here it will synchronize on + * against the special swap migration pte. + */ } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, WARN_ON_ONCE(1); ret = false; /* We have to invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); page_vma_mapped_walk_done(&pvmw); break; } @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* MADV_FREE page check */ if (!PageSwapBacked(page)) { if (!PageDirty(page)) { + /* Invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, + address, address + PAGE_SIZE); dec_mm_counter(mm, MM_ANONPAGES); goto discard; } @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); - } else + /* Invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); + } else { + /* + * We should not need to notify here as we reach this + * case only from freeze_page() itself only call from + * split_huge_page_to_list() so everything below must + * be true: + * - page is not anonymous + * - page is locked + * + * So as it is a locked file back page thus it can not + * be remove from the page cache and replace by 
a new + * page before mmu_notifier_invalidate_range_end so no + * concurrent thread might update its page table to + * point at new page while a device still is using this + * page. + * + * See Documentation/vm/mmu_notifier.txt + */ dec_mm_counter(mm, mm_counter_file(page)); + } discard: + /* + * No need to call mmu_notifier_invalidate_range() it has be + * done above for all cases requiring it to happen under page + * table lock before mmu_notifier_invalidate_range_end() + * + * See Documentation/vm/mmu_notifier.txt + */ page_remove_rmap(subpage, PageHuge(page)); put_page(page); - mmu_notifier_invalidate_range(mm, address, - address + PAGE_SIZE); } mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); -- 2.13.6 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yGKtz6gxvzDrD6 for ; Tue, 17 Oct 2017 14:10:19 +1100 (AEDT) From: jglisse@redhat.com To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= , Andrea Arcangeli , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org Subject: [PATCH 2/2] mm/mmu_notifier: avoid call to invalidate_range() in range_end() Date: Mon, 16 Oct 2017 23:10:03 -0400 Message-Id: <20171017031003.7481-3-jglisse@redhat.com> In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Jérôme Glisse This is an optimization patch that only affect mmu_notifier users which rely on the 
invalidate_range() callback. This patch avoids calling that callback twice in a row from inside __mmu_notifier_invalidate_range_end Existing pattern (before this patch): mmu_notifier_invalidate_range_start() pte/pmd/pud_clear_flush_notify() mmu_notifier_invalidate_range() mmu_notifier_invalidate_range_end() mmu_notifier_invalidate_range() New pattern (after this patch): mmu_notifier_invalidate_range_start() pte/pmd/pud_clear_flush_notify() mmu_notifier_invalidate_range() mmu_notifier_invalidate_range_only_end() We call the invalidate_range callback after clearing the page table under the page table lock and we skip the call to invalidate_range inside the __mmu_notifier_invalidate_range_end() function. Idea from Andrea Arcangeli Signed-off-by: Jérôme Glisse Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Joerg Roedel Cc: Suravee Suthikulpanit Cc: David Woodhouse Cc: Alistair Popple Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Stephen Rothwell Cc: Andrew Donnellan Cc: iommu@lists.linux-foundation.org Cc: linuxppc-dev@lists.ozlabs.org --- include/linux/mmu_notifier.h | 17 ++++++++++++++-- mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++++++++---- mm/memory.c | 6 +++++- mm/migrate.c | 15 ++++++++++++--- mm/mmu_notifier.c | 11 +++++++++-- 5 files changed, 83 insertions(+), 12 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 49c925c96b8a..6665c4624287 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -213,7 +213,8 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm, extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end); extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end); + unsigned long start, unsigned long end, + bool only_end); extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); @@ -267,7 +268,14 @@ static 
inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, unsigned long start, unsigned long end) { if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end); + __mmu_notifier_invalidate_range_end(mm, start, end, false); +} + +static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end, true); } static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, @@ -438,6 +446,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, { } +static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ff5bc647b51d..b2912305994f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1220,7 +1220,12 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, page_remove_rmap(page, true); spin_unlock(vmf->ptl); - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, + mmun_end); ret |= VM_FAULT_WRITE; put_page(page); @@ -1369,7 +1374,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) } spin_unlock(vmf->ptl); out_mn: - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. 
+ */ + mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, + mmun_end); out: return ret; out_unlock: @@ -2021,7 +2031,12 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, out: spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PUD_SIZE); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pudp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + + HPAGE_PUD_SIZE); } #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ @@ -2096,6 +2111,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR); return; } else if (is_huge_zero_pmd(*pmd)) { + /* + * FIXME: Do we want to invalidate secondary mmu by calling + * mmu_notifier_invalidate_range() see comments below inside + * __split_huge_pmd() ? + * + * We are going from a zero huge page write protected to zero + * small page also write protected so it does not seems useful + * to invalidate secondary mmu at this time. + */ return __split_huge_zero_page_pmd(vma, haddr, pmd); } @@ -2231,7 +2255,21 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, __split_huge_pmd_locked(vma, pmd, haddr, freeze); out: spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE); + /* + * No need to double call mmu_notifier->invalidate_range() callback. + * They are 3 cases to consider inside __split_huge_pmd_locked(): + * 1) pmdp_huge_clear_flush_notify() call invalidate_range() obvious + * 2) __split_huge_zero_page_pmd() read only zero page and any write + * fault will trigger a flush_notify before pointing to a new page + * (it is fine if the secondary mmu keeps pointing to the old zero + * page in the meantime) + * 3) Split a huge pmd into pte pointing to the same page. No need + * to invalidate secondary tlb entry they are all still valid. 
+ * any further changes to individual pte will notify. So no need + * to call mmu_notifier->invalidate_range() + */ + mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + + HPAGE_PMD_SIZE); } void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index 47cdf4e85c2d..8a0c410037d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2555,7 +2555,11 @@ static int wp_page_copy(struct vm_fault *vmf) put_page(new_page); pte_unmap_unlock(vmf->pte, vmf->ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above ptep_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); if (old_page) { /* * Don't let another task, with possibly unlocked vma, diff --git a/mm/migrate.c b/mm/migrate.c index e00814ca390e..2f0f8190cb6f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2088,7 +2088,11 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED); spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); /* Take an "isolate" reference and put new page on the LRU. */ get_page(new_page); @@ -2804,9 +2808,14 @@ static void migrate_vma_pages(struct migrate_vma *migrate) migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; } + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above ptep_clear_flush_notify() inside migrate_vma_insert_page() + * did already call it. 
+ */ if (notified) - mmu_notifier_invalidate_range_end(mm, mmu_start, - migrate->end); + mmu_notifier_invalidate_range_only_end(mm, mmu_start, + migrate->end); } /* diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 314285284e6e..96edb33fd09a 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -190,7 +190,9 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start); void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, + unsigned long end, + bool only_end) { struct mmu_notifier *mn; int id; @@ -204,8 +206,13 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, * subsystem registers either invalidate_range_start()/end() or * invalidate_range(), so this will be no additional overhead * (besides the pointer check). + * + * We skip call to invalidate_range() if we know it is safe ie + * call site use mmu_notifier_invalidate_range_only_end() which + * is safe to do when we know that a call to invalidate_range() + * already happen under page table lock. 
*/ - if (mn->ops->invalidate_range) + if (!only_end && mn->ops->invalidate_range) mn->ops->invalidate_range(mn, mm, start, end); if (mn->ops->invalidate_range_end) mn->ops->invalidate_range_end(mn, mm, start, end); -- 2.13.6 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-x244.google.com (mail-pf0-x244.google.com [IPv6:2607:f8b0:400e:c00::244]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yHYC92gCZzDqBd for ; Thu, 19 Oct 2017 13:43:33 +1100 (AEDT) Received: by mail-pf0-x244.google.com with SMTP id t188so5414711pfd.10 for ; Wed, 18 Oct 2017 19:43:33 -0700 (PDT) Date: Thu, 19 Oct 2017 13:43:19 +1100 From: Balbir Singh To: jglisse@redhat.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org Subject: Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback Message-ID: <20171019134319.1b856091@MiWiFi-R3-srv> In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Mon, 16 Oct 2017 23:10:01 -0400 jglisse@redhat.com wrote: > From: Jérôme Glisse > > (Andrew you already have v1 in your queue of patch 1, patch 2 is new, > i think you can drop it patch 1 v1 for v2, v2 is bit more conservative > and i fixed typos) > > All this only affect user of invalidate_range callback (at this time > CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in > drivers/iommu/amd_iommu_v2.c|intel-svm.c) > > This patchset remove useless 
double call to mmu_notifier->invalidate_range > callback wherever it is safe to do so. The first patch just remove useless > call As in an extra call? Where does that come from? > and add documentation explaining why it is safe to do so. The second > patch go further by introducing mmu_notifier_invalidate_range_only_end() > which skip callback to invalidate_range this can be done when clearing a > pte, pmd or pud with notification which call invalidate_range right after > clearing under the page table lock. > Balbir Singh. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-x241.google.com (mail-pf0-x241.google.com [IPv6:2607:f8b0:400e:c00::241]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yHYgX6FxMzDqkm for ; Thu, 19 Oct 2017 14:04:40 +1100 (AEDT) Received: by mail-pf0-x241.google.com with SMTP id b79so5454783pfk.5 for ; Wed, 18 Oct 2017 20:04:40 -0700 (PDT) Date: Thu, 19 Oct 2017 14:04:26 +1100 From: Balbir Singh To: jglisse@redhat.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171019140426.21f51957@MiWiFi-R3-srv> In-Reply-To: <20171017031003.7481-2-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Mon, 16 Oct 2017 23:10:02 -0400 jglisse@redhat.com wrote: > From: 
Jérôme Glisse > > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > if (pmdp) { > #ifdef CONFIG_FS_DAX_PMD > pmd_t pmd; > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pmd = pmd_wrprotect(pmd); > pmd = pmd_mkclean(pmd); > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > unlock_pmd: > spin_unlock(ptl); > #endif > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pte = pte_wrprotect(pte); > pte = pte_mkclean(pte); > set_pte_at(vma->vm_mm, address, ptep, pte); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); Ditto > unlock_pte: > pte_unmap_unlock(ptep, ptl); > } > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 6866e8126982..49c925c96b8a 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > * shared page-tables, it not necessary to implement the > * invalidate_range_start()/end() notifiers, as > * invalidate_range() alread catches the points in time when an > - * external TLB range needs to be flushed. > + * external TLB range needs to be flushed. For more in depth > + * discussion on this see Documentation/vm/mmu_notifier.txt > * > * The invalidate_range() function is called under the ptl > * spin-lock and not allowed to sleep. 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index c037d3d34950..ff5bc647b51d 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > goto out_free_pages; > VM_BUG_ON_PAGE(!PageHead(page), page); > > + /* > + * Leave pmd empty until pte is filled note we must notify here as > + * concurrent CPU thread might write to new page before the call to > + * mmu_notifier_invalidate_range_end() happens which can lead to a > + * device seeing memory write in different order than CPU. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > - /* leave pmd empty until pte is filled */ > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > pmd_populate(vma->vm_mm, &_pmd, pgtable); > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > pmd_t _pmd; > int i; > > - /* leave pmd empty until pte is filled */ > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > + /* > + * Leave pmd empty until pte is filled note that it is fine to delay > + * notification until mmu_notifier_invalidate_range_end() as we are > + * replacing a zero pmd write protected page with a zero pte write > + * protected page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + pmdp_huge_clear_flush(vma, haddr, pmd); Shouldn't the secondary TLB know if the page size changed? 
> =20 > pgtable =3D pgtable_trans_huge_withdraw(mm, pmd); > pmd_populate(mm, &_pmd, pgtable); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 1768efa4c501..63a63f1b536c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst,= struct mm_struct *src, > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > } else { > if (cow) { > + /* > + * No need to notify as we are downgrading page > + * table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > huge_ptep_set_wrprotect(src, addr, src_pte); OK.. so we could get write faults on write accesses from the device. > - mmu_notifier_invalidate_range(src, mmun_start, > - mmun_end); > } > entry =3D huge_ptep_get(src_pte); > ptepage =3D pte_page(entry); > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_= area_struct *vma, > * and that page table be reused and filled with junk. > */ > flush_hugetlb_tlb_range(vma, start, end); > - mmu_notifier_invalidate_range(mm, start, end); > + /* > + * No need to call mmu_notifier_invalidate_range() we are downgrading > + * page table protection not changing it to point to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > i_mmap_unlock_write(vma->vm_file->f_mapping); > mmu_notifier_invalidate_range_end(mm, start, end); > =20 > diff --git a/mm/ksm.c b/mm/ksm.c > index 6cb60f46cce5..be8f4576f842 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struc= t *vma, struct page *page, > * So we clear the pte and flush the tlb before the check > * this assure us that no O_DIRECT can happen after the check > * or in the middle of the check. > + * > + * No need to notify as we are downgrading page table to read > + * only not changing it to point to a new page. 
> + * > + * See Documentation/vm/mmu_notifier.txt > */ > - entry =3D ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > + entry =3D ptep_clear_flush(vma, pvmw.address, pvmw.pte); > /* > * Check that no O_DIRECT or similar I/O is in progress on the > * page > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma= , struct page *page, > } > =20 > flush_cache_page(vma, addr, pte_pfn(*ptep)); > - ptep_clear_flush_notify(vma, addr, ptep); > + /* > + * No need to notify as we are replacing a read only page with another > + * read only page with the same content. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + ptep_clear_flush(vma, addr, ptep); > set_pte_at_notify(mm, addr, ptep, newpte); > =20 > page_remove_rmap(page, false); > diff --git a/mm/rmap.c b/mm/rmap.c > index 061826278520..6b5a0f219ac0 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, str= uct vm_area_struct *vma, > #endif > } > =20 > - if (ret) { > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + if (ret) > (*cleaned)++; > - } > } > =20 > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, st= ruct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte =3D pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte. > + */ > goto discard; > } > =20 > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, str= uct vm_area_struct *vma, > * will take care of the rest. 
> */ > dec_mm_counter(mm, mm_counter(page)); > + /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > } else if (IS_ENABLED(CONFIG_MIGRATION) && > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > swp_entry_t entry; > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, st= ruct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte =3D pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte. > + */ > } else if (PageAnon(page)) { > swp_entry_t entry =3D { .val =3D page_private(subpage) }; > pte_t swp_pte; > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, str= uct vm_area_struct *vma, > WARN_ON_ONCE(1); > ret =3D false; > /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > page_vma_mapped_walk_done(&pvmw); > break; > } > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, str= uct vm_area_struct *vma, > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > if (!PageDirty(page)) { > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, > + address, address + PAGE_SIZE); > dec_mm_counter(mm, MM_ANONPAGES); > goto discard; > } > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, s= truct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte =3D pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > - } else > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > + } else { > + /* > + * We should not need to notify here as we reach this > + * case only from freeze_page() itself only call from > + * split_huge_page_to_list() so everything below must > + * be true: > + * - page is not anonymous > + * 
- page is locked > + * > + * So as it is a locked file back page thus it can not > + * be remove from the page cache and replace by a new > + * page before mmu_notifier_invalidate_range_end so no > + * concurrent thread might update its page table to > + * point at new page while a device still is using this > + * page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > dec_mm_counter(mm, mm_counter_file(page)); > + } > discard: > + /* > + * No need to call mmu_notifier_invalidate_range() it has be > + * done above for all cases requiring it to happen under page > + * table lock before mmu_notifier_invalidate_range_end() > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > page_remove_rmap(subpage, PageHuge(page)); > put_page(page); > - mmu_notifier_invalidate_range(mm, address, > - address + PAGE_SIZE); > } > =20 > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); Looking at the patchset, I understand the efficiency, but I am concerned with correctness. Balbir Singh. 

From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yHYlm10qwzDqBd; Thu, 19 Oct 2017 14:08:19 +1100 (AEDT)
Date: Wed, 18 Oct 2017 23:08:12 -0400
From: Jerome Glisse
To: Balbir Singh
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli, Andrew Morton, Joerg Roedel, Suravee Suthikulpanit, David Woodhouse, Alistair Popple, Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan, iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback
Message-ID: <20171019030812.GB5246@redhat.com>
References: <20171017031003.7481-1-jglisse@redhat.com> <20171019134319.1b856091@MiWiFi-R3-srv>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
In-Reply-To: <20171019134319.1b856091@MiWiFi-R3-srv>
List-Id: Linux on PowerPC Developers Mail List

On Thu, Oct 19, 2017 at 01:43:19PM +1100, Balbir Singh wrote:
> On Mon, 16 Oct 2017 23:10:01 -0400
> jglisse@redhat.com wrote:
>
> > From: Jérôme Glisse
> >
> > (Andrew you already have v1 in your queue of patch 1, patch 2 is new,
> > i think you can drop it patch 1 v1 for v2, v2 is bit more conservative
> > and i fixed typos)
> >
> > All this only affect user of invalidate_range callback (at this time
> > CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in
> > drivers/iommu/amd_iommu_v2.c|intel-svm.c)
> >
> > This patchset remove useless double call to mmu_notifier->invalidate_range
> > callback wherever it is safe to do so. The first patch just remove useless
> > call
>
> As in an extra call? Where does that come from?

Before this patch you had the following pattern:

  mmu_notifier_invalidate_range_start();
  take_page_table_lock()
  ...
  update_page_table()
  mmu_notifier_invalidate_range()
  ...
  drop_page_table_lock()
  mmu_notifier_invalidate_range_end();

It happens that mmu_notifier_invalidate_range_end() also makes an
unconditional call to mmu_notifier_invalidate_range(), so in the above
scenario you had 2 calls to mmu_notifier_invalidate_range().

Obviously one of the 2 calls is useless. In some cases you can drop the
first call (under the page table lock); this is what patch 1 does. In
other cases you can drop the second call, which happens inside
mmu_notifier_invalidate_range_end(); that is what patch 2 does.

Hence why i am referring to a useless double call. I have added more
documentation to explain all this in the code and also under
Documentation/vm/mmu_notifier.txt

>
> > and add documentation explaining why it is safe to do so. The second
> > patch go further by introducing mmu_notifier_invalidate_range_only_end()
> > which skip callback to invalidate_range this can be done when clearing a
> > pte, pmd or pud with notification which call invalidate_range right after
> > clearing under the page table lock.
> >
>
> Balbir Singh.
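
For readers following along, the double call described in this message
can be mimicked with a small user-space toy model. This is only a sketch
under stated assumptions: struct toy_notifier and every toy_* function
are hypothetical names made up for illustration, not the kernel API. The
real mmu_notifier functions take an mm and an address range and run the
registered callbacks; the model below only counts how many times the
invalidate_range callback would fire for one page table update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered notifier; not a kernel struct. */
struct toy_notifier {
	int invalidate_range_calls;	/* times the callback has fired */
};

/* Models the ->invalidate_range() callback (secondary TLB flush). */
static void toy_invalidate_range(struct toy_notifier *n)
{
	assert(n != NULL);
	n->invalidate_range_calls++;
}

/* Models range_end(): per the message above it also invalidates. */
static void toy_invalidate_range_end(struct toy_notifier *n)
{
	toy_invalidate_range(n);
}

/*
 * Models the range_only_end() variant from patch 2: it skips
 * invalidate_range(). Whatever else the real function does is
 * left out of this toy.
 */
static void toy_invalidate_range_only_end(struct toy_notifier *n)
{
	(void)n;
}

/*
 * One page table update. Returns how many invalidate_range callbacks
 * fired. notify_under_lock: invalidate while the (imaginary) page
 * table lock is held. only_end: finish with the only_end variant.
 */
static int toy_update(bool notify_under_lock, bool only_end)
{
	struct toy_notifier n = { 0 };

	/* range_start(), lock and page table update would happen here */
	if (notify_under_lock)
		toy_invalidate_range(&n);
	/* the page table lock would be dropped here */
	if (only_end)
		toy_invalidate_range_only_end(&n);
	else
		toy_invalidate_range_end(&n);
	return n.invalidate_range_calls;
}
```

Under these assumptions toy_update(true, false) models the old pattern
and counts 2 callbacks, toy_update(false, false) models what patch 1
does and counts 1, and toy_update(true, true) models what patch 2 does
and also counts 1.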

From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yHZBs38wTzDq5f; Thu, 19 Oct 2017 14:28:21 +1100 (AEDT)
Date: Wed, 18 Oct 2017 23:28:12 -0400
From: Jerome Glisse
To: Balbir Singh
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli, Nadav Amit, Linus Torvalds, Andrew Morton, Joerg Roedel, Suravee Suthikulpanit, David Woodhouse, Alistair Popple, Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan, iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org
Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
Message-ID: <20171019032811.GC5246@redhat.com>
References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
In-Reply-To: <20171019140426.21f51957@MiWiFi-R3-srv>
List-Id: Linux on PowerPC Developers Mail List

On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> On Mon, 16 Oct 2017 23:10:02 -0400
> jglisse@redhat.com wrote:
>
> > From: Jérôme Glisse
> >
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() as we are
> > + * downgrading page table protection not changing it to point
> > + * to a new page.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > if (pmdp) {
> > #ifdef CONFIG_FS_DAX_PMD
> > pmd_t pmd;
> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > pmd = pmd_wrprotect(pmd);
> > pmd = pmd_mkclean(pmd);
> > set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>
> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?

I am assuming the hardware does the sane thing and sets the dirty bit
only while walking the CPU page table on a device write fault, i.e. once
the device gets a writable TLB entry, the dirty bit has already been set
by the IOMMU during the page table walk, before the lookup result is
returned to the device, and it won't be set again later (i.e. propagated
back later).

I should probably have spelled that out, and maybe some of the ATS/PASID
implementers did not do that.

>
> > unlock_pmd:
> > spin_unlock(ptl);
> > #endif
> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > pte = pte_wrprotect(pte);
> > pte = pte_mkclean(pte);
> > set_pte_at(vma->vm_mm, address, ptep, pte);
> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>
> Ditto
>
> > unlock_pte:
> > pte_unmap_unlock(ptep, ptl);
> > }
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 6866e8126982..49c925c96b8a 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> > * shared page-tables, it not necessary to implement the
> > * invalidate_range_start()/end() notifiers, as
> > * invalidate_range() alread catches the points in time when an
> > - * external TLB range needs to be flushed.
> > + * external TLB range needs to be flushed. For more in depth
> > + * discussion on this see Documentation/vm/mmu_notifier.txt
> > *
> > * The invalidate_range() function is called under the ptl
> > * spin-lock and not allowed to sleep.
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c037d3d34950..ff5bc647b51d 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > goto out_free_pages;
> > VM_BUG_ON_PAGE(!PageHead(page), page);
> >
> > + /*
> > + * Leave pmd empty until pte is filled note we must notify here as
> > + * concurrent CPU thread might write to new page before the call to
> > + * mmu_notifier_invalidate_range_end() happens which can lead to a
> > + * device seeing memory write in different order than CPU.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > - /* leave pmd empty until pte is filled */
> >
> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > pmd_t _pmd;
> > int i;
> >
> > - /* leave pmd empty until pte is filled */
> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > + /*
> > + * Leave pmd empty until pte is filled note that it is fine to delay
> > + * notification until mmu_notifier_invalidate_range_end() as we are
> > + * replacing a zero pmd write protected page with a zero pte write
> > + * protected page.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > + pmdp_huge_clear_flush(vma, haddr, pmd);
>
> Shouldn't the secondary TLB know if the page size changed?

It should not matter, we are talking about virtual to physical
translation on behalf of a device against a process address space, so
the hardware should not care about the page size.

Moreover, if any of the new 512 zero 4K pages (assuming 2MB huge pages
and 4K pages) is replaced by something new, then a device TLB shootdown
will happen before the new page is set.

The only issue i can think of is if the IOMMU TLB (if there is one) or
the device TLB (you do expect that there is one) does not invalidate a
TLB entry when the TLB shootdown range is smaller than the TLB entry.
That would be idiotic, but yes, i know, hardware bugs exist.

>
> >
> > pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > pmd_populate(mm, &_pmd, pgtable);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1768efa4c501..63a63f1b536c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > } else {
> > if (cow) {
> > + /*
> > + * No need to notify as we are downgrading page
> > + * table protection not changing it to point
> > + * to a new page.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > huge_ptep_set_wrprotect(src, addr, src_pte);
>
> OK.. so we could get write faults on write accesses from the device.
>
> > - mmu_notifier_invalidate_range(src, mmun_start,
> > - mmun_end);
> > }
> > entry = huge_ptep_get(src_pte);
> > ptepage = pte_page(entry);
> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > * and that page table be reused and filled with junk.
> > */
> > flush_hugetlb_tlb_range(vma, start, end);
> > - mmu_notifier_invalidate_range(mm, start, end);
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() we are downgrading
> > + * page table protection not changing it to point to a new page.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > i_mmap_unlock_write(vma->vm_file->f_mapping);
> > mmu_notifier_invalidate_range_end(mm, start, end);
> >
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 6cb60f46cce5..be8f4576f842 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> > * So we clear the pte and flush the tlb before the check
> > * this assure us that no O_DIRECT can happen after the check
> > * or in the middle of the check.
> > + *
> > + * No need to notify as we are downgrading page table to read
> > + * only not changing it to point to a new page.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > */
> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> > /*
> > * Check that no O_DIRECT or similar I/O is in progress on the
> > * page
> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> > }
> >
> > flush_cache_page(vma, addr, pte_pfn(*ptep));
> > - ptep_clear_flush_notify(vma, addr, ptep);
> > + /*
> > + * No need to notify as we are replacing a read only page with another
> > + * read only page with the same content.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > + ptep_clear_flush(vma, addr, ptep);
> > set_pte_at_notify(mm, addr, ptep, newpte);
> >
> > page_remove_rmap(page, false);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 061826278520..6b5a0f219ac0 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > #endif
> > }
> >
> > - if (ret) {
> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() as we are
> > + * downgrading page table protection not changing it to point
> > + * to a new page.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > + if (ret)
> > (*cleaned)++;
> > - }
> > }
> >
> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > if (pte_soft_dirty(pteval))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > + /*
> > + * No need to invalidate here it will synchronize on
> > + * against the special swap migration pte.
> > + */
> > goto discard;
> > }
> >
> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > * will take care of the rest.
> > */
> > dec_mm_counter(mm, mm_counter(page));
> > + /* We have to invalidate as we cleared the pte */
> > + mmu_notifier_invalidate_range(mm, address,
> > + address + PAGE_SIZE);
> > } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> > swp_entry_t entry;
> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > if (pte_soft_dirty(pteval))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > set_pte_at(mm, address, pvmw.pte, swp_pte);
> > + /*
> > + * No need to invalidate here it will synchronize on
> > + * against the special swap migration pte.
> > + */
> > } else if (PageAnon(page)) {
> > swp_entry_t entry = { .val = page_private(subpage) };
> > pte_t swp_pte;
> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > WARN_ON_ONCE(1);
> > ret = false;
> > /* We have to invalidate as we cleared the pte */
> > + mmu_notifier_invalidate_range(mm, address,
> > + address + PAGE_SIZE);
> > page_vma_mapped_walk_done(&pvmw);
> > break;
> > }
> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > /* MADV_FREE page check */
> > if (!PageSwapBacked(page)) {
> > if (!PageDirty(page)) {
> > + /* Invalidate as we cleared the pte */
> > + mmu_notifier_invalidate_range(mm,
> > + address, address + PAGE_SIZE);
> > dec_mm_counter(mm, MM_ANONPAGES);
> > goto discard;
> > }
> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > if (pte_soft_dirty(pteval))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > set_pte_at(mm, address, pvmw.pte, swp_pte);
> > - } else
> > + /* Invalidate as we cleared the pte */
> > + mmu_notifier_invalidate_range(mm, address,
> > + address + PAGE_SIZE);
> > + } else {
> > + /*
> > + * We should not need to notify here as we reach this
> > + * case only from freeze_page() itself only call from
> > + * split_huge_page_to_list() so everything below must
> > + * be true:
> > + * - page is not anonymous
> > + * - page is locked
> > + *
> > + * So as it is a locked file back page thus it can not
> > + * be remove from the page cache and replace by a new
> > + * page before mmu_notifier_invalidate_range_end so no
> > + * concurrent thread might update its page table to
> > + * point at new page while a device still is using this
> > + * page.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > dec_mm_counter(mm, mm_counter_file(page));
> > + }
> > discard:
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() it has be
> > + * done above for all cases requiring it to happen under page
> > + * table lock before mmu_notifier_invalidate_range_end()
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > page_remove_rmap(subpage, PageHuge(page));
> > put_page(page);
> > - mmu_notifier_invalidate_range(mm, address,
> > - address + PAGE_SIZE);
> > }
> >
> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>
> Looking at the patchset, I understand the efficiency, but I am concerned
> with correctness.

I am fine with holding this off from reaching Linus, but the only way to
flush these issues out, if there are any, is to have this patch in
linux-next or somewhere where it gets a chance of being tested.

Note that the second patch is always safe. I agree that this one might
not be if the hardware implementation is idiotic (well, that would be my
opinion, and any opinion/point of view can be challenged :))

>
> Balbir Singh.

From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail-vk0-x243.google.com (mail-vk0-x243.google.com [IPv6:2607:f8b0:400c:c05::243]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yHm4C1XKjzDqBd; Thu, 19 Oct 2017 21:53:14 +1100 (AEDT)
Received: by mail-vk0-x243.google.com with SMTP id q80so5089139vka.7; Thu, 19 Oct 2017 03:53:14 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20171019032811.GC5246@redhat.com>
References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com>
From: Balbir Singh
Date: Thu, 19 Oct 2017 21:53:11 +1100
Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
To: Jerome Glisse
Cc: linux-mm, "linux-kernel@vger.kernel.org", Andrea Arcangeli, Nadav Amit, Linus Torvalds, Andrew Morton, Joerg Roedel, Suravee Suthikulpanit, David Woodhouse, Alistair Popple, Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan, iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)", linux-next
Content-Type: text/plain; charset="UTF-8"
List-Id: Linux on PowerPC Developers Mail List

On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote:
> On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
>> On Mon, 16 Oct 2017 23:10:02 -0400
>> jglisse@redhat.com wrote:
>>
>> > From: Jérôme Glisse
>> >
>> > + /*
>> > + * No need to call mmu_notifier_invalidate_range() as we are
>> > + * downgrading page table protection not changing it to point
>> > + * to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > if (pmdp) {
>> > #ifdef CONFIG_FS_DAX_PMD
>> > pmd_t pmd;
>> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> > pmd = pmd_wrprotect(pmd);
>> > pmd = pmd_mkclean(pmd);
>> > set_pmd_at(vma->vm_mm, address, pmdp, pmd);
>> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
>
> I am assuming hardware does sane thing of setting the dirty bit only
> when walking the CPU page table when device does a write fault ie
> once the device get a write TLB entry the dirty is set by the IOMMU
> when walking the page table before returning the lookup result to the
> device and that it won't be set again latter (ie propagated back
> latter).
>

The other possibility is that the hardware thinks the page is writable
and already marked dirty. It then allows writes and does not set the
dirty bit?

> I should probably have spell that out and maybe some of the ATS/PASID
> implementer did not do that.
>
>>
>> > unlock_pmd:
>> > spin_unlock(ptl);
>> > #endif
>> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> > pte = pte_wrprotect(pte);
>> > pte = pte_mkclean(pte);
>> > set_pte_at(vma->vm_mm, address, ptep, pte);
>> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Ditto
>>
>> > unlock_pte:
>> > pte_unmap_unlock(ptep, ptl);
>> > }
>> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> > index 6866e8126982..49c925c96b8a 100644
>> > --- a/include/linux/mmu_notifier.h
>> > +++ b/include/linux/mmu_notifier.h
>> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
>> > * shared page-tables, it not necessary to implement the
>> > * invalidate_range_start()/end() notifiers, as
>> > * invalidate_range() alread catches the points in time when an
>> > - * external TLB range needs to be flushed.
>> > + * external TLB range needs to be flushed. For more in depth
>> > + * discussion on this see Documentation/vm/mmu_notifier.txt
>> > *
>> > * The invalidate_range() function is called under the ptl
>> > * spin-lock and not allowed to sleep.
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index c037d3d34950..ff5bc647b51d 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>> > goto out_free_pages;
>> > VM_BUG_ON_PAGE(!PageHead(page), page);
>> >
>> > + /*
>> > + * Leave pmd empty until pte is filled note we must notify here as
>> > + * concurrent CPU thread might write to new page before the call to
>> > + * mmu_notifier_invalidate_range_end() happens which can lead to a
>> > + * device seeing memory write in different order than CPU.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
>> > - /* leave pmd empty until pte is filled */
>> >
>> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
>> > pmd_populate(vma->vm_mm, &_pmd, pgtable);
>> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> > pmd_t _pmd;
>> > int i;
>> >
>> > - /* leave pmd empty until pte is filled */
>> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>> > + /*
>> > + * Leave pmd empty until pte is filled note that it is fine to delay
>> > + * notification until mmu_notifier_invalidate_range_end() as we are
>> > + * replacing a zero pmd write protected page with a zero pte write
>> > + * protected page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > + pmdp_huge_clear_flush(vma, haddr, pmd);
>>
>> Shouldn't the secondary TLB know if the page size changed?
>
> It should not matter, we are talking virtual to physical on behalf
> of a device against a process address space. So the hardware should
> not care about the page size.
>

Does that not indicate how much the device can access? Could it try
to access more than what is mapped?

> Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> 4K pages is replace by something new then a device TLB shootdown will
> happen before the new page is set.
>
> Only issue i can think of is if the IOMMU TLB (if there is one) or
> the device TLB (you do expect that there is one) does not invalidate
> TLB entry if the TLB shootdown is smaller than the TLB entry. That
> would be idiotic but yes i know hardware bug.
>
>
>>
>> >
>> > pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> > pmd_populate(mm, &_pmd, pgtable);
>> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > index 1768efa4c501..63a63f1b536c 100644
>> > --- a/mm/hugetlb.c
>> > +++ b/mm/hugetlb.c
>> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
>> > } else {
>> > if (cow) {
>> > + /*
>> > + * No need to notify as we are downgrading page
>> > + * table protection not changing it to point
>> > + * to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > huge_ptep_set_wrprotect(src, addr, src_pte);
>>
>> OK.. so we could get write faults on write accesses from the device.
>>
>> > - mmu_notifier_invalidate_range(src, mmun_start,
>> > - mmun_end);
>> > }
>> > entry = huge_ptep_get(src_pte);
>> > ptepage = pte_page(entry);
>> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>> > * and that page table be reused and filled with junk.
>> > */
>> > flush_hugetlb_tlb_range(vma, start, end);
>> > - mmu_notifier_invalidate_range(mm, start, end);
>> > + /*
>> > + * No need to call mmu_notifier_invalidate_range() we are downgrading
>> > + * page table protection not changing it to point to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > i_mmap_unlock_write(vma->vm_file->f_mapping);
>> > mmu_notifier_invalidate_range_end(mm, start, end);
>> >
>> > diff --git a/mm/ksm.c b/mm/ksm.c
>> > index 6cb60f46cce5..be8f4576f842 100644
>> > --- a/mm/ksm.c
>> > +++ b/mm/ksm.c
>> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>> > * So we clear the pte and flush the tlb before the check
>> > * this assure us that no O_DIRECT can happen after the check
>> > * or in the middle of the check.
>> > + *
>> > + * No need to notify as we are downgrading page table to read
>> > + * only not changing it to point to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > */
>> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
>> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>> > /*
>> > * Check that no O_DIRECT or similar I/O is in progress on the
>> > * page
>> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>> > }
>> >
>> > flush_cache_page(vma, addr, pte_pfn(*ptep));
>> > - ptep_clear_flush_notify(vma, addr, ptep);
>> > + /*
>> > + * No need to notify as we are replacing a read only page with another
>> > + * read only page with the same content.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > + ptep_clear_flush(vma, addr, ptep);
>> > set_pte_at_notify(mm, addr, ptep, newpte);
>> >
>> > page_remove_rmap(page, false);
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index 061826278520..6b5a0f219ac0 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>> > #endif
>> > }
>> >
>> > - if (ret) {
>> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>> > + /*
>> > + * No need to call mmu_notifier_invalidate_range() as we are
>> > + * downgrading page table protection not changing it to point
>> > + * to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > + if (ret)
>> > (*cleaned)++;
>> > - }
>> > }
>> >
>> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > if (pte_soft_dirty(pteval))
>> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
>> > + /*
>> > + * No need to invalidate here it will synchronize on
>> > + * against the special swap migration pte.
>> > + */
>> > goto discard;
>> > }
>> >
>> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > * will take care of the rest.
>> > */
>> > dec_mm_counter(mm, mm_counter(page));
>> > + /* We have to invalidate as we cleared the pte */
>> > + mmu_notifier_invalidate_range(mm, address,
>> > + address + PAGE_SIZE);
>> > } else if (IS_ENABLED(CONFIG_MIGRATION) &&
>> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
>> > swp_entry_t entry;
>> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > if (pte_soft_dirty(pteval))
>> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> > set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > + /*
>> > + * No need to invalidate here it will synchronize on
>> > + * against the special swap migration pte.
>> > + */
>> > } else if (PageAnon(page)) {
>> > swp_entry_t entry = { .val = page_private(subpage) };
>> > pte_t swp_pte;
>> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > WARN_ON_ONCE(1);
>> > ret = false;
>> > /* We have to invalidate as we cleared the pte */
>> > + mmu_notifier_invalidate_range(mm, address,
>> > + address + PAGE_SIZE);
>> > page_vma_mapped_walk_done(&pvmw);
>> > break;
>> > }
>> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > /* MADV_FREE page check */
>> > if (!PageSwapBacked(page)) {
>> > if (!PageDirty(page)) {
>> > + /* Invalidate as we cleared the pte */
>> > + mmu_notifier_invalidate_range(mm,
>> > + address, address + PAGE_SIZE);
>> > dec_mm_counter(mm, MM_ANONPAGES);
>> > goto discard;
>> > }
>> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > if (pte_soft_dirty(pteval))
>> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> > set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > - } else
>> > + /* Invalidate as we cleared the pte */
>> > + mmu_notifier_invalidate_range(mm, address,
>> > + address + PAGE_SIZE);
>> > + } else {
>> > + /*
>> > + * We should not need to notify here as we reach this
>> > + * case only from freeze_page() itself only call f= rom >> > + * split_huge_page_to_list() so everything below m= ust >> > + * be true: >> > + * - page is not anonymous >> > + * - page is locked >> > + * >> > + * So as it is a locked file back page thus it can= not >> > + * be remove from the page cache and replace by a = new >> > + * page before mmu_notifier_invalidate_range_end s= o no >> > + * concurrent thread might update its page table t= o >> > + * point at new page while a device still is using= this >> > + * page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > dec_mm_counter(mm, mm_counter_file(page)); >> > + } >> > discard: >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() it has = be >> > + * done above for all cases requiring it to happen under p= age >> > + * table lock before mmu_notifier_invalidate_range_end() >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > page_remove_rmap(subpage, PageHuge(page)); >> > put_page(page); >> > - mmu_notifier_invalidate_range(mm, address, >> > - address + PAGE_SIZE); >> > } >> > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); >> >> Looking at the patchset, I understand the efficiency, but I am concerned >> with correctness. > > I am fine in holding this off from reaching Linus but only way to flush t= his > issues out if any is to have this patch in linux-next or somewhere were t= hey > get a chance of being tested. > Yep, I would like to see some additional testing around npu and get Alistai= r Popple to comment as well > Note that the second patch is always safe. I agree that this one might > not be if hardware implementation is idiotic (well that would be my > opinion and any opinion/point of view can be challenge :)) You mean the only_end variant that avoids shootdown after pmd/pte changes that avoid the _start/_end and have just the only_end variant? 
That seemed reasonable to me, but I've not tested it or evaluated it in depth Balbir Singh. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yHw9h2YmvzDqG6 for ; Fri, 20 Oct 2017 03:58:31 +1100 (AEDT) Date: Thu, 19 Oct 2017 12:58:23 -0400 From: Jerome Glisse To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171019165823.GA3044@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 In-Reply-To: List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > >> On Mon, 16 Oct 2017 23:10:02 -0400 > >> jglisse@redhat.com wrote: > >> > >> > From: Jérôme Glisse > >> > > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page. 
> >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > if (pmdp) { > >> > #ifdef CONFIG_FS_DAX_PMD > >> > pmd_t pmd; > >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pmd = pmd_wrprotect(pmd); > >> > pmd = pmd_mkclean(pmd); > >> > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > I am assuming hardware does sane thing of setting the dirty bit only > > when walking the CPU page table when device does a write fault ie > > once the device get a write TLB entry the dirty is set by the IOMMU > > when walking the page table before returning the lookup result to the > > device and that it won't be set again latter (ie propagated back > > latter). > > > > The other possibility is that the hardware things the page is writable > and already > marked dirty. It allows writes and does not set the dirty bit?

I thought about this some more, and the patch cannot regress anything that is not already broken today. If we assume that a device can propagate the dirty bit because it can cache the write permission, then all current code is already broken, for two reasons:

First, the current code clears the pte entry, builds a new pte value with write protection, and updates the pte entry with the new value. So any PASID/ATS platform that allows a device to cache the write bit and set the dirty bit at any time afterwards can race during that window, and you would lose the dirty bit from the device. That is not that bad, as you are going to propagate the dirty bit to the struct page.

Second, the dirty bit may be propagated back to the new write-protected pte. From a quick look at the code, it seems that when we zap a pte or mkclean we do not check that the pte has write permission but only care about the dirty bit. So it should not have any bad consequences.
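To make the second case concrete, here is a minimal toy model in C. It is only a sketch under stated assumptions: the toy_ names and bit values are made up for illustration and are not the kernel's pte helpers or any architecture's pte layout. It assumes a device that sets the dirty bit late through a stale writable TLB entry, and shows that a dirty-only check still picks up that late dirty bit even though the pte is already write protected.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model only: these names and bit values are hypothetical,
 * not the kernel's pte helpers or any real pte layout.
 */
typedef uint32_t toy_pte_t;

#define TOY_PTE_WRITE (1u << 0)
#define TOY_PTE_DIRTY (1u << 1)

/* CPU side: downgrade the entry to read only. */
static inline toy_pte_t toy_wrprotect(toy_pte_t pte)
{
	return pte & ~TOY_PTE_WRITE;
}

/* CPU side: clear the dirty bit. */
static inline toy_pte_t toy_mkclean(toy_pte_t pte)
{
	return pte & ~TOY_PTE_DIRTY;
}

/*
 * Assumed device behaviour: a stale writable TLB entry lets the
 * device set the dirty bit after the CPU already write protected
 * the entry. It never restores the write bit.
 */
static inline toy_pte_t toy_device_late_dirty(toy_pte_t pte)
{
	return pte | TOY_PTE_DIRTY;
}

/* Dirty-only check: looks at the dirty bit, never at the write bit. */
static inline bool toy_pte_dirty(toy_pte_t pte)
{
	return (pte & TOY_PTE_DIRTY) != 0;
}

static inline bool toy_pte_write(toy_pte_t pte)
{
	return (pte & TOY_PTE_WRITE) != 0;
}
```

In this model a pte that went through toy_wrprotect() and toy_mkclean() and then received a late device dirty bit is reported dirty by toy_pte_dirty() while toy_pte_write() stays false, which is the situation described in the second case.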
After this patch only the second window is bigger and thus more likely to happen. But nothing sinister should happen from that. > > > I should probably have spell that out and maybe some of the ATS/PASID > > implementer did not do that. > > > >> > >> > unlock_pmd: > >> > spin_unlock(ptl); > >> > #endif > >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pte = pte_wrprotect(pte); > >> > pte = pte_mkclean(pte); > >> > set_pte_at(vma->vm_mm, address, ptep, pte); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Ditto > >> > >> > unlock_pte: > >> > pte_unmap_unlock(ptep, ptl); > >> > } > >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > >> > index 6866e8126982..49c925c96b8a 100644 > >> > --- a/include/linux/mmu_notifier.h > >> > +++ b/include/linux/mmu_notifier.h > >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > >> > * shared page-tables, it not necessary to implement the > >> > * invalidate_range_start()/end() notifiers, as > >> > * invalidate_range() alread catches the points in time when an > >> > - * external TLB range needs to be flushed. > >> > + * external TLB range needs to be flushed. For more in depth > >> > + * discussion on this see Documentation/vm/mmu_notifier.txt > >> > * > >> > * The invalidate_range() function is called under the ptl > >> > * spin-lock and not allowed to sleep. 
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> > index c037d3d34950..ff5bc647b51d 100644 > >> > --- a/mm/huge_memory.c > >> > +++ b/mm/huge_memory.c > >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > >> > goto out_free_pages; > >> > VM_BUG_ON_PAGE(!PageHead(page), page); > >> > > >> > + /* > >> > + * Leave pmd empty until pte is filled note we must notify here as > >> > + * concurrent CPU thread might write to new page before the call to > >> > + * mmu_notifier_invalidate_range_end() happens which can lead to a > >> > + * device seeing memory write in different order than CPU. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > >> > - /* leave pmd empty until pte is filled */ > >> > > >> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > >> > pmd_populate(vma->vm_mm, &_pmd, pgtable); > >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > >> > pmd_t _pmd; > >> > int i; > >> > > >> > - /* leave pmd empty until pte is filled */ > >> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > >> > + /* > >> > + * Leave pmd empty until pte is filled note that it is fine to delay > >> > + * notification until mmu_notifier_invalidate_range_end() as we are > >> > + * replacing a zero pmd write protected page with a zero pte write > >> > + * protected page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + pmdp_huge_clear_flush(vma, haddr, pmd); > >> > >> Shouldn't the secondary TLB know if the page size changed? > > > > It should not matter, we are talking virtual to physical on behalf > > of a device against a process address space. So the hardware should > > not care about the page size. > > > > Does that not indicate how much the device can access? Could it try > to access more than what is mapped? 
Assume the device has a huge TLB, with 2MB huge pages and 4K small pages. You are going from one TLB entry covering a 2MB zero page to 512 TLB entries each covering 4K. Both cases are read only and both point to the same data (i.e. zero).

It is fine to delay the TLB invalidation on the device until the call to mmu_notifier_invalidate_range_end(). The device will keep using the huge TLB entry for a little longer, but both CPU and device are looking at the same data.

Now if a racing thread replaces one of the 512 zero pages after the split but before mmu_notifier_invalidate_range_end(), that code path would call mmu_notifier_invalidate_range() before changing the pte to point to something else, which should shoot down the device TLB (it would be a serious device bug if this did not work).

> > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > 4K pages is replace by something new then a device TLB shootdown will > > happen before the new page is set. > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > the device TLB (you do expect that there is one) does not invalidate > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > would be idiotic but yes i know hardware bug. > > > > > >> > >> > > >> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > >> > pmd_populate(mm, &_pmd, pgtable); > >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >> > index 1768efa4c501..63a63f1b536c 100644 > >> > --- a/mm/hugetlb.c > >> > +++ b/mm/hugetlb.c > >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > >> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > >> > } else { > >> > if (cow) { > >> > + /* > >> > + * No need to notify as we are downgrading page > >> > + * table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > huge_ptep_set_wrprotect(src, addr, src_pte); > >> > >> OK..
so we could get write faults on write accesses from the device. > >> > >> > - mmu_notifier_invalidate_range(src, mmun_start, > >> > - mmun_end); > >> > } > >> > entry = huge_ptep_get(src_pte); > >> > ptepage = pte_page(entry); > >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > >> > * and that page table be reused and filled with junk. > >> > */ > >> > flush_hugetlb_tlb_range(vma, start, end); > >> > - mmu_notifier_invalidate_range(mm, start, end); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() we are downgrading > >> > + * page table protection not changing it to point to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > i_mmap_unlock_write(vma->vm_file->f_mapping); > >> > mmu_notifier_invalidate_range_end(mm, start, end); > >> > > >> > diff --git a/mm/ksm.c b/mm/ksm.c > >> > index 6cb60f46cce5..be8f4576f842 100644 > >> > --- a/mm/ksm.c > >> > +++ b/mm/ksm.c > >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > >> > * So we clear the pte and flush the tlb before the check > >> > * this assure us that no O_DIRECT can happen after the check > >> > * or in the middle of the check. > >> > + * > >> > + * No need to notify as we are downgrading page table to read > >> > + * only not changing it to point to a new page. 
> >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > */ > >> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > >> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > >> > /* > >> > * Check that no O_DIRECT or similar I/O is in progress on the > >> > * page > >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > >> > } > >> > > >> > flush_cache_page(vma, addr, pte_pfn(*ptep)); > >> > - ptep_clear_flush_notify(vma, addr, ptep); > >> > + /* > >> > + * No need to notify as we are replacing a read only page with another > >> > + * read only page with the same content. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + ptep_clear_flush(vma, addr, ptep); > >> > set_pte_at_notify(mm, addr, ptep, newpte); > >> > > >> > page_remove_rmap(page, false); > >> > diff --git a/mm/rmap.c b/mm/rmap.c > >> > index 061826278520..6b5a0f219ac0 100644 > >> > --- a/mm/rmap.c > >> > +++ b/mm/rmap.c > >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > >> > #endif > >> > } > >> > > >> > - if (ret) { > >> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + if (ret) > >> > (*cleaned)++; > >> > - } > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. 
> >> > + */ > >> > goto discard; > >> > } > >> > > >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > * will take care of the rest. > >> > */ > >> > dec_mm_counter(mm, mm_counter(page)); > >> > + /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > } else if (IS_ENABLED(CONFIG_MIGRATION) && > >> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > >> > swp_entry_t entry; > >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. > >> > + */ > >> > } else if (PageAnon(page)) { > >> > swp_entry_t entry = { .val = page_private(subpage) }; > >> > pte_t swp_pte; > >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > WARN_ON_ONCE(1); > >> > ret = false; > >> > /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > page_vma_mapped_walk_done(&pvmw); > >> > break; > >> > } > >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > /* MADV_FREE page check */ > >> > if (!PageSwapBacked(page)) { > >> > if (!PageDirty(page)) { > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, > >> > + address, address + PAGE_SIZE); > >> > dec_mm_counter(mm, MM_ANONPAGES); > >> > goto discard; > >> > } > >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, 
pvmw.pte, swp_pte); > >> > - } else > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > + } else { > >> > + /* > >> > + * We should not need to notify here as we reach this > >> > + * case only from freeze_page() itself only call from > >> > + * split_huge_page_to_list() so everything below must > >> > + * be true: > >> > + * - page is not anonymous > >> > + * - page is locked > >> > + * > >> > + * So as it is a locked file back page thus it can not > >> > + * be remove from the page cache and replace by a new > >> > + * page before mmu_notifier_invalidate_range_end so no > >> > + * concurrent thread might update its page table to > >> > + * point at new page while a device still is using this > >> > + * page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > dec_mm_counter(mm, mm_counter_file(page)); > >> > + } > >> > discard: > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() it has be > >> > + * done above for all cases requiring it to happen under page > >> > + * table lock before mmu_notifier_invalidate_range_end() > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > page_remove_rmap(subpage, PageHuge(page)); > >> > put_page(page); > >> > - mmu_notifier_invalidate_range(mm, address, > >> > - address + PAGE_SIZE); > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > >> Looking at the patchset, I understand the efficiency, but I am concerned > >> with correctness. > > > > I am fine in holding this off from reaching Linus but only way to flush this > > issues out if any is to have this patch in linux-next or somewhere were they > > get a chance of being tested. > > > > Yep, I would like to see some additional testing around npu and get Alistair > Popple to comment as well I think this patch is fine. 
The only race window that it might make bigger should have no bad consequences.

> > > Note that the second patch is always safe. I agree that this one might > > not be if hardware implementation is idiotic (well that would be my > > opinion and any opinion/point of view can be challenge :)) > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > that avoid the _start/_end and have just the only_end variant? That seemed > reasonable to me, but I've not tested it or evaluated it in depth

Yes, patch 2/2 in this series is definitely fine. It invalidates the device TLB right after clearing the pte entry and avoids a later unnecessary invalidation of the same TLB.

Jérôme

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-x244.google.com (mail-pf0-x244.google.com [IPv6:2607:f8b0:400e:c00::244]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yJsM1131yzDq5v for ; Sat, 21 Oct 2017 16:54:53 +1100 (AEDT) Received: by mail-pf0-x244.google.com with SMTP id d28so13572625pfe.2 for ; Fri, 20 Oct 2017 22:54:52 -0700 (PDT) Message-ID: <1508565280.5662.6.camel@gmail.com> Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 From: Balbir Singh To: Jerome Glisse Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next Date: Sat, 21 Oct 2017 16:54:40 +1100 In-Reply-To: <20171019165823.GA3044@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com>
Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > jglisse@redhat.com wrote: > > > > > > > > > From: Jérôme Glisse > > > > > > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > + * downgrading page table protection not changing it to point > > > > > + * to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > if (pmdp) { > > > > > #ifdef CONFIG_FS_DAX_PMD > > > > > pmd_t pmd; > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > pmd = pmd_wrprotect(pmd); > > > > > pmd = pmd_mkclean(pmd); > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > > > I am assuming hardware does sane thing of setting the dirty bit only > > > when walking the CPU page table when device does a write fault ie > > > once the device get a write TLB entry the dirty is set by the IOMMU > > > when walking the page table before returning the lookup result to the > > > device and that it won't be set again latter (ie propagated back > > > latter). > > > > > > > The other possibility is that the hardware things the page is writable > > and already > > marked dirty. It allows writes and does not set the dirty bit? > > I thought about this some more and the patch can not regress anything > that is not broken today. 
So if we assume that device can propagate > dirty bit because it can cache the write protection than all current > code is broken for two reasons: > > First one is current code clear pte entry, build a new pte value with > write protection and update pte entry with new pte value. So any PASID/ > ATS platform that allows device to cache the write bit and set dirty > bit anytime after that can race during that window and you would loose > the dirty bit of the device. That is not that bad as you are gonna > propagate the dirty bit to the struct page.

But they stay consistent with the notifiers, so from the OS perspective it notifies of any PTE changes as they happen. When the ATS platform sees an invalidation, it invalidates its PTEs as well.

I was speaking of the case where the ATS platform assumes it has write access and has not seen any invalidation: the OS could return to user space or the caller with the write bit clear, but the ATS platform could still do a write since it has not seen the invalidation.

> > Second one is if the dirty bit is propagated back to the new write > protected pte. Quick look at code it seems that when we zap pte or > or mkclean we don't check that the pte has write permission but only > care about the dirty bit. So it should not have any bad consequence. > > After this patch only the second window is bigger and thus more likely > to happen. But nothing sinister should happen from that. > > > > > > > I should probably have spell that out and maybe some of the ATS/PASID > > > implementer did not do that.
> > > > > > > > > > > > unlock_pmd: > > > > > spin_unlock(ptl); > > > > > #endif > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > pte = pte_wrprotect(pte); > > > > > pte = pte_mkclean(pte); > > > > > set_pte_at(vma->vm_mm, address, ptep, pte); > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > Ditto > > > > > > > > > unlock_pte: > > > > > pte_unmap_unlock(ptep, ptl); > > > > > } > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > > > > index 6866e8126982..49c925c96b8a 100644 > > > > > --- a/include/linux/mmu_notifier.h > > > > > +++ b/include/linux/mmu_notifier.h > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > > > > * shared page-tables, it not necessary to implement the > > > > > * invalidate_range_start()/end() notifiers, as > > > > > * invalidate_range() alread catches the points in time when an > > > > > - * external TLB range needs to be flushed. > > > > > + * external TLB range needs to be flushed. For more in depth > > > > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > > > > * > > > > > * The invalidate_range() function is called under the ptl > > > > > * spin-lock and not allowed to sleep. > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > --- a/mm/huge_memory.c > > > > > +++ b/mm/huge_memory.c > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > goto out_free_pages; > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > + /* > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > + * device seeing memory write in different order than CPU. 
> > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > pmd_t _pmd; > > > > > int i; > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > + /* > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > + * protected page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > It should not matter, we are talking virtual to physical on behalf > > > of a device against a process address space. So the hardware should > > > not care about the page size. > > > > > > > Does that not indicate how much the device can access? Could it try > > to access more than what is mapped? > > Assuming device has huge TLB and 2MB huge page with 4K small page. > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > each covering 4K. Both case is read only and both case are pointing > to same data (ie zero). > > It is fine to delay the TLB invalidate on the device to the call of > mmu_notifier_invalidate_range_end(). The device will keep using the > huge TLB for a little longer but both CPU and device are looking at > same data. 
> > Now if there is a racing thread that replace one of the 512 zeor page > after the split but before mmu_notifier_invalidate_range_end() that > code path would call mmu_notifier_invalidate_range() before changing > the pte to point to something else. Which should shoot down the device > TLB (it would be a serious device bug if this did not work). OK.. This seems reasonable, but I'd really like to see if it can be tested > > > > > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > > 4K pages is replace by something new then a device TLB shootdown will > > > happen before the new page is set. > > > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > > the device TLB (you do expect that there is one) does not invalidate > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > > would be idiotic but yes i know hardware bug. > > > > > > > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > > > > pmd_populate(mm, &_pmd, pgtable); > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > index 1768efa4c501..63a63f1b536c 100644 > > > > > --- a/mm/hugetlb.c > > > > > +++ b/mm/hugetlb.c > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > > > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > > > > } else { > > > > > if (cow) { > > > > > + /* > > > > > + * No need to notify as we are downgrading page > > > > > + * table protection not changing it to point > > > > > + * to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > huge_ptep_set_wrprotect(src, addr, src_pte); > > > > > > > > OK.. so we could get write faults on write accesses from the device. 
> > > > > > > > > - mmu_notifier_invalidate_range(src, mmun_start, > > > > > - mmun_end); > > > > > } > > > > > entry = huge_ptep_get(src_pte); > > > > > ptepage = pte_page(entry); > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > > > > * and that page table be reused and filled with junk. > > > > > */ > > > > > flush_hugetlb_tlb_range(vma, start, end); > > > > > - mmu_notifier_invalidate_range(mm, start, end); > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > > > > + * page table protection not changing it to point to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > > > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > > > > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > > > > index 6cb60f46cce5..be8f4576f842 100644 > > > > > --- a/mm/ksm.c > > > > > +++ b/mm/ksm.c > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > > > > * So we clear the pte and flush the tlb before the check > > > > > * this assure us that no O_DIRECT can happen after the check > > > > > * or in the middle of the check. > > > > > + * > > > > > + * No need to notify as we are downgrading page table to read > > > > > + * only not changing it to point to a new page. 
> > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > */ > > > > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > > > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > > > > /* > > > > > * Check that no O_DIRECT or similar I/O is in progress on the > > > > > * page > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > > > > } > > > > > > > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > > > > - ptep_clear_flush_notify(vma, addr, ptep); > > > > > + /* > > > > > + * No need to notify as we are replacing a read only page with another > > > > > + * read only page with the same content. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + ptep_clear_flush(vma, addr, ptep); > > > > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > > > > > > > page_remove_rmap(page, false); > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > > > index 061826278520..6b5a0f219ac0 100644 > > > > > --- a/mm/rmap.c > > > > > +++ b/mm/rmap.c > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > > > > #endif > > > > > } > > > > > > > > > > - if (ret) { > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > + * downgrading page table protection not changing it to point > > > > > + * to a new page. 
> > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + if (ret) > > > > > (*cleaned)++; > > > > > - } > > > > > } > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > > > > + /* > > > > > + * No need to invalidate here it will synchronize on > > > > > + * against the special swap migration pte. > > > > > + */ > > > > > goto discard; > > > > > } > > > > > > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > * will take care of the rest. > > > > > */ > > > > > dec_mm_counter(mm, mm_counter(page)); > > > > > + /* We have to invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > > > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > > > > swp_entry_t entry; > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > + /* > > > > > + * No need to invalidate here it will synchronize on > > > > > + * against the special swap migration pte. 
> > > > > + */ > > > > > } else if (PageAnon(page)) { > > > > > swp_entry_t entry = { .val = page_private(subpage) }; > > > > > pte_t swp_pte; > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > WARN_ON_ONCE(1); > > > > > ret = false; > > > > > /* We have to invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > page_vma_mapped_walk_done(&pvmw); > > > > > break; > > > > > } > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > /* MADV_FREE page check */ > > > > > if (!PageSwapBacked(page)) { > > > > > if (!PageDirty(page)) { > > > > > + /* Invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, > > > > > + address, address + PAGE_SIZE); > > > > > dec_mm_counter(mm, MM_ANONPAGES); > > > > > goto discard; > > > > > } > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > - } else > > > > > + /* Invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > + } else { > > > > > + /* > > > > > + * We should not need to notify here as we reach this > > > > > + * case only from freeze_page() itself only call from > > > > > + * split_huge_page_to_list() so everything below must > > > > > + * be true: > > > > > + * - page is not anonymous > > > > > + * - page is locked > > > > > + * > > > > > + * So as it is a locked file back page thus it can not > > > > > + * be remove from the page cache and replace by a new > > > > > + * page before mmu_notifier_invalidate_range_end so no > > > > > + * concurrent thread might update its page table to > > > > > + * point at new 
page while a device still is using this > > > > > + * page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > dec_mm_counter(mm, mm_counter_file(page)); > > > > > + } > > > > > discard: > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() it has be > > > > > + * done above for all cases requiring it to happen under page > > > > > + * table lock before mmu_notifier_invalidate_range_end() > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > page_remove_rmap(subpage, PageHuge(page)); > > > > > put_page(page); > > > > > - mmu_notifier_invalidate_range(mm, address, > > > > > - address + PAGE_SIZE); > > > > > } > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > > > Looking at the patchset, I understand the efficiency, but I am concerned > > > > with correctness. > > > > > > I am fine in holding this off from reaching Linus but only way to flush this > > > issues out if any is to have this patch in linux-next or somewhere were they > > > get a chance of being tested. > > > > > > > Yep, I would like to see some additional testing around npu and get Alistair > > Popple to comment as well > > I think this patch is fine. The only one race window that it might make > bigger should have no bad consequences. > > > > > > Note that the second patch is always safe. I agree that this one might > > > not be if hardware implementation is idiotic (well that would be my > > > opinion and any opinion/point of view can be challenge :)) > > > > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > > that avoid the _start/_end and have just the only_end variant? That seemed > > reasonable to me, but I've not tested it or evaluated it in depth > > Yes, patch 2/2 in this serie is definitly fine. It invalidate the device > TLB right after clearing pte entry and avoid latter unecessary invalidation > of same TLB. 
> > Jérôme Balbir Singh. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yK6VS2zrvzDq5v for ; Sun, 22 Oct 2017 02:47:10 +1100 (AEDT) Date: Sat, 21 Oct 2017 11:47:03 -0400 From: Jerome Glisse To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171021154703.GA30458@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> <1508565280.5662.6.camel@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 In-Reply-To: <1508565280.5662.6.camel@gmail.com> List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > jglisse@redhat.com wrote: > > > > > > > > > > > From: Jérôme Glisse > > > > > > > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not 
changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > if (pmdp) { > > > > > > #ifdef CONFIG_FS_DAX_PMD > > > > > > pmd_t pmd; > > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pmd = pmd_wrprotect(pmd); > > > > > > pmd = pmd_mkclean(pmd); > > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > > > > > I am assuming hardware does sane thing of setting the dirty bit only > > > > when walking the CPU page table when device does a write fault ie > > > > once the device get a write TLB entry the dirty is set by the IOMMU > > > > when walking the page table before returning the lookup result to the > > > > device and that it won't be set again latter (ie propagated back > > > > latter). > > > > > > > > > > The other possibility is that the hardware things the page is writable > > > and already > > > marked dirty. It allows writes and does not set the dirty bit? > > > > I thought about this some more and the patch can not regress anything > > that is not broken today. So if we assume that device can propagate > > dirty bit because it can cache the write protection than all current > > code is broken for two reasons: > > > > First one is current code clear pte entry, build a new pte value with > > write protection and update pte entry with new pte value. So any PASID/ > > ATS platform that allows device to cache the write bit and set dirty > > bit anytime after that can race during that window and you would loose > > the dirty bit of the device. That is not that bad as you are gonna > > propagate the dirty bit to the struct page. 
> >
> But they stay consistent with the notifiers, so from the OS perspective
> it notifies of any PTE changes as they happen. When the ATS platform sees
> invalidation, it invalidates it's PTE's as well.
>
> I was speaking of the case where the ATS platform could assume it has
> write access and has not seen any invalidation, the OS could return
> back to user space or the caller with write bit clear, but the ATS
> platform could still do a write since it's not seen the invalidation.

I understood what you said, and what I wrote above still applies. I am
removing only one of the invalidations, not both. With this patch the
invalidation is delayed until after the page table lock is dropped, but it
still happens before dax/page_mkclean returns. Hence any further activity
will be read-only on any device too once we exit those functions.

The only difference is the window during which a device can report a dirty
pte. Before this patch the two "~bogus~" windows were small:

  First window: between pmd/pte_get_clear_flush and set_pte/pmd
  Second window: between set_pte/pmd and mmu_notifier_invalidate_range

The first window stays the same. The second window is bigger, potentially a
lot bigger if the thread is preempted before
mmu_notifier_invalidate_range_end(). But that is fine: in that case the page
is reported as dirty, so we are not missing anything, and the kernel code
does not care about seeing a read-only pte marked as dirty.

>
> >
> > Second one is if the dirty bit is propagated back to the new write
> > protected pte. Quick look at code it seems that when we zap pte or
> > or mkclean we don't check that the pte has write permission but only
> > care about the dirty bit. So it should not have any bad consequence.
> >
> > After this patch only the second window is bigger and thus more likely
> > to happen. But nothing sinister should happen from that.
> > > > > > >
> > > > I should probably have spell that out and maybe some of the ATS/PASID
> > > > implementer did not do that.
> > > > > > > > > > > > > > > unlock_pmd: > > > > > > spin_unlock(ptl); > > > > > > #endif > > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pte = pte_wrprotect(pte); > > > > > > pte = pte_mkclean(pte); > > > > > > set_pte_at(vma->vm_mm, address, ptep, pte); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Ditto > > > > > > > > > > > unlock_pte: > > > > > > pte_unmap_unlock(ptep, ptl); > > > > > > } > > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > > > > > index 6866e8126982..49c925c96b8a 100644 > > > > > > --- a/include/linux/mmu_notifier.h > > > > > > +++ b/include/linux/mmu_notifier.h > > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > > > > > * shared page-tables, it not necessary to implement the > > > > > > * invalidate_range_start()/end() notifiers, as > > > > > > * invalidate_range() alread catches the points in time when an > > > > > > - * external TLB range needs to be flushed. > > > > > > + * external TLB range needs to be flushed. For more in depth > > > > > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > > > > > * > > > > > > * The invalidate_range() function is called under the ptl > > > > > > * spin-lock and not allowed to sleep. 
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > --- a/mm/huge_memory.c > > > > > > +++ b/mm/huge_memory.c > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > goto out_free_pages; > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > + * device seeing memory write in different order than CPU. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > pmd_t _pmd; > > > > > > int i; > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > + * protected page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > of a device against a process address space. 
So the hardware should
> > > > not care about the page size.
> > > >
> > >
> > > Does that not indicate how much the device can access? Could it try
> > > to access more than what is mapped?
> >
> > Assuming device has huge TLB and 2MB huge page with 4K small page.
> > You are going from one 1 TLB covering a 2MB zero page to 512 TLB
> > each covering 4K. Both case is read only and both case are pointing
> > to same data (ie zero).
> >
> > It is fine to delay the TLB invalidate on the device to the call of
> > mmu_notifier_invalidate_range_end(). The device will keep using the
> > huge TLB for a little longer but both CPU and device are looking at
> > same data.
> >
> > Now if there is a racing thread that replace one of the 512 zeor page
> > after the split but before mmu_notifier_invalidate_range_end() that
> > code path would call mmu_notifier_invalidate_range() before changing
> > the pte to point to something else. Which should shoot down the device
> > TLB (it would be a serious device bug if this did not work).
>
> OK.. This seems reasonable, but I'd really like to see if it can be
> tested

Well, it is hard to test because several factors are involved. First, each
device might react differently. Devices that only store TLB entries at 4K
granularity are fine. A clever device that can store TLB entries for 4K, 2M,
... might ignore an invalidation that is smaller than its TLB entry, i.e. a
4K invalidation would not invalidate a 2MB TLB entry in the device. I
consider this buggy. I will go look at the PCIE ATS specification one more
time and see if there is any wording related to that. I might bring up a
question to the PCIE standard body if not.

The second factor is that it is a race between the zero page split and a
write fault. I can probably do a crappy patch that calls msleep() if a split
happens against a given mm to increase the race window.
But i would be testing against one device (right now i can only access AMD IOMMUv2 devices with discret ATS GPU) > > > > > > > > > > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > > > 4K pages is replace by something new then a device TLB shootdown will > > > > happen before the new page is set. > > > > > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > > > the device TLB (you do expect that there is one) does not invalidate > > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > > > would be idiotic but yes i know hardware bug. > > > > > > > > > > > > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > > > > > pmd_populate(mm, &_pmd, pgtable); > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > > index 1768efa4c501..63a63f1b536c 100644 > > > > > > --- a/mm/hugetlb.c > > > > > > +++ b/mm/hugetlb.c > > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > > > > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > > > > > } else { > > > > > > if (cow) { > > > > > > + /* > > > > > > + * No need to notify as we are downgrading page > > > > > > + * table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > huge_ptep_set_wrprotect(src, addr, src_pte); > > > > > > > > > > OK.. so we could get write faults on write accesses from the device. > > > > > > > > > > > - mmu_notifier_invalidate_range(src, mmun_start, > > > > > > - mmun_end); > > > > > > } > > > > > > entry = huge_ptep_get(src_pte); > > > > > > ptepage = pte_page(entry); > > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > > > > > * and that page table be reused and filled with junk. 
> > > > > > */ > > > > > > flush_hugetlb_tlb_range(vma, start, end); > > > > > > - mmu_notifier_invalidate_range(mm, start, end); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > > > > > + * page table protection not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > > > > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > > > > > > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > > > > > index 6cb60f46cce5..be8f4576f842 100644 > > > > > > --- a/mm/ksm.c > > > > > > +++ b/mm/ksm.c > > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > > > > > * So we clear the pte and flush the tlb before the check > > > > > > * this assure us that no O_DIRECT can happen after the check > > > > > > * or in the middle of the check. > > > > > > + * > > > > > > + * No need to notify as we are downgrading page table to read > > > > > > + * only not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > */ > > > > > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > > > > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > > > > > /* > > > > > > * Check that no O_DIRECT or similar I/O is in progress on the > > > > > > * page > > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > > > > > } > > > > > > > > > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > > > > > - ptep_clear_flush_notify(vma, addr, ptep); > > > > > > + /* > > > > > > + * No need to notify as we are replacing a read only page with another > > > > > > + * read only page with the same content. 
> > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + ptep_clear_flush(vma, addr, ptep); > > > > > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > > > > > > > > > page_remove_rmap(page, false); > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > > > > index 061826278520..6b5a0f219ac0 100644 > > > > > > --- a/mm/rmap.c > > > > > > +++ b/mm/rmap.c > > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > > > > > #endif > > > > > > } > > > > > > > > > > > > - if (ret) { > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + if (ret) > > > > > > (*cleaned)++; > > > > > > - } > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. > > > > > > + */ > > > > > > goto discard; > > > > > > } > > > > > > > > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > * will take care of the rest. 
> > > > > > */ > > > > > > dec_mm_counter(mm, mm_counter(page)); > > > > > > + /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > > > > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > > > > > swp_entry_t entry; > > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. > > > > > > + */ > > > > > > } else if (PageAnon(page)) { > > > > > > swp_entry_t entry = { .val = page_private(subpage) }; > > > > > > pte_t swp_pte; > > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > WARN_ON_ONCE(1); > > > > > > ret = false; > > > > > > /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > page_vma_mapped_walk_done(&pvmw); > > > > > > break; > > > > > > } > > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > /* MADV_FREE page check */ > > > > > > if (!PageSwapBacked(page)) { > > > > > > if (!PageDirty(page)) { > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, > > > > > > + address, address + PAGE_SIZE); > > > > > > dec_mm_counter(mm, MM_ANONPAGES); > > > > > > goto discard; > > > > > > } > > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > 
set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > - } else > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > + } else { > > > > > > + /* > > > > > > + * We should not need to notify here as we reach this > > > > > > + * case only from freeze_page() itself only call from > > > > > > + * split_huge_page_to_list() so everything below must > > > > > > + * be true: > > > > > > + * - page is not anonymous > > > > > > + * - page is locked > > > > > > + * > > > > > > + * So as it is a locked file back page thus it can not > > > > > > + * be remove from the page cache and replace by a new > > > > > > + * page before mmu_notifier_invalidate_range_end so no > > > > > > + * concurrent thread might update its page table to > > > > > > + * point at new page while a device still is using this > > > > > > + * page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > dec_mm_counter(mm, mm_counter_file(page)); > > > > > > + } > > > > > > discard: > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() it has be > > > > > > + * done above for all cases requiring it to happen under page > > > > > > + * table lock before mmu_notifier_invalidate_range_end() > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > page_remove_rmap(subpage, PageHuge(page)); > > > > > > put_page(page); > > > > > > - mmu_notifier_invalidate_range(mm, address, > > > > > > - address + PAGE_SIZE); > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > > > > > Looking at the patchset, I understand the efficiency, but I am concerned > > > > > with correctness. 
> > > > > > > > I am fine in holding this off from reaching Linus but only way to flush this > > > > issues out if any is to have this patch in linux-next or somewhere were they > > > > get a chance of being tested. > > > > > > > > > > Yep, I would like to see some additional testing around npu and get Alistair > > > Popple to comment as well > > > > I think this patch is fine. The only one race window that it might make > > bigger should have no bad consequences. > > > > > > > > > Note that the second patch is always safe. I agree that this one might > > > > not be if hardware implementation is idiotic (well that would be my > > > > opinion and any opinion/point of view can be challenge :)) > > > > > > > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > > > that avoid the _start/_end and have just the only_end variant? That seemed > > > reasonable to me, but I've not tested it or evaluated it in depth > > > > Yes, patch 2/2 in this serie is definitly fine. It invalidate the device > > TLB right after clearing pte entry and avoid latter unecessary invalidation > > of same TLB. > > > > Jérôme > > Balbir Singh. 
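
For readers following the 2MB versus 4K discussion above, here is a minimal
userspace sketch of the two device behaviours being contrasted. It is only a
model under stated assumptions: struct dev_tlb_entry, dev_tlb_invalidate()
and dev_tlb_invalidate_buggy() are hypothetical names invented for this
illustration, not kernel or IOMMU driver code, and no real device is claimed
to behave like either function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL

/* Hypothetical model of one cached device TLB entry (not a kernel type). */
struct dev_tlb_entry {
	uint64_t va;    /* first virtual address covered by the entry */
	uint64_t size;  /* bytes covered by the entry: 4K, 2M, ... */
	bool valid;
};

/* Number of 4K entries that replace one 2MB entry after a split. */
static uint64_t small_entries_per_huge(void)
{
	return SZ_2M / SZ_4K;
}

/*
 * Behaviour the thread relies on: invalidating [start, end) drops any
 * entry that overlaps the range, even when the range is smaller than
 * the entry itself. Returns true if the entry was dropped.
 */
static bool dev_tlb_invalidate(struct dev_tlb_entry *e,
			       uint64_t start, uint64_t end)
{
	assert(end > start);
	if (!e->valid)
		return false;
	if (e->va < end && start < e->va + e->size) {
		e->valid = false;
		return true;
	}
	return false;
}

/*
 * Behaviour described as buggy in the thread: the entry is dropped only
 * when the invalidated range covers it completely, so a 4K invalidation
 * leaves a stale 2MB entry behind.
 */
static bool dev_tlb_invalidate_buggy(struct dev_tlb_entry *e,
				     uint64_t start, uint64_t end)
{
	assert(end > start);
	if (!e->valid)
		return false;
	if (start <= e->va && e->va + e->size <= end) {
		e->valid = false;
		return true;
	}
	return false;
}
```

With a 2MB entry at some aligned address, a 4K invalidation inside it drops
the entry under the first model and leaves it valid under the second, which
is the hazard the delayed notification in __split_huge_zero_page_pmd() would
expose on such hardware.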
> From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3yLSnm69FyzDqjm for ; Tue, 24 Oct 2017 07:35:08 +1100 (AEDT) Date: Mon, 23 Oct 2017 16:35:01 -0400 From: Jerome Glisse To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171023203501.GA9371@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> <1508565280.5662.6.camel@gmail.com> <20171021154703.GA30458@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 In-Reply-To: <20171021154703.GA30458@redhat.com> List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote: > On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > > jglisse@redhat.com wrote: > > > > > > > From: Jérôme Glisse [...] 
> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > > --- a/mm/huge_memory.c > > > > > > > +++ b/mm/huge_memory.c > > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > > goto out_free_pages; > > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > > + * device seeing memory write in different order than CPU. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > > pmd_t _pmd; > > > > > > > int i; > > > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > > + * protected page. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? 
> > > > >
> > > > > It should not matter, we are talking virtual to physical on behalf
> > > > > of a device against a process address space. So the hardware should
> > > > > not care about the page size.
> > > > >
> > > >
> > > > Does that not indicate how much the device can access? Could it try
> > > > to access more than what is mapped?
> > >
> > > Assuming device has huge TLB and 2MB huge page with 4K small page.
> > > You are going from one TLB entry covering a 2MB zero page to 512 TLB
> > > entries each covering 4K. Both cases are read only and both cases are
> > > pointing to same data (ie zero).
> > >
> > > It is fine to delay the TLB invalidate on the device to the call of
> > > mmu_notifier_invalidate_range_end(). The device will keep using the
> > > huge TLB for a little longer but both CPU and device are looking at
> > > same data.
> > >
> > > Now if there is a racing thread that replaces one of the 512 zero pages
> > > after the split but before mmu_notifier_invalidate_range_end(), that
> > > code path would call mmu_notifier_invalidate_range() before changing
> > > the pte to point to something else. Which should shoot down the device
> > > TLB (it would be a serious device bug if this did not work).
> >
> > OK.. This seems reasonable, but I'd really like to see if it can be
> > tested
>
> Well hard to test, many factors first each device might react differently.
> Device that only store TLB at 4k granularity are fine. Clever device that
> can store TLB for 4k, 2M, ... can ignore an invalidation that is smaller
> than their TLB entry ie getting a 4K invalidation would not invalidate a
> 2MB TLB entry in the device. I consider this as buggy. I will go look at
> the PCIE ATS specification one more time and see if there is any wording
> related to that. I might bring up a question to the PCIE standard body if not.

So inside PCIE ATS there is the definition of "minimum translation or
invalidate size" which says 4096 bytes.
So my understanding is that hardware must support 4K invalidation in all
cases, and thus we should be safe from the possible hazard above.

Nonetheless, I will repost without the optimization for huge pages to be
more conservative, as we want to be correct before we care about the last
bit of optimization.

Cheers,
Jérôme

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-qk0-f198.google.com (mail-qk0-f198.google.com [209.85.220.198]) by kanga.kvack.org (Postfix) with ESMTP id C03696B0253 for ; Mon, 16 Oct 2017 23:10:17 -0400 (EDT)
Received: by mail-qk0-f198.google.com with SMTP id o187so513294qke.1 for ; Mon, 16 Oct 2017 20:10:17 -0700 (PDT)
Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id r20si4331qke.267.2017.10.16.20.10.16 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 16 Oct 2017 20:10:16 -0700 (PDT)
From: jglisse@redhat.com
Subject: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
Date: Mon, 16 Oct 2017 23:10:02 -0400
Message-Id: <20171017031003.7481-2-jglisse@redhat.com>
In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com>
References: <20171017031003.7481-1-jglisse@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Jérôme Glisse, Andrea Arcangeli,
 Nadav Amit, Linus Torvalds, Andrew Morton, Joerg Roedel,
 Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
 Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
 Andrew Donnellan, iommu@lists.linux-foundation.org,
 linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org

From: Jérôme Glisse

This patch only affects users of the mmu_notifier->invalidate_range callback,
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ... and it
is an optimization for those users. Everyone else is unaffected by it.

When clearing a pte/pmd we are given a choice to notify the event under the
page table lock (the notify versions of the *_clear_flush helpers do call
mmu_notifier_invalidate_range()). But that notification is not necessary in
all cases.
This patch removes almost all cases where it is useless to call
mmu_notifier_invalidate_range() before mmu_notifier_invalidate_range_end().
It also adds documentation in all those cases explaining why.

Below is a more in-depth analysis of why this is fine to do:

For secondary TLBs (non-CPU TLBs) like an IOMMU TLB or a device TLB (when a
device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only 2 cases
when you need to notify those secondary TLBs while holding the page table
lock when clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to
a page that might now be used by something completely different.

Case B is more subtle. For correctness it requires the following sequence to
happen:
  - take page table lock
  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model like C11 or C++11
for the device.

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we
assume they are write protected for COW (other cases of B apply too).

[Time N] -----------------------------------------------------------------
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA and populate device TLB}
DEV-thread-2 {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}}
CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {write to addrA which is a write to new page}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {}
CPU-thread-3 {write to addrB which is a write to new page}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees
the new value for addrB before seeing the new value for addrA.
This breaks total memory ordering for the device.

When changing a pte to write protect or to point to a new write protected
page with the same content (KSM), it is OK to delay the invalidate_range
callback to mmu_notifier_invalidate_range_end() outside the page table lock.
This is true even if the thread doing the page table update is preempted
right after releasing the page table lock and before calling
mmu_notifier_invalidate_range_end().

Changed since v1:
  - typos (thanks to Andrea)
  - Avoid unnecessary precaution in try_to_unmap() (Andrea)
  - Be more conservative in try_to_unmap_one()

Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
Cc: Nadav Amit
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Joerg Roedel
Cc: Suravee Suthikulpanit
Cc: David Woodhouse
Cc: Alistair Popple
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Stephen Rothwell
Cc: Andrew Donnellan
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-next@vger.kernel.org
---
 Documentation/vm/mmu_notifier.txt | 93 +++++++++++++++++++++++++++++++++++++++
 fs/dax.c                          |  9 +++-
 include/linux/mmu_notifier.h      |  3 +-
 mm/huge_memory.c                  | 20 +++++++--
 mm/hugetlb.c                      | 16 +++++--
 mm/ksm.c                          | 15 ++++++-
 mm/rmap.c                         | 59 ++++++++++++++++++++++---
 7 files changed, 198 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/vm/mmu_notifier.txt

diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt
new file mode 100644
index 000000000000..23b462566bb7
--- /dev/null
+++ b/Documentation/vm/mmu_notifier.txt
@@ -0,0 +1,93 @@
+When do you need to notify inside page table lock ?
+
+When clearing a pte/pmd we are given a choice to notify the event through
+(notify version of *_clear_flush call mmu_notifier_invalidate_range) under
+the page table lock. But that notification is not necessary in all cases.
+
+For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use
+thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a
+process virtual address space). There is only 2 cases when you need to notify
+those secondary TLB while holding page table lock when clearing a pte/pmd:
+
+  A) page backing address is free before mmu_notifier_invalidate_range_end()
+  B) a page table entry is updated to point to a new page (COW, write fault
+     on zero page, __replace_page(), ...)
+
+Case A is obvious you do not want to take the risk for the device to write to
+a page that might now be used by some completely different task.
+
+Case B is more subtle. For correctness it requires the following sequence to
+happen:
+  - take page table lock
+  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
+  - set page table entry to point to new page
+
+If clearing the page table entry is not followed by a notify before setting
+the new pte/pmd value then you can break memory model like C11 or C++11 for
+the device.
+
+Consider the following scenario (device use a feature similar to ATS/PASID):
+
+Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume
+they are write protected for COW (other case of B apply too).
+
+[Time N] --------------------------------------------------------------------
+CPU-thread-0 {try to write to addrA}
+CPU-thread-1 {try to write to addrB}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {read addrA and populate device TLB}
+DEV-thread-2 {read addrB and populate device TLB}
+[Time N+1] ------------------------------------------------------------------
+CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
+CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+2] ------------------------------------------------------------------
+CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}}
+CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {preempted}
+CPU-thread-2 {write to addrA which is a write to new page}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {preempted}
+CPU-thread-2 {}
+CPU-thread-3 {write to addrB which is a write to new page}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+4] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+5] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {read addrA from old page}
+DEV-thread-2 {read addrB from new page}
+
+So here because at time N+2 the clear page table entry was not pair with a
+notification to invalidate the secondary TLB, the device see the new
value for +addrB before seing the new value for addrA. This break total memory ordering +for the device. + +When changing a pte to write protect or to point to a new write protected page +with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range +call to mmu_notifier_invalidate_range_end() outside the page table lock. This +is true even if the thread doing the page table update is preempted right after +releasing page table lock but before call mmu_notifier_invalidate_range_end(). diff --git a/fs/dax.c b/fs/dax.c index f3a44a7c14b3..9ec797424e4f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -614,6 +614,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl)) continue; + /* + * No need to call mmu_notifier_invalidate_range() as we are + * downgrading page table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ if (pmdp) { #ifdef CONFIG_FS_DAX_PMD pmd_t pmd; @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, pmd = pmd_wrprotect(pmd); pmd = pmd_mkclean(pmd); set_pmd_at(vma->vm_mm, address, pmdp, pmd); - mmu_notifier_invalidate_range(vma->vm_mm, start, end); unlock_pmd: spin_unlock(ptl); #endif @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, pte = pte_wrprotect(pte); pte = pte_mkclean(pte); set_pte_at(vma->vm_mm, address, ptep, pte); - mmu_notifier_invalidate_range(vma->vm_mm, start, end); unlock_pte: pte_unmap_unlock(ptep, ptl); } diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 6866e8126982..49c925c96b8a 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -155,7 +155,8 @@ struct mmu_notifier_ops { * shared page-tables, it not necessary to implement the * invalidate_range_start()/end() notifiers, as * invalidate_range() alread catches the points in time when an - * external 
TLB range needs to be flushed. + * external TLB range needs to be flushed. For more in depth + * discussion on this see Documentation/vm/mmu_notifier.txt * * The invalidate_range() function is called under the ptl * spin-lock and not allowed to sleep. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c037d3d34950..ff5bc647b51d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, goto out_free_pages; VM_BUG_ON_PAGE(!PageHead(page), page); + /* + * Leave pmd empty until pte is filled note we must notify here as + * concurrent CPU thread might write to new page before the call to + * mmu_notifier_invalidate_range_end() happens which can lead to a + * device seeing memory write in different order than CPU. + * + * See Documentation/vm/mmu_notifier.txt + */ pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); - /* leave pmd empty until pte is filled */ pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); pmd_populate(vma->vm_mm, &_pmd, pgtable); @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, pmd_t _pmd; int i; - /* leave pmd empty until pte is filled */ - pmdp_huge_clear_flush_notify(vma, haddr, pmd); + /* + * Leave pmd empty until pte is filled note that it is fine to delay + * notification until mmu_notifier_invalidate_range_end() as we are + * replacing a zero pmd write protected page with a zero pte write + * protected page. 
+ * + * See Documentation/vm/mmu_notifier.txt + */ + pmdp_huge_clear_flush(vma, haddr, pmd); pgtable = pgtable_trans_huge_withdraw(mm, pmd); pmd_populate(mm, &_pmd, pgtable); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1768efa4c501..63a63f1b536c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); } else { if (cow) { + /* + * No need to notify as we are downgrading page + * table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ huge_ptep_set_wrprotect(src, addr, src_pte); - mmu_notifier_invalidate_range(src, mmun_start, - mmun_end); } entry = huge_ptep_get(src_pte); ptepage = pte_page(entry); @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * and that page table be reused and filled with junk. */ flush_hugetlb_tlb_range(vma, start, end); - mmu_notifier_invalidate_range(mm, start, end); + /* + * No need to call mmu_notifier_invalidate_range() we are downgrading + * page table protection not changing it to point to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ i_mmap_unlock_write(vma->vm_file->f_mapping); mmu_notifier_invalidate_range_end(mm, start, end); diff --git a/mm/ksm.c b/mm/ksm.c index 6cb60f46cce5..be8f4576f842 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, * So we clear the pte and flush the tlb before the check * this assure us that no O_DIRECT can happen after the check * or in the middle of the check. + * + * No need to notify as we are downgrading page table to read + * only not changing it to point to a new page. 
+ * + * See Documentation/vm/mmu_notifier.txt */ - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* * Check that no O_DIRECT or similar I/O is in progress on the * page @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, } flush_cache_page(vma, addr, pte_pfn(*ptep)); - ptep_clear_flush_notify(vma, addr, ptep); + /* + * No need to notify as we are replacing a read only page with another + * read only page with the same content. + * + * See Documentation/vm/mmu_notifier.txt + */ + ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); page_remove_rmap(page, false); diff --git a/mm/rmap.c b/mm/rmap.c index 061826278520..6b5a0f219ac0 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, #endif } - if (ret) { - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); + /* + * No need to call mmu_notifier_invalidate_range() as we are + * downgrading page table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ + if (ret) (*cleaned)++; - } } mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); + /* + * No need to invalidate here it will synchronize on + * against the special swap migration pte. + */ goto discard; } @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * will take care of the rest. 
*/ dec_mm_counter(mm, mm_counter(page)); + /* We have to invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); } else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { swp_entry_t entry; @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); + /* + * No need to invalidate here it will synchronize on + * against the special swap migration pte. + */ } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, WARN_ON_ONCE(1); ret = false; /* We have to invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); page_vma_mapped_walk_done(&pvmw); break; } @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* MADV_FREE page check */ if (!PageSwapBacked(page)) { if (!PageDirty(page)) { + /* Invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, + address, address + PAGE_SIZE); dec_mm_counter(mm, MM_ANONPAGES); goto discard; } @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); - } else + /* Invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); + } else { + /* + * We should not need to notify here as we reach this + * case only from freeze_page() itself only call from + * split_huge_page_to_list() so everything below must + * be true: + * - page is not anonymous + * - page is locked + * + * So as it is a locked file back page thus it can not + * be remove from the page cache and replace by 
a new + * page before mmu_notifier_invalidate_range_end so no + * concurrent thread might update its page table to + * point at new page while a device still is using this + * page. + * + * See Documentation/vm/mmu_notifier.txt + */ dec_mm_counter(mm, mm_counter_file(page)); + } discard: + /* + * No need to call mmu_notifier_invalidate_range() it has be + * done above for all cases requiring it to happen under page + * table lock before mmu_notifier_invalidate_range_end() + * + * See Documentation/vm/mmu_notifier.txt + */ page_remove_rmap(subpage, PageHuge(page)); put_page(page); - mmu_notifier_invalidate_range(mm, address, - address + PAGE_SIZE); } mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f200.google.com (mail-qk0-f200.google.com [209.85.220.200]) by kanga.kvack.org (Postfix) with ESMTP id A51736B025E for ; Mon, 16 Oct 2017 23:10:19 -0400 (EDT) Received: by mail-qk0-f200.google.com with SMTP id b62so465995qkh.18 for ; Mon, 16 Oct 2017 20:10:19 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id h67si1294247qkc.426.2017.10.16.20.10.18 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 16 Oct 2017 20:10:18 -0700 (PDT)
From: jglisse@redhat.com
Subject: [PATCH 2/2] mm/mmu_notifier: avoid call to invalidate_range() in range_end()
Date: Mon, 16 Oct 2017 23:10:03 -0400
Message-Id: <20171017031003.7481-3-jglisse@redhat.com>
In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com>
References: <20171017031003.7481-1-jglisse@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Jérôme Glisse, Andrea Arcangeli,
 Andrew Morton, Joerg Roedel, Suravee Suthikulpanit, David Woodhouse,
 Alistair Popple, Michael Ellerman, Benjamin Herrenschmidt,
 Stephen Rothwell, Andrew Donnellan, iommu@lists.linux-foundation.org,
 linuxppc-dev@lists.ozlabs.org

From: Jérôme Glisse

This is an optimization patch that only affects mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end().

Existing pattern (before this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_end()
        mmu_notifier_invalidate_range()

New pattern (after this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_only_end()

We call the invalidate_range callback after clearing the page table entry
under the page table lock, and we skip the call to invalidate_range inside
the __mmu_notifier_invalidate_range_end() function.
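
As a rough illustration of the two patterns (a userspace sketch only, not
kernel code: every model_* name below is made up for this example and the
functions merely count callback invocations), the effect of the only_end
flag can be modelled like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a registered notifier; it only counts calls. */
struct model_notifier {
	int range_calls;	/* times the invalidate_range callback ran */
	int end_calls;		/* times the invalidate_range_end callback ran */
};

/* Models the invalidate_range callback being invoked once. */
static void model_invalidate_range(struct model_notifier *mn)
{
	mn->range_calls++;
}

/* Models a *_clear_flush_notify() helper: clear entry, notify under the ptl. */
static void model_clear_flush_notify(struct model_notifier *mn)
{
	model_invalidate_range(mn);
}

/* Models __mmu_notifier_invalidate_range_end() with the new only_end flag. */
static void model_range_end(struct model_notifier *mn, bool only_end)
{
	if (!only_end)
		model_invalidate_range(mn);	/* skipped by the _only_end variant */
	mn->end_calls++;
}

/* Returns how often invalidate_range ran for one clear+notify sequence. */
static int model_run(bool only_end)
{
	struct model_notifier mn = { 0, 0 };

	model_clear_flush_notify(&mn);
	model_range_end(&mn, only_end);
	assert(mn.end_calls == 1);
	return mn.range_calls;
}
```

In this model, model_run(false) corresponds to the existing pattern and
counts two invalidate_range calls, while model_run(true) corresponds to the
new pattern and counts one. This only holds when the entry was cleared with
a notify helper first, which is the precondition the patch relies on.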

Idea from Andrea Arcangeli

Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Joerg Roedel
Cc: Suravee Suthikulpanit
Cc: David Woodhouse
Cc: Alistair Popple
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Stephen Rothwell
Cc: Andrew Donnellan
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 include/linux/mmu_notifier.h | 17 ++++++++++++++--
 mm/huge_memory.c             | 46 ++++++++++++++++++++++++++++++++++++++++----
 mm/memory.c                  |  6 +++++-
 mm/migrate.c                 | 15 ++++++++++++---
 mm/mmu_notifier.c            | 11 +++++++++--
 5 files changed, 83 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 49c925c96b8a..6665c4624287 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -213,7 +213,8 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm, extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end); extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end); + unsigned long start, unsigned long end, + bool only_end); extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); @@ -267,7 +268,14 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, unsigned long start, unsigned long end) { if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end); + __mmu_notifier_invalidate_range_end(mm, start, end, false); +} + +static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end, true); } static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, @@ -438,6 +446,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, { } +static inline void
mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ff5bc647b51d..b2912305994f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1220,7 +1220,12 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, page_remove_rmap(page, true); spin_unlock(vmf->ptl); - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, + mmun_end); ret |= VM_FAULT_WRITE; put_page(page); @@ -1369,7 +1374,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) } spin_unlock(vmf->ptl); out_mn: - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, + mmun_end); out: return ret; out_unlock: @@ -2021,7 +2031,12 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, out: spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PUD_SIZE); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pudp_huge_clear_flush_notify() did already call it. 
+ */ + mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + + HPAGE_PUD_SIZE); } #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ @@ -2096,6 +2111,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR); return; } else if (is_huge_zero_pmd(*pmd)) { + /* + * FIXME: Do we want to invalidate secondary mmu by calling + * mmu_notifier_invalidate_range() see comments below inside + * __split_huge_pmd() ? + * + * We are going from a zero huge page write protected to zero + * small page also write protected so it does not seems useful + * to invalidate secondary mmu at this time. + */ return __split_huge_zero_page_pmd(vma, haddr, pmd); } @@ -2231,7 +2255,21 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, __split_huge_pmd_locked(vma, pmd, haddr, freeze); out: spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE); + /* + * No need to double call mmu_notifier->invalidate_range() callback. + * They are 3 cases to consider inside __split_huge_pmd_locked(): + * 1) pmdp_huge_clear_flush_notify() call invalidate_range() obvious + * 2) __split_huge_zero_page_pmd() read only zero page and any write + * fault will trigger a flush_notify before pointing to a new page + * (it is fine if the secondary mmu keeps pointing to the old zero + * page in the meantime) + * 3) Split a huge pmd into pte pointing to the same page. No need + * to invalidate secondary tlb entry they are all still valid. + * any further changes to individual pte will notify. 
So no need + * to call mmu_notifier->invalidate_range() + */ + mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + + HPAGE_PMD_SIZE); } void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index 47cdf4e85c2d..8a0c410037d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2555,7 +2555,11 @@ static int wp_page_copy(struct vm_fault *vmf) put_page(new_page); pte_unmap_unlock(vmf->pte, vmf->ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above ptep_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); if (old_page) { /* * Don't let another task, with possibly unlocked vma, diff --git a/mm/migrate.c b/mm/migrate.c index e00814ca390e..2f0f8190cb6f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2088,7 +2088,11 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED); spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above pmdp_huge_clear_flush_notify() did already call it. + */ + mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); /* Take an "isolate" reference and put new page on the LRU. */ get_page(new_page); @@ -2804,9 +2808,14 @@ static void migrate_vma_pages(struct migrate_vma *migrate) migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; } + /* + * No need to double call mmu_notifier->invalidate_range() callback as + * the above ptep_clear_flush_notify() inside migrate_vma_insert_page() + * did already call it. 
+ */ if (notified) - mmu_notifier_invalidate_range_end(mm, mmu_start, - migrate->end); + mmu_notifier_invalidate_range_only_end(mm, mmu_start, + migrate->end); } /* diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 314285284e6e..96edb33fd09a 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -190,7 +190,9 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start); void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, + unsigned long end, + bool only_end) { struct mmu_notifier *mn; int id; @@ -204,8 +206,13 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, * subsystem registers either invalidate_range_start()/end() or * invalidate_range(), so this will be no additional overhead * (besides the pointer check). + * + * We skip call to invalidate_range() if we know it is safe ie + * call site use mmu_notifier_invalidate_range_only_end() which + * is safe to do when we know that a call to invalidate_range() + * already happen under page table lock. */ - if (mn->ops->invalidate_range) + if (!only_end && mn->ops->invalidate_range) mn->ops->invalidate_range(mn, mm, start, end); if (mn->ops->invalidate_range_end) mn->ops->invalidate_range_end(mn, mm, start, end); -- 2.13.6 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 4A0D46B0033 for ; Wed, 18 Oct 2017 23:04:40 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id z11so4709825pfk.23 for ; Wed, 18 Oct 2017 20:04:40 -0700 (PDT) Received: from mail-sor-f65.google.com (mail-sor-f65.google.com.
[209.85.220.65]) by mx.google.com with SMTPS id f4sor857611plm.6.2017.10.18.20.04.38 for (Google Transport Security); Wed, 18 Oct 2017 20:04:38 -0700 (PDT) Date: Thu, 19 Oct 2017 14:04:26 +1100 From: Balbir Singh Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171019140426.21f51957@MiWiFi-R3-srv> In-Reply-To: <20171017031003.7481-2-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: jglisse@redhat.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org On Mon, 16 Oct 2017 23:10:02 -0400 jglisse@redhat.com wrote: > From: Jérôme Glisse > > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > if (pmdp) { > #ifdef CONFIG_FS_DAX_PMD > pmd_t pmd; > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pmd = pmd_wrprotect(pmd); > pmd = pmd_mkclean(pmd); > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> unlock_pmd: > spin_unlock(ptl); > #endif > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pte = pte_wrprotect(pte); > pte = pte_mkclean(pte); > set_pte_at(vma->vm_mm, address, ptep, pte); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); Ditto > unlock_pte: > pte_unmap_unlock(ptep, ptl); > } > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 6866e8126982..49c925c96b8a 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > * shared page-tables, it not necessary to implement the > * invalidate_range_start()/end() notifiers, as > * invalidate_range() alread catches the points in time when an > - * external TLB range needs to be flushed. > + * external TLB range needs to be flushed. For more in depth > + * discussion on this see Documentation/vm/mmu_notifier.txt > * > * The invalidate_range() function is called under the ptl > * spin-lock and not allowed to sleep. > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index c037d3d34950..ff5bc647b51d 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > goto out_free_pages; > VM_BUG_ON_PAGE(!PageHead(page), page); > > + /* > + * Leave pmd empty until pte is filled note we must notify here as > + * concurrent CPU thread might write to new page before the call to > + * mmu_notifier_invalidate_range_end() happens which can lead to a > + * device seeing memory write in different order than CPU.
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > - /* leave pmd empty until pte is filled */ > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > pmd_populate(vma->vm_mm, &_pmd, pgtable); > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > pmd_t _pmd; > int i; > > - /* leave pmd empty until pte is filled */ > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > + /* > + * Leave pmd empty until pte is filled note that it is fine to delay > + * notification until mmu_notifier_invalidate_range_end() as we are > + * replacing a zero pmd write protected page with a zero pte write > + * protected page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + pmdp_huge_clear_flush(vma, haddr, pmd); Shouldn't the secondary TLB know if the page size changed? > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > pmd_populate(mm, &_pmd, pgtable); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 1768efa4c501..63a63f1b536c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > } else { > if (cow) { > + /* > + * No need to notify as we are downgrading page > + * table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > huge_ptep_set_wrprotect(src, addr, src_pte); OK.. so we could get write faults on write accesses from the device. > - mmu_notifier_invalidate_range(src, mmun_start, > - mmun_end); > } > entry = huge_ptep_get(src_pte); > ptepage = pte_page(entry); > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > * and that page table be reused and filled with junk.
> */ > flush_hugetlb_tlb_range(vma, start, end); > - mmu_notifier_invalidate_range(mm, start, end); > + /* > + * No need to call mmu_notifier_invalidate_range() we are downgrading > + * page table protection not changing it to point to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > i_mmap_unlock_write(vma->vm_file->f_mapping); > mmu_notifier_invalidate_range_end(mm, start, end); > > diff --git a/mm/ksm.c b/mm/ksm.c > index 6cb60f46cce5..be8f4576f842 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > * So we clear the pte and flush the tlb before the check > * this assure us that no O_DIRECT can happen after the check > * or in the middle of the check. > + * > + * No need to notify as we are downgrading page table to read > + * only not changing it to point to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > */ > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > /* > * Check that no O_DIRECT or similar I/O is in progress on the > * page > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > } > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > - ptep_clear_flush_notify(vma, addr, ptep); > + /* > + * No need to notify as we are replacing a read only page with another > + * read only page with the same content.
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + ptep_clear_flush(vma, addr, ptep); > set_pte_at_notify(mm, addr, ptep, newpte); > > page_remove_rmap(page, false); > diff --git a/mm/rmap.c b/mm/rmap.c > index 061826278520..6b5a0f219ac0 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > #endif > } > > - if (ret) { > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + if (ret) > (*cleaned)++; > - } > } > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte. > + */ > goto discard; > } > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > * will take care of the rest. > */ > dec_mm_counter(mm, mm_counter(page)); > + /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > } else if (IS_ENABLED(CONFIG_MIGRATION) && > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > swp_entry_t entry; > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte.
> + */ > } else if (PageAnon(page)) { > swp_entry_t entry = { .val = page_private(subpage) }; > pte_t swp_pte; > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > WARN_ON_ONCE(1); > ret = false; > /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > page_vma_mapped_walk_done(&pvmw); > break; > } > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > if (!PageDirty(page)) { > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, > + address, address + PAGE_SIZE); > dec_mm_counter(mm, MM_ANONPAGES); > goto discard; > } > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > - } else > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > + } else { > + /* > + * We should not need to notify here as we reach this > + * case only from freeze_page() itself only call from > + * split_huge_page_to_list() so everything below must > + * be true: > + * - page is not anonymous > + * - page is locked > + * > + * So as it is a locked file back page thus it can not > + * be remove from the page cache and replace by a new > + * page before mmu_notifier_invalidate_range_end so no > + * concurrent thread might update its page table to > + * point at new page while a device still is using this > + * page.
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > dec_mm_counter(mm, mm_counter_file(page)); > + } > discard: > + /* > + * No need to call mmu_notifier_invalidate_range() it has be > + * done above for all cases requiring it to happen under page > + * table lock before mmu_notifier_invalidate_range_end() > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > page_remove_rmap(subpage, PageHuge(page)); > put_page(page); > - mmu_notifier_invalidate_range(mm, address, > - address + PAGE_SIZE); > } > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); Looking at the patchset, I understand the efficiency, but I am concerned with correctness. Balbir Singh. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f197.google.com (mail-qk0-f197.google.com [209.85.220.197]) by kanga.kvack.org (Postfix) with ESMTP id 739526B0033 for ; Wed, 18 Oct 2017 23:08:18 -0400 (EDT) Received: by mail-qk0-f197.google.com with SMTP id g74so7794061qke.4 for ; Wed, 18 Oct 2017 20:08:18 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com.
[209.132.183.28]) by mx.google.com with ESMTPS id t27si92417qki.317.2017.10.18.20.08.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 18 Oct 2017 20:08:17 -0700 (PDT) Date: Wed, 18 Oct 2017 23:08:12 -0400 From: Jerome Glisse Subject: Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback Message-ID: <20171019030812.GB5246@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171019134319.1b856091@MiWiFi-R3-srv> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20171019134319.1b856091@MiWiFi-R3-srv> Sender: owner-linux-mm@kvack.org List-ID: To: Balbir Singh Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org On Thu, Oct 19, 2017 at 01:43:19PM +1100, Balbir Singh wrote: > On Mon, 16 Oct 2017 23:10:01 -0400 > jglisse@redhat.com wrote: > > > From: Jerome Glisse > > > > (Andrew you already have v1 in your queue of patch 1, patch 2 is new, > > i think you can drop it patch 1 v1 for v2, v2 is bit more conservative > > and i fixed typos) > > > > All this only affect user of invalidate_range callback (at this time > > CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in > > drivers/iommu/amd_iommu_v2.c|intel-svm.c) > > > > This patchset remove useless double call to mmu_notifier->invalidate_range > > callback wherever it is safe to do so. The first patch just remove useless > > call > > As in an extra call? Where does that come from? Before this patch you had the following pattern: mmu_notifier_invalidate_range_start(); take_page_table_lock() ... update_page_table() mmu_notifier_invalidate_range() ... 
drop_page_table_lock() mmu_notifier_invalidate_range_end(); It happens that mmu_notifier_invalidate_range_end() also makes an unconditional call to mmu_notifier_invalidate_range(), so in the above scenario you had 2 calls to mmu_notifier_invalidate_range(). Obviously one of the 2 calls is useless. In some cases you can drop the first call (under the page table lock); this is what patch 1 does. In other cases you can drop the second call, the one that happens inside mmu_notifier_invalidate_range_end(); that is what patch 2 does. Hence why I am referring to a useless double call. I have added more documentation to explain all this in the code and also under Documentation/vm/mmu_notifier.txt > > > and add documentation explaining why it is safe to do so. The second > > patch go further by introducing mmu_notifier_invalidate_range_only_end() > > which skip callback to invalidate_range this can be done when clearing a > > pte, pmd or pud with notification which call invalidate_range right after > > clearing under the page table lock. > > > > Balbir Singh. > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f200.google.com (mail-qk0-f200.google.com [209.85.220.200]) by kanga.kvack.org (Postfix) with ESMTP id E77766B0033 for ; Wed, 18 Oct 2017 23:28:20 -0400 (EDT) Received: by mail-qk0-f200.google.com with SMTP id d67so7831367qkg.3 for ; Wed, 18 Oct 2017 20:28:20 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com.
[209.132.183.28]) by mx.google.com with ESMTPS id j37si1515358qtb.354.2017.10.18.20.28.19 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 18 Oct 2017 20:28:19 -0700 (PDT) Date: Wed, 18 Oct 2017 23:28:12 -0400 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171019032811.GC5246@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20171019140426.21f51957@MiWiFi-R3-srv> Sender: owner-linux-mm@kvack.org List-ID: To: Balbir Singh Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org, linux-next@vger.kernel.org On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > On Mon, 16 Oct 2017 23:10:02 -0400 > jglisse@redhat.com wrote: > > > From: Jerome Glisse > > > > + /* > > + * No need to call mmu_notifier_invalidate_range() as we are > > + * downgrading page table protection not changing it to point > > + * to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > if (pmdp) { > > #ifdef CONFIG_FS_DAX_PMD > > pmd_t pmd; > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > pmd = pmd_wrprotect(pmd); > > pmd = pmd_mkclean(pmd); > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? 
I am assuming hardware does the sane thing of setting the dirty bit only when walking the CPU page table on a device write fault, i.e. by the time the device gets a write TLB entry, the dirty bit has been set by the IOMMU when walking the page table, before returning the lookup result to the device, and that it won't be set again later (i.e. propagated back later). I should probably have spelled that out, and maybe some of the ATS/PASID implementers did not do that. > > > unlock_pmd: > > spin_unlock(ptl); > > #endif > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > pte = pte_wrprotect(pte); > > pte = pte_mkclean(pte); > > set_pte_at(vma->vm_mm, address, ptep, pte); > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > Ditto > > > unlock_pte: > > pte_unmap_unlock(ptep, ptl); > > } > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index 6866e8126982..49c925c96b8a 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > * shared page-tables, it not necessary to implement the > > * invalidate_range_start()/end() notifiers, as > > * invalidate_range() alread catches the points in time when an > > - * external TLB range needs to be flushed. > > + * external TLB range needs to be flushed. For more in depth > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > * > > * The invalidate_range() function is called under the ptl > > * spin-lock and not allowed to sleep.
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index c037d3d34950..ff5bc647b51d 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > goto out_free_pages; > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > + /* > > + * Leave pmd empty until pte is filled note we must notify here as > > + * concurrent CPU thread might write to new page before the call to > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > + * device seeing memory write in different order than CPU. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > - /* leave pmd empty until pte is filled */ > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > pmd_t _pmd; > > int i; > > > > - /* leave pmd empty until pte is filled */ > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > + /* > > + * Leave pmd empty until pte is filled note that it is fine to delay > > + * notification until mmu_notifier_invalidate_range_end() as we are > > + * replacing a zero pmd write protected page with a zero pte write > > + * protected page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > Shouldn't the secondary TLB know if the page size changed? It should not matter: we are talking about virtual to physical translation on behalf of a device against a process address space, so the hardware should not care about the page size. Moreover, if any of the new 512 (assuming 2MB huge and 4K pages) zero 4K pages is replaced by something new, then a device TLB shootdown will happen before the new page is set.
The only issue I can think of is if the IOMMU TLB (if there is one) or the device TLB (you do expect that there is one) does not invalidate a TLB entry when the TLB shootdown is smaller than the TLB entry. That would be idiotic but yes, I know, hardware bugs exist. > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > pmd_populate(mm, &_pmd, pgtable); > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 1768efa4c501..63a63f1b536c 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > } else { > > if (cow) { > > + /* > > + * No need to notify as we are downgrading page > > + * table protection not changing it to point > > + * to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > huge_ptep_set_wrprotect(src, addr, src_pte); > > OK.. so we could get write faults on write accesses from the device. > > > - mmu_notifier_invalidate_range(src, mmun_start, > > - mmun_end); > > } > > entry = huge_ptep_get(src_pte); > > ptepage = pte_page(entry); > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > * and that page table be reused and filled with junk. > > */ > > flush_hugetlb_tlb_range(vma, start, end); > > - mmu_notifier_invalidate_range(mm, start, end); > > + /* > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > + * page table protection not changing it to point to a new page.
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > index 6cb60f46cce5..be8f4576f842 100644 > > --- a/mm/ksm.c > > +++ b/mm/ksm.c > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > * So we clear the pte and flush the tlb before the check > > * this assure us that no O_DIRECT can happen after the check > > * or in the middle of the check. > > + * > > + * No need to notify as we are downgrading page table to read > > + * only not changing it to point to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > */ > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > /* > > * Check that no O_DIRECT or similar I/O is in progress on the > > * page > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > } > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > - ptep_clear_flush_notify(vma, addr, ptep); > > + /* > > + * No need to notify as we are replacing a read only page with another > > + * read only page with the same content. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + ptep_clear_flush(vma, addr, ptep); > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > page_remove_rmap(page, false); > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 061826278520..6b5a0f219ac0 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > #endif > > } > > > > - if (ret) { > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > + /* > > + * No need to call mmu_notifier_invalidate_range() as we are > > + * downgrading page table protection not changing it to point > > + * to a new page. 
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + if (ret) > > (*cleaned)++; > > - } > > } > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > + /* > > + * No need to invalidate here it will synchronize on > > + * against the special swap migration pte. > > + */ > > goto discard; > > } > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > * will take care of the rest. > > */ > > dec_mm_counter(mm, mm_counter(page)); > > + /* We have to invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > swp_entry_t entry; > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > + /* > > + * No need to invalidate here it will synchronize on > > + * against the special swap migration pte. 
> > + */ > > } else if (PageAnon(page)) { > > swp_entry_t entry = { .val = page_private(subpage) }; > > pte_t swp_pte; > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > WARN_ON_ONCE(1); > > ret = false; > > /* We have to invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > page_vma_mapped_walk_done(&pvmw); > > break; > > } > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > /* MADV_FREE page check */ > > if (!PageSwapBacked(page)) { > > if (!PageDirty(page)) { > > + /* Invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, > > + address, address + PAGE_SIZE); > > dec_mm_counter(mm, MM_ANONPAGES); > > goto discard; > > } > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > - } else > > + /* Invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > + } else { > > + /* > > + * We should not need to notify here as we reach this > > + * case only from freeze_page() itself only call from > > + * split_huge_page_to_list() so everything below must > > + * be true: > > + * - page is not anonymous > > + * - page is locked > > + * > > + * So as it is a locked file back page thus it can not > > + * be remove from the page cache and replace by a new > > + * page before mmu_notifier_invalidate_range_end so no > > + * concurrent thread might update its page table to > > + * point at new page while a device still is using this > > + * page. 
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > dec_mm_counter(mm, mm_counter_file(page)); > > + } > > discard: > > + /* > > + * No need to call mmu_notifier_invalidate_range() it has be > > + * done above for all cases requiring it to happen under page > > + * table lock before mmu_notifier_invalidate_range_end() > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > page_remove_rmap(subpage, PageHuge(page)); > > put_page(page); > > - mmu_notifier_invalidate_range(mm, address, > > - address + PAGE_SIZE); > > } > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > Looking at the patchset, I understand the efficiency, but I am concerned > with correctness. I am fine with holding this off from reaching Linus, but the only way to flush these issues out, if any, is to have this patch in linux-next or somewhere where it gets a chance of being tested. Note that the second patch is always safe. I agree that this one might not be if the hardware implementation is idiotic (well, that would be my opinion, and any opinion/point of view can be challenged :)) > > Balbir Singh. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f200.google.com (mail-qt0-f200.google.com [209.85.216.200]) by kanga.kvack.org (Postfix) with ESMTP id E88F86B0038 for ; Thu, 19 Oct 2017 12:58:32 -0400 (EDT) Received: by mail-qt0-f200.google.com with SMTP id d9so8551752qtd.8 for ; Thu, 19 Oct 2017 09:58:32 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com.
[209.132.183.28]) by mx.google.com with ESMTPS id 11si4318649qkn.237.2017.10.19.09.58.30 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 19 Oct 2017 09:58:31 -0700 (PDT) Date: Thu, 19 Oct 2017 12:58:23 -0400 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171019165823.GA3044@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > >> On Mon, 16 Oct 2017 23:10:02 -0400 > >> jglisse@redhat.com wrote: > >> > >> > From: Jerome Glisse > >> > > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page. 
> >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > if (pmdp) { > >> > #ifdef CONFIG_FS_DAX_PMD > >> > pmd_t pmd; > >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pmd = pmd_wrprotect(pmd); > >> > pmd = pmd_mkclean(pmd); > >> > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > I am assuming hardware does sane thing of setting the dirty bit only > > when walking the CPU page table when device does a write fault ie > > once the device get a write TLB entry the dirty is set by the IOMMU > > when walking the page table before returning the lookup result to the > > device and that it won't be set again latter (ie propagated back > > latter). > > > > The other possibility is that the hardware things the page is writable > and already > marked dirty. It allows writes and does not set the dirty bit?

I thought about this some more, and the patch cannot regress anything that is not already broken today. If we assume that a device can propagate the dirty bit because it can cache the write permission, then all current code is broken for two reasons:

First, current code clears the pte entry, builds a new pte value with write protection, and updates the pte entry with the new value. So any PASID/ATS platform that allows a device to cache the write bit and set the dirty bit any time after that can race during that window, and you would lose the dirty bit from the device. That is not that bad, as you are going to propagate the dirty bit to the struct page.

Second, the dirty bit might be propagated back to the new write-protected pte. From a quick look at the code, it seems that when we zap a pte or mkclean we do not check that the pte has write permission; we only care about the dirty bit. So it should not have any bad consequences.
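
The second point can be pictured with a small userspace toy model. This is only a sketch under loud assumptions: `toy_pte`, `toy_page`, the `TOY_PTE_*` flags, `toy_mkclean()` and `toy_late_dirty_survives()` are hypothetical names made up here, not kernel types or functions, and the real zap/mkclean paths are far more involved. It only shows a clean step that looks at the dirty bit and never at the write bit, so a dirty bit set late on an already write-protected pte is still picked up on the next pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy pte with two flag bits. Hypothetical, not the kernel's pte_t. */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

struct toy_pte  { unsigned int flags; };
struct toy_page { bool dirty; };

/*
 * Toy mkclean: write-protect and clean the pte, moving any dirty bit
 * found in it to the page. It tests only TOY_PTE_DIRTY and never
 * TOY_PTE_WRITE. Returns true if a dirty bit was harvested.
 */
bool toy_mkclean(struct toy_pte *pte, struct toy_page *page)
{
	bool was_dirty = (pte->flags & TOY_PTE_DIRTY) != 0;

	if (was_dirty)
		page->dirty = true;
	pte->flags &= ~(TOY_PTE_WRITE | TOY_PTE_DIRTY);
	return was_dirty;
}

/*
 * Replay the assumed sequence: clean a writable dirty pte, then let a
 * device holding a stale writable TLB entry set the dirty bit again on
 * the now write-protected pte, then clean once more.
 * Returns true if that late dirty bit reached the page.
 */
bool toy_late_dirty_survives(void)
{
	struct toy_pte pte = { TOY_PTE_WRITE | TOY_PTE_DIRTY };
	struct toy_page page = { false };

	toy_mkclean(&pte, &page);
	assert(pte.flags == 0);      /* clean and write-protected */

	page.dirty = false;          /* pretend writeback completed */
	pte.flags |= TOY_PTE_DIRTY;  /* late dirty bit from the device */

	return toy_mkclean(&pte, &page) && page.dirty;
}
```

In this model `toy_late_dirty_survives()` returns true, which is the behaviour the argument relies on; whether real hardware ever sets the dirty bit that late is exactly the open question in this thread.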
After this patch, only the second window gets bigger, so that race becomes more likely. But nothing sinister should come of it.

> > > I should probably have spell that out and maybe some of the ATS/PASID > > implementer did not do that. > > > >> > >> > unlock_pmd: > >> > spin_unlock(ptl); > >> > #endif > >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pte = pte_wrprotect(pte); > >> > pte = pte_mkclean(pte); > >> > set_pte_at(vma->vm_mm, address, ptep, pte); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Ditto > >> > >> > unlock_pte: > >> > pte_unmap_unlock(ptep, ptl); > >> > } > >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > >> > index 6866e8126982..49c925c96b8a 100644 > >> > --- a/include/linux/mmu_notifier.h > >> > +++ b/include/linux/mmu_notifier.h > >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > >> > * shared page-tables, it not necessary to implement the > >> > * invalidate_range_start()/end() notifiers, as > >> > * invalidate_range() alread catches the points in time when an > >> > - * external TLB range needs to be flushed. > >> > + * external TLB range needs to be flushed. For more in depth > >> > + * discussion on this see Documentation/vm/mmu_notifier.txt > >> > * > >> > * The invalidate_range() function is called under the ptl > >> > * spin-lock and not allowed to sleep. 
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> > index c037d3d34950..ff5bc647b51d 100644 > >> > --- a/mm/huge_memory.c > >> > +++ b/mm/huge_memory.c > >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > >> > goto out_free_pages; > >> > VM_BUG_ON_PAGE(!PageHead(page), page); > >> > > >> > + /* > >> > + * Leave pmd empty until pte is filled note we must notify here as > >> > + * concurrent CPU thread might write to new page before the call to > >> > + * mmu_notifier_invalidate_range_end() happens which can lead to a > >> > + * device seeing memory write in different order than CPU. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > >> > - /* leave pmd empty until pte is filled */ > >> > > >> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > >> > pmd_populate(vma->vm_mm, &_pmd, pgtable); > >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > >> > pmd_t _pmd; > >> > int i; > >> > > >> > - /* leave pmd empty until pte is filled */ > >> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > >> > + /* > >> > + * Leave pmd empty until pte is filled note that it is fine to delay > >> > + * notification until mmu_notifier_invalidate_range_end() as we are > >> > + * replacing a zero pmd write protected page with a zero pte write > >> > + * protected page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + pmdp_huge_clear_flush(vma, haddr, pmd); > >> > >> Shouldn't the secondary TLB know if the page size changed? > > > > It should not matter, we are talking virtual to physical on behalf > > of a device against a process address space. So the hardware should > > not care about the page size. > > > > Does that not indicate how much the device can access? Could it try > to access more than what is mapped? 
Assume the device has a huge TLB, with 2MB huge pages and 4K small pages. You are going from one TLB entry covering a 2MB zero page to 512 TLB entries each covering 4K. Both cases are read only and both point to the same data (i.e. zero).

It is fine to delay the TLB invalidation on the device until the call to mmu_notifier_invalidate_range_end(). The device will keep using the huge TLB entry a little longer, but both CPU and device are looking at the same data.

Now if a racing thread replaces one of the 512 zero pages after the split but before mmu_notifier_invalidate_range_end(), that code path would call mmu_notifier_invalidate_range() before changing the pte to point to something else, which should shoot down the device TLB (it would be a serious device bug if this did not work).

> > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > 4K pages is replace by something new then a device TLB shootdown will > > happen before the new page is set. > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > the device TLB (you do expect that there is one) does not invalidate > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > would be idiotic but yes i know hardware bug. > > > > > >> > >> > > >> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > >> > pmd_populate(mm, &_pmd, pgtable); > >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >> > index 1768efa4c501..63a63f1b536c 100644 > >> > --- a/mm/hugetlb.c > >> > +++ b/mm/hugetlb.c > >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > >> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > >> > } else { > >> > if (cow) { > >> > + /* > >> > + * No need to notify as we are downgrading page > >> > + * table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > huge_ptep_set_wrprotect(src, addr, src_pte); > >> > >> OK.. 
so we could get write faults on write accesses from the device. > >> > >> > - mmu_notifier_invalidate_range(src, mmun_start, > >> > - mmun_end); > >> > } > >> > entry = huge_ptep_get(src_pte); > >> > ptepage = pte_page(entry); > >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > >> > * and that page table be reused and filled with junk. > >> > */ > >> > flush_hugetlb_tlb_range(vma, start, end); > >> > - mmu_notifier_invalidate_range(mm, start, end); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() we are downgrading > >> > + * page table protection not changing it to point to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > i_mmap_unlock_write(vma->vm_file->f_mapping); > >> > mmu_notifier_invalidate_range_end(mm, start, end); > >> > > >> > diff --git a/mm/ksm.c b/mm/ksm.c > >> > index 6cb60f46cce5..be8f4576f842 100644 > >> > --- a/mm/ksm.c > >> > +++ b/mm/ksm.c > >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > >> > * So we clear the pte and flush the tlb before the check > >> > * this assure us that no O_DIRECT can happen after the check > >> > * or in the middle of the check. > >> > + * > >> > + * No need to notify as we are downgrading page table to read > >> > + * only not changing it to point to a new page. 
> >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > */ > >> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > >> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > >> > /* > >> > * Check that no O_DIRECT or similar I/O is in progress on the > >> > * page > >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > >> > } > >> > > >> > flush_cache_page(vma, addr, pte_pfn(*ptep)); > >> > - ptep_clear_flush_notify(vma, addr, ptep); > >> > + /* > >> > + * No need to notify as we are replacing a read only page with another > >> > + * read only page with the same content. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + ptep_clear_flush(vma, addr, ptep); > >> > set_pte_at_notify(mm, addr, ptep, newpte); > >> > > >> > page_remove_rmap(page, false); > >> > diff --git a/mm/rmap.c b/mm/rmap.c > >> > index 061826278520..6b5a0f219ac0 100644 > >> > --- a/mm/rmap.c > >> > +++ b/mm/rmap.c > >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > >> > #endif > >> > } > >> > > >> > - if (ret) { > >> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + if (ret) > >> > (*cleaned)++; > >> > - } > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. 
> >> > + */ > >> > goto discard; > >> > } > >> > > >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > * will take care of the rest. > >> > */ > >> > dec_mm_counter(mm, mm_counter(page)); > >> > + /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > } else if (IS_ENABLED(CONFIG_MIGRATION) && > >> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > >> > swp_entry_t entry; > >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. > >> > + */ > >> > } else if (PageAnon(page)) { > >> > swp_entry_t entry = { .val = page_private(subpage) }; > >> > pte_t swp_pte; > >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > WARN_ON_ONCE(1); > >> > ret = false; > >> > /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > page_vma_mapped_walk_done(&pvmw); > >> > break; > >> > } > >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > /* MADV_FREE page check */ > >> > if (!PageSwapBacked(page)) { > >> > if (!PageDirty(page)) { > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, > >> > + address, address + PAGE_SIZE); > >> > dec_mm_counter(mm, MM_ANONPAGES); > >> > goto discard; > >> > } > >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, 
pvmw.pte, swp_pte); > >> > - } else > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > + } else { > >> > + /* > >> > + * We should not need to notify here as we reach this > >> > + * case only from freeze_page() itself only call from > >> > + * split_huge_page_to_list() so everything below must > >> > + * be true: > >> > + * - page is not anonymous > >> > + * - page is locked > >> > + * > >> > + * So as it is a locked file back page thus it can not > >> > + * be remove from the page cache and replace by a new > >> > + * page before mmu_notifier_invalidate_range_end so no > >> > + * concurrent thread might update its page table to > >> > + * point at new page while a device still is using this > >> > + * page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > dec_mm_counter(mm, mm_counter_file(page)); > >> > + } > >> > discard: > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() it has be > >> > + * done above for all cases requiring it to happen under page > >> > + * table lock before mmu_notifier_invalidate_range_end() > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > page_remove_rmap(subpage, PageHuge(page)); > >> > put_page(page); > >> > - mmu_notifier_invalidate_range(mm, address, > >> > - address + PAGE_SIZE); > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > >> Looking at the patchset, I understand the efficiency, but I am concerned > >> with correctness. > > > > I am fine in holding this off from reaching Linus but only way to flush this > > issues out if any is to have this patch in linux-next or somewhere were they > > get a chance of being tested. > > > > Yep, I would like to see some additional testing around npu and get Alistair > Popple to comment as well I think this patch is fine. 
The only race window it might widen should have no bad consequences.

> > > Note that the second patch is always safe. I agree that this one might > > not be if hardware implementation is idiotic (well that would be my > > opinion and any opinion/point of view can be challenge :)) > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > that avoid the _start/_end and have just the only_end variant? That seemed > reasonable to me, but I've not tested it or evaluated it in depth

Yes, patch 2/2 in this series is definitely fine. It invalidates the device TLB right after clearing the pte entry and avoids a later, unnecessary invalidation of the same TLB.

Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 731D46B0038 for ; Sat, 21 Oct 2017 01:54:53 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id f85so12555629pfe.7 for ; Fri, 20 Oct 2017 22:54:53 -0700 (PDT) Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. 
[209.85.220.65]) by mx.google.com with SMTPS id n66sor787435pfa.101.2017.10.20.22.54.51 for (Google Transport Security); Fri, 20 Oct 2017 22:54:51 -0700 (PDT) Message-ID: <1508565280.5662.6.camel@gmail.com> Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 From: Balbir Singh Date: Sat, 21 Oct 2017 16:54:40 +1100 In-Reply-To: <20171019165823.GA3044@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Glisse Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > jglisse@redhat.com wrote: > > > > > > > > > From: Jérôme Glisse > > > > > > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > + * downgrading page table protection not changing it to point > > > > > + * to a new page. 
> > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > if (pmdp) { > > > > > #ifdef CONFIG_FS_DAX_PMD > > > > > pmd_t pmd; > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > pmd = pmd_wrprotect(pmd); > > > > > pmd = pmd_mkclean(pmd); > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > > > I am assuming hardware does sane thing of setting the dirty bit only > > > when walking the CPU page table when device does a write fault ie > > > once the device get a write TLB entry the dirty is set by the IOMMU > > > when walking the page table before returning the lookup result to the > > > device and that it won't be set again latter (ie propagated back > > > latter). > > > > > > > The other possibility is that the hardware things the page is writable > > and already > > marked dirty. It allows writes and does not set the dirty bit? > > I thought about this some more and the patch can not regress anything > that is not broken today. So if we assume that device can propagate > dirty bit because it can cache the write protection than all current > code is broken for two reasons: > > First one is current code clear pte entry, build a new pte value with > write protection and update pte entry with new pte value. So any PASID/ > ATS platform that allows device to cache the write bit and set dirty > bit anytime after that can race during that window and you would loose > the dirty bit of the device. That is not that bad as you are gonna > propagate the dirty bit to the struct page.

But they stay consistent with the notifiers: from the OS perspective, it notifies of any PTE changes as they happen. When the ATS platform sees an invalidation, it invalidates its PTEs as well.
I was speaking of the case where the ATS platform assumes it has write access and has not seen any invalidation. The OS could return to user space or the caller with the write bit clear, but the ATS platform could still do a write since it has not seen the invalidation.

> > Second one is if the dirty bit is propagated back to the new write > protected pte. Quick look at code it seems that when we zap pte or > or mkclean we don't check that the pte has write permission but only > care about the dirty bit. So it should not have any bad consequence. > > After this patch only the second window is bigger and thus more likely > to happen. But nothing sinister should happen from that. > > > > > > > I should probably have spell that out and maybe some of the ATS/PASID > > > implementer did not do that. > > > > > > > > > > > > unlock_pmd: > > > > > spin_unlock(ptl); > > > > > #endif > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > pte = pte_wrprotect(pte); > > > > > pte = pte_mkclean(pte); > > > > > set_pte_at(vma->vm_mm, address, ptep, pte); > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > Ditto > > > > > > > > > unlock_pte: > > > > > pte_unmap_unlock(ptep, ptl); > > > > > } > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > > > > index 6866e8126982..49c925c96b8a 100644 > > > > > --- a/include/linux/mmu_notifier.h > > > > > +++ b/include/linux/mmu_notifier.h > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > > > > * shared page-tables, it not necessary to implement the > > > > > * invalidate_range_start()/end() notifiers, as > > > > > * invalidate_range() alread catches the points in time when an > > > > > - * external TLB range needs to be flushed. > > > > > + * external TLB range needs to be flushed. 
For more in depth > > > > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > > > > * > > > > > * The invalidate_range() function is called under the ptl > > > > > * spin-lock and not allowed to sleep. > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > --- a/mm/huge_memory.c > > > > > +++ b/mm/huge_memory.c > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > goto out_free_pages; > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > + /* > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > + * device seeing memory write in different order than CPU. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > pmd_t _pmd; > > > > > int i; > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > + /* > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > + * protected page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > Shouldn't the secondary TLB know if the page size changed? 
> > > > > > It should not matter, we are talking virtual to physical on behalf > > > of a device against a process address space. So the hardware should > > > not care about the page size. > > > > > > > Does that not indicate how much the device can access? Could it try > > to access more than what is mapped? > > Assuming device has huge TLB and 2MB huge page with 4K small page. > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > each covering 4K. Both case is read only and both case are pointing > to same data (ie zero). > > It is fine to delay the TLB invalidate on the device to the call of > mmu_notifier_invalidate_range_end(). The device will keep using the > huge TLB for a little longer but both CPU and device are looking at > same data. > > Now if there is a racing thread that replace one of the 512 zeor page > after the split but before mmu_notifier_invalidate_range_end() that > code path would call mmu_notifier_invalidate_range() before changing > the pte to point to something else. Which should shoot down the device > TLB (it would be a serious device bug if this did not work). OK.. This seems reasonable, but I'd really like to see if it can be tested > > > > > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > > 4K pages is replace by something new then a device TLB shootdown will > > > happen before the new page is set. > > > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > > the device TLB (you do expect that there is one) does not invalidate > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > > would be idiotic but yes i know hardware bug. 
> > > > > > > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > > > > pmd_populate(mm, &_pmd, pgtable); > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > index 1768efa4c501..63a63f1b536c 100644 > > > > > --- a/mm/hugetlb.c > > > > > +++ b/mm/hugetlb.c > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > > > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > > > > } else { > > > > > if (cow) { > > > > > + /* > > > > > + * No need to notify as we are downgrading page > > > > > + * table protection not changing it to point > > > > > + * to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > huge_ptep_set_wrprotect(src, addr, src_pte); > > > > > > > > OK.. so we could get write faults on write accesses from the device. > > > > > > > > > - mmu_notifier_invalidate_range(src, mmun_start, > > > > > - mmun_end); > > > > > } > > > > > entry = huge_ptep_get(src_pte); > > > > > ptepage = pte_page(entry); > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > > > > * and that page table be reused and filled with junk. > > > > > */ > > > > > flush_hugetlb_tlb_range(vma, start, end); > > > > > - mmu_notifier_invalidate_range(mm, start, end); > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > > > > + * page table protection not changing it to point to a new page. 
> > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > > > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > > > > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > > > > index 6cb60f46cce5..be8f4576f842 100644 > > > > > --- a/mm/ksm.c > > > > > +++ b/mm/ksm.c > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > > > > * So we clear the pte and flush the tlb before the check > > > > > * this assure us that no O_DIRECT can happen after the check > > > > > * or in the middle of the check. > > > > > + * > > > > > + * No need to notify as we are downgrading page table to read > > > > > + * only not changing it to point to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > */ > > > > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > > > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > > > > /* > > > > > * Check that no O_DIRECT or similar I/O is in progress on the > > > > > * page > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > > > > } > > > > > > > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > > > > - ptep_clear_flush_notify(vma, addr, ptep); > > > > > + /* > > > > > + * No need to notify as we are replacing a read only page with another > > > > > + * read only page with the same content. 
> > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + ptep_clear_flush(vma, addr, ptep); > > > > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > > > > > > > page_remove_rmap(page, false); > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > > > index 061826278520..6b5a0f219ac0 100644 > > > > > --- a/mm/rmap.c > > > > > +++ b/mm/rmap.c > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > > > > #endif > > > > > } > > > > > > > > > > - if (ret) { > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > + * downgrading page table protection not changing it to point > > > > > + * to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + if (ret) > > > > > (*cleaned)++; > > > > > - } > > > > > } > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > > > > + /* > > > > > + * No need to invalidate here it will synchronize on > > > > > + * against the special swap migration pte. > > > > > + */ > > > > > goto discard; > > > > > } > > > > > > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > * will take care of the rest. 
> > > > > */ > > > > > dec_mm_counter(mm, mm_counter(page)); > > > > > + /* We have to invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > > > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > > > > swp_entry_t entry; > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > + /* > > > > > + * No need to invalidate here it will synchronize on > > > > > + * against the special swap migration pte. > > > > > + */ > > > > > } else if (PageAnon(page)) { > > > > > swp_entry_t entry = { .val = page_private(subpage) }; > > > > > pte_t swp_pte; > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > WARN_ON_ONCE(1); > > > > > ret = false; > > > > > /* We have to invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > page_vma_mapped_walk_done(&pvmw); > > > > > break; > > > > > } > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > /* MADV_FREE page check */ > > > > > if (!PageSwapBacked(page)) { > > > > > if (!PageDirty(page)) { > > > > > + /* Invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, > > > > > + address, address + PAGE_SIZE); > > > > > dec_mm_counter(mm, MM_ANONPAGES); > > > > > goto discard; > > > > > } > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > - } else > > > > > + /* Invalidate as we 
cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > + } else { > > > > > + /* > > > > > + * We should not need to notify here as we reach this > > > > > + * case only from freeze_page() itself only call from > > > > > + * split_huge_page_to_list() so everything below must > > > > > + * be true: > > > > > + * - page is not anonymous > > > > > + * - page is locked > > > > > + * > > > > > + * So as it is a locked file back page thus it can not > > > > > + * be remove from the page cache and replace by a new > > > > > + * page before mmu_notifier_invalidate_range_end so no > > > > > + * concurrent thread might update its page table to > > > > > + * point at new page while a device still is using this > > > > > + * page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > dec_mm_counter(mm, mm_counter_file(page)); > > > > > + } > > > > > discard: > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() it has be > > > > > + * done above for all cases requiring it to happen under page > > > > > + * table lock before mmu_notifier_invalidate_range_end() > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > page_remove_rmap(subpage, PageHuge(page)); > > > > > put_page(page); > > > > > - mmu_notifier_invalidate_range(mm, address, > > > > > - address + PAGE_SIZE); > > > > > } > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > > > Looking at the patchset, I understand the efficiency, but I am concerned > > > > with correctness. > > > > > > I am fine in holding this off from reaching Linus but only way to flush this > > > issues out if any is to have this patch in linux-next or somewhere were they > > > get a chance of being tested. 
> > > > > > > Yep, I would like to see some additional testing around npu and get Alistair > > Popple to comment as well > > I think this patch is fine. The only one race window that it might make > bigger should have no bad consequences. > > > > > > Note that the second patch is always safe. I agree that this one might > > > not be if hardware implementation is idiotic (well that would be my > > > opinion and any opinion/point of view can be challenge :)) > > > > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > > that avoid the _start/_end and have just the only_end variant? That seemed > > reasonable to me, but I've not tested it or evaluated it in depth > > Yes, patch 2/2 in this serie is definitly fine. It invalidate the device > TLB right after clearing pte entry and avoid latter unecessary invalidation > of same TLB. > > Jérôme Balbir Singh. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f72.google.com (mail-oi0-f72.google.com [209.85.218.72]) by kanga.kvack.org (Postfix) with ESMTP id 6BF486B0038 for ; Sat, 21 Oct 2017 11:47:11 -0400 (EDT) Received: by mail-oi0-f72.google.com with SMTP id f66so14034882oib.1 for ; Sat, 21 Oct 2017 08:47:11 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id 19si974905oie.278.2017.10.21.08.47.09 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 21 Oct 2017 08:47:09 -0700 (PDT) Date: Sat, 21 Oct 2017 11:47:03 -0400 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171021154703.GA30458@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> <1508565280.5662.6.camel@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1508565280.5662.6.camel@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > jglisse@redhat.com wrote: > > > > > > > > > > > From: Jerome Glisse > > > > > > > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not changing it to point > > > > > > + * to a new page. 
> > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > if (pmdp) { > > > > > > #ifdef CONFIG_FS_DAX_PMD > > > > > > pmd_t pmd; > > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pmd = pmd_wrprotect(pmd); > > > > > > pmd = pmd_mkclean(pmd); > > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > > > > > I am assuming hardware does sane thing of setting the dirty bit only > > > > when walking the CPU page table when device does a write fault ie > > > > once the device get a write TLB entry the dirty is set by the IOMMU > > > > when walking the page table before returning the lookup result to the > > > > device and that it won't be set again latter (ie propagated back > > > > latter). > > > > > > > > > > The other possibility is that the hardware things the page is writable > > > and already > > > marked dirty. It allows writes and does not set the dirty bit? > > > > I thought about this some more and the patch can not regress anything > > that is not broken today. So if we assume that device can propagate > > dirty bit because it can cache the write protection than all current > > code is broken for two reasons: > > > > First one is current code clear pte entry, build a new pte value with > > write protection and update pte entry with new pte value. So any PASID/ > > ATS platform that allows device to cache the write bit and set dirty > > bit anytime after that can race during that window and you would loose > > the dirty bit of the device. That is not that bad as you are gonna > > propagate the dirty bit to the struct page. > > But they stay consistent with the notifiers, so from the OS perspective > it notifies of any PTE changes as they happen. 
When the ATS platform sees > invalidation, it invalidates it's PTE's as well. > > I was speaking of the case where the ATS platform could assume it has > write access and has not seen any invalidation, the OS could return > back to user space or the caller with write bit clear, but the ATS > platform could still do a write since it's not seen the invalidation.

I understood what you said, and what is above still applies. I am removing only one of the invalidations, not both. So with this patch the invalidation is delayed until after the page table lock is dropped, but it still happens before dax/page_mkclean returns. Hence any further activity will be read only on any device too once we exit those functions.

The only difference is the window during which a device can report a dirty pte. Before this patch the 2 "~bogus~" windows were small:

  First window: between pmd/pte_get_clear_flush and set_pte/pmd
  Second window: between set_pte/pmd and mmu_notifier_invalidate_range

The first window stays the same; the second window is bigger, potentially a lot bigger if the thread is preempted before mmu_notifier_invalidate_range_end.

But that is fine, as in that case the page is reported as dirty and thus we are not missing anything, and the kernel code does not care about seeing a read only pte marked as dirty.

> > > > > Second one is if the dirty bit is propagated back to the new write > > protected pte. Quick look at code it seems that when we zap pte or > > or mkclean we don't check that the pte has write permission but only > > care about the dirty bit. So it should not have any bad consequence. > > > > After this patch only the second window is bigger and thus more likely > > to happen. But nothing sinister should happen from that. > > > > > > > > > > > I should probably have spell that out and maybe some of the ATS/PASID > > > > implementer did not do that. 
> > > > > > > > > > > > > > > unlock_pmd: > > > > > > spin_unlock(ptl); > > > > > > #endif > > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pte = pte_wrprotect(pte); > > > > > > pte = pte_mkclean(pte); > > > > > > set_pte_at(vma->vm_mm, address, ptep, pte); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Ditto > > > > > > > > > > > unlock_pte: > > > > > > pte_unmap_unlock(ptep, ptl); > > > > > > } > > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > > > > > index 6866e8126982..49c925c96b8a 100644 > > > > > > --- a/include/linux/mmu_notifier.h > > > > > > +++ b/include/linux/mmu_notifier.h > > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > > > > > * shared page-tables, it not necessary to implement the > > > > > > * invalidate_range_start()/end() notifiers, as > > > > > > * invalidate_range() alread catches the points in time when an > > > > > > - * external TLB range needs to be flushed. > > > > > > + * external TLB range needs to be flushed. For more in depth > > > > > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > > > > > * > > > > > > * The invalidate_range() function is called under the ptl > > > > > > * spin-lock and not allowed to sleep. 
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > --- a/mm/huge_memory.c > > > > > > +++ b/mm/huge_memory.c > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > goto out_free_pages; > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > + * device seeing memory write in different order than CPU. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > pmd_t _pmd; > > > > > > int i; > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > + * protected page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > of a device against a process address space. 
So the hardware should > > > > not care about the page size. > > > > > > > > > > Does that not indicate how much the device can access? Could it try > > > to access more than what is mapped? > > > > Assuming device has huge TLB and 2MB huge page with 4K small page. > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > > each covering 4K. Both case is read only and both case are pointing > > to same data (ie zero). > > > > It is fine to delay the TLB invalidate on the device to the call of > > mmu_notifier_invalidate_range_end(). The device will keep using the > > huge TLB for a little longer but both CPU and device are looking at > > same data. > > > > Now if there is a racing thread that replace one of the 512 zeor page > > after the split but before mmu_notifier_invalidate_range_end() that > > code path would call mmu_notifier_invalidate_range() before changing > > the pte to point to something else. Which should shoot down the device > > TLB (it would be a serious device bug if this did not work). > > OK.. This seems reasonable, but I'd really like to see if it can be > tested

Well, it is hard to test; there are many factors. First, each device might react differently. Devices that only store TLB entries at 4K granularity are fine. A clever device that can store TLB entries for 4K, 2M, ... might ignore an invalidation that is smaller than its TLB entry, i.e. getting a 4K invalidation would not invalidate a 2MB TLB entry in the device. I consider this buggy. I will go look at the PCIE ATS specification one more time and see if there is any wording related to that. I might bring up a question to the PCIE standard body if not.

The second factor is that it is a race between the zero page split and a write fault. I can probably do a crappy patch that calls msleep() if a split happens against a given mm to increase the race window.
But i would be testing against one device (right now i can only access AMD IOMMUv2 devices with discret ATS GPU) > > > > > > > > > > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > > > 4K pages is replace by something new then a device TLB shootdown will > > > > happen before the new page is set. > > > > > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > > > the device TLB (you do expect that there is one) does not invalidate > > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > > > would be idiotic but yes i know hardware bug. > > > > > > > > > > > > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > > > > > pmd_populate(mm, &_pmd, pgtable); > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > > index 1768efa4c501..63a63f1b536c 100644 > > > > > > --- a/mm/hugetlb.c > > > > > > +++ b/mm/hugetlb.c > > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > > > > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > > > > > } else { > > > > > > if (cow) { > > > > > > + /* > > > > > > + * No need to notify as we are downgrading page > > > > > > + * table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > huge_ptep_set_wrprotect(src, addr, src_pte); > > > > > > > > > > OK.. so we could get write faults on write accesses from the device. > > > > > > > > > > > - mmu_notifier_invalidate_range(src, mmun_start, > > > > > > - mmun_end); > > > > > > } > > > > > > entry = huge_ptep_get(src_pte); > > > > > > ptepage = pte_page(entry); > > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > > > > > * and that page table be reused and filled with junk. 
> > > > > > */ > > > > > > flush_hugetlb_tlb_range(vma, start, end); > > > > > > - mmu_notifier_invalidate_range(mm, start, end); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > > > > > + * page table protection not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > > > > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > > > > > > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > > > > > index 6cb60f46cce5..be8f4576f842 100644 > > > > > > --- a/mm/ksm.c > > > > > > +++ b/mm/ksm.c > > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > > > > > * So we clear the pte and flush the tlb before the check > > > > > > * this assure us that no O_DIRECT can happen after the check > > > > > > * or in the middle of the check. > > > > > > + * > > > > > > + * No need to notify as we are downgrading page table to read > > > > > > + * only not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > */ > > > > > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > > > > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > > > > > /* > > > > > > * Check that no O_DIRECT or similar I/O is in progress on the > > > > > > * page > > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > > > > > } > > > > > > > > > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > > > > > - ptep_clear_flush_notify(vma, addr, ptep); > > > > > > + /* > > > > > > + * No need to notify as we are replacing a read only page with another > > > > > > + * read only page with the same content. 
> > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + ptep_clear_flush(vma, addr, ptep); > > > > > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > > > > > > > > > page_remove_rmap(page, false); > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > > > > index 061826278520..6b5a0f219ac0 100644 > > > > > > --- a/mm/rmap.c > > > > > > +++ b/mm/rmap.c > > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > > > > > #endif > > > > > > } > > > > > > > > > > > > - if (ret) { > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + if (ret) > > > > > > (*cleaned)++; > > > > > > - } > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. > > > > > > + */ > > > > > > goto discard; > > > > > > } > > > > > > > > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > * will take care of the rest. 
> > > > > > */ > > > > > > dec_mm_counter(mm, mm_counter(page)); > > > > > > + /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > > > > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > > > > > swp_entry_t entry; > > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. > > > > > > + */ > > > > > > } else if (PageAnon(page)) { > > > > > > swp_entry_t entry = { .val = page_private(subpage) }; > > > > > > pte_t swp_pte; > > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > WARN_ON_ONCE(1); > > > > > > ret = false; > > > > > > /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > page_vma_mapped_walk_done(&pvmw); > > > > > > break; > > > > > > } > > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > /* MADV_FREE page check */ > > > > > > if (!PageSwapBacked(page)) { > > > > > > if (!PageDirty(page)) { > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, > > > > > > + address, address + PAGE_SIZE); > > > > > > dec_mm_counter(mm, MM_ANONPAGES); > > > > > > goto discard; > > > > > > } > > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > 
set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > - } else > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > + } else { > > > > > > + /* > > > > > > + * We should not need to notify here as we reach this > > > > > > + * case only from freeze_page() itself only call from > > > > > > + * split_huge_page_to_list() so everything below must > > > > > > + * be true: > > > > > > + * - page is not anonymous > > > > > > + * - page is locked > > > > > > + * > > > > > > + * So as it is a locked file back page thus it can not > > > > > > + * be remove from the page cache and replace by a new > > > > > > + * page before mmu_notifier_invalidate_range_end so no > > > > > > + * concurrent thread might update its page table to > > > > > > + * point at new page while a device still is using this > > > > > > + * page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > dec_mm_counter(mm, mm_counter_file(page)); > > > > > > + } > > > > > > discard: > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() it has be > > > > > > + * done above for all cases requiring it to happen under page > > > > > > + * table lock before mmu_notifier_invalidate_range_end() > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > page_remove_rmap(subpage, PageHuge(page)); > > > > > > put_page(page); > > > > > > - mmu_notifier_invalidate_range(mm, address, > > > > > > - address + PAGE_SIZE); > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > > > > > Looking at the patchset, I understand the efficiency, but I am concerned > > > > > with correctness. 
> > > > > > > > I am fine in holding this off from reaching Linus but only way to flush this > > > > issues out if any is to have this patch in linux-next or somewhere were they > > > > get a chance of being tested. > > > > > > > > > > Yep, I would like to see some additional testing around npu and get Alistair > > > Popple to comment as well > > > > I think this patch is fine. The only one race window that it might make > > bigger should have no bad consequences. > > > > > > > > > Note that the second patch is always safe. I agree that this one might > > > > not be if hardware implementation is idiotic (well that would be my > > > > opinion and any opinion/point of view can be challenge :)) > > > > > > > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > > > that avoid the _start/_end and have just the only_end variant? That seemed > > > reasonable to me, but I've not tested it or evaluated it in depth > > > > Yes, patch 2/2 in this serie is definitly fine. It invalidate the device > > TLB right after clearing pte entry and avoid latter unecessary invalidation > > of same TLB. > > > > Jerome > > Balbir Singh. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f69.google.com (mail-oi0-f69.google.com [209.85.218.69]) by kanga.kvack.org (Postfix) with ESMTP id DB8976B0033 for ; Mon, 23 Oct 2017 16:35:09 -0400 (EDT) Received: by mail-oi0-f69.google.com with SMTP id s185so19464547oif.16 for ; Mon, 23 Oct 2017 13:35:09 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id y13si2488391otg.321.2017.10.23.13.35.07 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 23 Oct 2017 13:35:07 -0700 (PDT) Date: Mon, 23 Oct 2017 16:35:01 -0400 From: Jerome Glisse Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 Message-ID: <20171023203501.GA9371@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com> <20171019165823.GA3044@redhat.com> <1508565280.5662.6.camel@gmail.com> <20171021154703.GA30458@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20171021154703.GA30458@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Balbir Singh Cc: linux-mm , "linux-kernel@vger.kernel.org" , Andrea Arcangeli , Nadav Amit , Linus Torvalds , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , linux-next On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote: > On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote: > > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > > jglisse@redhat.com wrote: > > > > > > > From: Jerome Glisse [...] 
> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > > --- a/mm/huge_memory.c > > > > > > > +++ b/mm/huge_memory.c > > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > > goto out_free_pages; > > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > > + * device seeing memory write in different order than CPU. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > > pmd_t _pmd; > > > > > > > int i; > > > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > > + * protected page. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? 
> > > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > > of a device against a process address space. So the hardware should > > > > > not care about the page size. > > > > > > > > > > > > > Does that not indicate how much the device can access? Could it try > > > > to access more than what is mapped? > > > > > > Assuming device has huge TLB and 2MB huge page with 4K small page. > > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > > > each covering 4K. Both case is read only and both case are pointing > > > to same data (ie zero). > > > > > > It is fine to delay the TLB invalidate on the device to the call of > > > mmu_notifier_invalidate_range_end(). The device will keep using the > > > huge TLB for a little longer but both CPU and device are looking at > > > same data. > > > > > > Now if there is a racing thread that replace one of the 512 zeor page > > > after the split but before mmu_notifier_invalidate_range_end() that > > > code path would call mmu_notifier_invalidate_range() before changing > > > the pte to point to something else. Which should shoot down the device > > > TLB (it would be a serious device bug if this did not work). > > > > OK.. This seems reasonable, but I'd really like to see if it can be > > tested > > Well hard to test, many factors first each device might react differently. > Device that only store TLB at 4k granularity are fine. Clever device that > can store TLB for 4k, 2M, ... can ignore an invalidation that is smaller > than their TLB entry ie getting a 4K invalidation would not invalidate a > 2MB TLB entry in the device. I consider this as buggy. I will go look at > the PCIE ATS specification one more time and see if there is any wording > related that. I might bring up a question to the PCIE standard body if not. So inside PCIE ATS there is the definition of "minimum translation or invalidate size" which says 4096 bytes. 
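To make the hazard discussed above concrete, here is a minimal sketch in plain C of the behaviour a correct multi-size device TLB would need: any cached entry that overlaps the invalidated range is dropped, even when the invalidation (4K) is smaller than the entry (2MB). This is only an illustrative model and not code from the patch; `struct dev_tlb_entry`, `ranges_overlap()` and `dev_tlb_invalidate()` are hypothetical names, not kernel or PCIE ATS interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one cached translation in a device TLB. */
struct dev_tlb_entry {
	uint64_t va_start;	/* first virtual address covered by the entry */
	uint64_t size;		/* bytes covered: 4K, 2M, ... */
	bool valid;
};

/* True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
			   uint64_t b_start, uint64_t b_end)
{
	return a_start < b_end && b_start < a_end;
}

/*
 * Drop the entry if it overlaps [start, end) at all. The entry size is
 * deliberately not compared with the invalidation size: a 4K invalidation
 * that lands inside a 2MB entry must still invalidate the 2MB entry.
 */
static void dev_tlb_invalidate(struct dev_tlb_entry *e,
			       uint64_t start, uint64_t end)
{
	if (e->valid &&
	    ranges_overlap(e->va_start, e->va_start + e->size, start, end))
		e->valid = false;
}
```

A buggy device in the sense described above would instead only match entries whose size is less than or equal to the invalidated range, leaving the stale 2MB entry in place after one of the 512 small pages has been replaced.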
So my understanding is that hardware must support 4K invalidation in all cases and thus we should be safe from the possible hazard above.

But nonetheless I will repost without the optimization for huge pages to be more conservative, as anyway we want to be correct before we care about the last bit of optimization.

Cheers,
Jerome
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751811AbdJSCo3 (ORCPT ); Wed, 18 Oct 2017 22:44:29 -0400 Received: from mail-pf0-f194.google.com ([209.85.192.194]:50449 "EHLO mail-pf0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751592AbdJSCnb (ORCPT ); Wed, 18 Oct 2017 22:43:31 -0400 X-Google-Smtp-Source: ABhQp+QbE/SZmRVozdyktlAivr6lFnJVq2OulDGr3cMaeHfDv+iE/39WSoaOhUYhDmv81Ieden6j4w== Date: Thu, 19 Oct 2017 13:43:19 +1100 From: Balbir Singh To: jglisse@redhat.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Andrew Morton , Joerg Roedel , Suravee Suthikulpanit , David Woodhouse , Alistair Popple , Michael Ellerman , Benjamin Herrenschmidt , Stephen Rothwell , Andrew Donnellan , iommu@lists.linux-foundation.org, linuxppc-dev@lists.ozlabs.org Subject: Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback Message-ID: <20171019134319.1b856091@MiWiFi-R3-srv> In-Reply-To: <20171017031003.7481-1-jglisse@redhat.com> References: <20171017031003.7481-1-jglisse@redhat.com> X-Mailer: Claws Mail 3.15.1-dirty (GTK+ 2.24.31; x86_64-redhat-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by nfs id v9J2iYFK005448 On Mon, 16 Oct 2017 
23:10:01 -0400 jglisse@redhat.com wrote:

> From: Jérôme Glisse
>
> (Andrew you already have v1 in your queue of patch 1, patch 2 is new,
> i think you can drop it patch 1 v1 for v2, v2 is bit more conservative
> and i fixed typos)
>
> All this only affect user of invalidate_range callback (at this time
> CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in
> drivers/iommu/amd_iommu_v2.c|intel-svm.c)
>
> This patchset remove useless double call to mmu_notifier->invalidate_range
> callback wherever it is safe to do so. The first patch just remove useless
> call

As in an extra call? Where does that come from?

> and add documentation explaining why it is safe to do so. The second
> patch go further by introducing mmu_notifier_invalidate_range_only_end()
> which skip callback to invalidate_range this can be done when clearing a
> pte, pmd or pud with notification which call invalidate_range right after
> clearing under the page table lock.
>

Balbir Singh.

From mboxrd@z Thu Jan 1 00:00:00 1970
In-Reply-To: <20171019032811.GC5246@redhat.com>
References: <20171017031003.7481-1-jglisse@redhat.com> <20171017031003.7481-2-jglisse@redhat.com> <20171019140426.21f51957@MiWiFi-R3-srv> <20171019032811.GC5246@redhat.com>
From: Balbir Singh
Date: Thu, 19 Oct 2017 21:53:11 +1100
Subject: Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
To: Jerome Glisse
Cc: linux-mm, "linux-kernel@vger.kernel.org", Andrea Arcangeli, Nadav Amit, Linus Torvalds
, Andrew Morton, Joerg Roedel, Suravee Suthikulpanit, David Woodhouse, Alistair Popple, Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan, iommu@lists.linux-foundation.org, "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)", linux-next

On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse wrote:
> On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
>> On Mon, 16 Oct 2017 23:10:02 -0400
>> jglisse@redhat.com wrote:
>>
>> > From: Jérôme Glisse
>> >
>> > + /*
>> > + * No need to call mmu_notifier_invalidate_range() as we are
>> > + * downgrading page table protection not changing it to point
>> > + * to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > if (pmdp) {
>> > #ifdef CONFIG_FS_DAX_PMD
>> > pmd_t pmd;
>> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> > pmd = pmd_wrprotect(pmd);
>> > pmd = pmd_mkclean(pmd);
>> > set_pmd_at(vma->vm_mm, address, pmdp, pmd);
>> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
>
> I am assuming hardware does sane thing of setting the dirty bit only
> when walking the CPU page table when device does a write fault ie
> once the device get a write TLB entry the dirty is set by the IOMMU
> when walking the page table before returning the lookup result to the
> device and that it won't be set again latter (ie propagated back
> latter).
>

The other possibility is that the hardware thinks the page is writable
and already marked dirty. It then allows writes and does not set the
dirty bit?
> I should probably have spell that out and maybe some of the ATS/PASID
> implementer did not do that.
>
>>
>> > unlock_pmd:
>> > spin_unlock(ptl);
>> > #endif
>> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> > pte = pte_wrprotect(pte);
>> > pte = pte_mkclean(pte);
>> > set_pte_at(vma->vm_mm, address, ptep, pte);
>> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Ditto
>>
>> > unlock_pte:
>> > pte_unmap_unlock(ptep, ptl);
>> > }
>> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> > index 6866e8126982..49c925c96b8a 100644
>> > --- a/include/linux/mmu_notifier.h
>> > +++ b/include/linux/mmu_notifier.h
>> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
>> > * shared page-tables, it not necessary to implement the
>> > * invalidate_range_start()/end() notifiers, as
>> > * invalidate_range() alread catches the points in time when an
>> > - * external TLB range needs to be flushed.
>> > + * external TLB range needs to be flushed. For more in depth
>> > + * discussion on this see Documentation/vm/mmu_notifier.txt
>> > *
>> > * The invalidate_range() function is called under the ptl
>> > * spin-lock and not allowed to sleep.
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index c037d3d34950..ff5bc647b51d 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>> > goto out_free_pages;
>> > VM_BUG_ON_PAGE(!PageHead(page), page);
>> >
>> > + /*
>> > + * Leave pmd empty until pte is filled note we must notify here as
>> > + * concurrent CPU thread might write to new page before the call to
>> > + * mmu_notifier_invalidate_range_end() happens which can lead to a
>> > + * device seeing memory write in different order than CPU.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
>> > - /* leave pmd empty until pte is filled */
>> >
>> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
>> > pmd_populate(vma->vm_mm, &_pmd, pgtable);
>> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> > pmd_t _pmd;
>> > int i;
>> >
>> > - /* leave pmd empty until pte is filled */
>> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>> > + /*
>> > + * Leave pmd empty until pte is filled note that it is fine to delay
>> > + * notification until mmu_notifier_invalidate_range_end() as we are
>> > + * replacing a zero pmd write protected page with a zero pte write
>> > + * protected page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > + pmdp_huge_clear_flush(vma, haddr, pmd);
>>
>> Shouldn't the secondary TLB know if the page size changed?
>
> It should not matter, we are talking virtual to physical on behalf
> of a device against a process address space. So the hardware should
> not care about the page size.
>

Does that not indicate how much the device can access? Could it try
to access more than what is mapped?

> Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> 4K pages is replace by something new then a device TLB shootdown will
> happen before the new page is set.
>
> Only issue i can think of is if the IOMMU TLB (if there is one) or
> the device TLB (you do expect that there is one) does not invalidate
> TLB entry if the TLB shootdown is smaller than the TLB entry. That
> would be idiotic but yes i know hardware bug.
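
For what it is worth, the hazard described just above can be sketched in a few lines of plain userspace C. This is only an illustrative model, not kernel or driver code: struct dev_tlb_entry, dev_tlb_invalidate() and dev_tlb_invalidate_buggy() are hypothetical names, and the assumption is a device TLB that can hold both 4K and 2MB entries. A sane device drops every entry that overlaps the invalidated range; the buggy one only drops entries that fit entirely inside it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one device TLB entry (not from any real driver). */
struct dev_tlb_entry {
	uint64_t va;	/* start of the virtual range covered by the entry */
	uint64_t size;	/* 4K or 2MB in this sketch */
	bool valid;
};

/* True when the half-open ranges [a, a+alen) and [b, b+blen) overlap. */
static bool ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
	return a < b + blen && b < a + alen;
}

/*
 * Expected behaviour: drop every entry overlapping the invalidated range,
 * even when the entry (2MB) is larger than the invalidation (4K).
 * Returns the number of entries dropped.
 */
static size_t dev_tlb_invalidate(struct dev_tlb_entry *tlb, size_t n,
				 uint64_t start, uint64_t len)
{
	size_t dropped = 0;

	for (size_t i = 0; i < n; i++) {
		if (tlb[i].valid &&
		    ranges_overlap(tlb[i].va, tlb[i].size, start, len)) {
			tlb[i].valid = false;
			dropped++;
		}
	}
	return dropped;
}

/*
 * Buggy behaviour: only drop entries fully contained in the invalidated
 * range, so a 4K invalidation leaves a 2MB entry in place.
 */
static size_t dev_tlb_invalidate_buggy(struct dev_tlb_entry *tlb, size_t n,
				       uint64_t start, uint64_t len)
{
	size_t dropped = 0;

	for (size_t i = 0; i < n; i++) {
		if (tlb[i].valid && tlb[i].va >= start &&
		    tlb[i].va + tlb[i].size <= start + len) {
			tlb[i].valid = false;
			dropped++;
		}
	}
	return dropped;
}
```

With a single valid 2MB entry at 0x200000, a 4K invalidation at 0x201000 drops the entry in the first helper and leaves it valid in the second. The second case is the one that would make relying on a later, smaller shootdown unsafe.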
>
>
>>
>> >
>> > pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> > pmd_populate(mm, &_pmd, pgtable);
>> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > index 1768efa4c501..63a63f1b536c 100644
>> > --- a/mm/hugetlb.c
>> > +++ b/mm/hugetlb.c
>> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
>> > } else {
>> > if (cow) {
>> > + /*
>> > + * No need to notify as we are downgrading page
>> > + * table protection not changing it to point
>> > + * to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > huge_ptep_set_wrprotect(src, addr, src_pte);
>>
>> OK.. so we could get write faults on write accesses from the device.
>>
>> > - mmu_notifier_invalidate_range(src, mmun_start,
>> > - mmun_end);
>> > }
>> > entry = huge_ptep_get(src_pte);
>> > ptepage = pte_page(entry);
>> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>> > * and that page table be reused and filled with junk.
>> > */
>> > flush_hugetlb_tlb_range(vma, start, end);
>> > - mmu_notifier_invalidate_range(mm, start, end);
>> > + /*
>> > + * No need to call mmu_notifier_invalidate_range() we are downgrading
>> > + * page table protection not changing it to point to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > i_mmap_unlock_write(vma->vm_file->f_mapping);
>> > mmu_notifier_invalidate_range_end(mm, start, end);
>> >
>> > diff --git a/mm/ksm.c b/mm/ksm.c
>> > index 6cb60f46cce5..be8f4576f842 100644
>> > --- a/mm/ksm.c
>> > +++ b/mm/ksm.c
>> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>> > * So we clear the pte and flush the tlb before the check
>> > * this assure us that no O_DIRECT can happen after the check
>> > * or in the middle of the check.
>> > + *
>> > + * No need to notify as we are downgrading page table to read
>> > + * only not changing it to point to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > */
>> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
>> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>> > /*
>> > * Check that no O_DIRECT or similar I/O is in progress on the
>> > * page
>> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>> > }
>> >
>> > flush_cache_page(vma, addr, pte_pfn(*ptep));
>> > - ptep_clear_flush_notify(vma, addr, ptep);
>> > + /*
>> > + * No need to notify as we are replacing a read only page with another
>> > + * read only page with the same content.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > + ptep_clear_flush(vma, addr, ptep);
>> > set_pte_at_notify(mm, addr, ptep, newpte);
>> >
>> > page_remove_rmap(page, false);
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index 061826278520..6b5a0f219ac0 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>> > #endif
>> > }
>> >
>> > - if (ret) {
>> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>> > + /*
>> > + * No need to call mmu_notifier_invalidate_range() as we are
>> > + * downgrading page table protection not changing it to point
>> > + * to a new page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > + if (ret)
>> > (*cleaned)++;
>> > - }
>> > }
>> >
>> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > if (pte_soft_dirty(pteval))
>> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
>> > + /*
>> > + * No need to invalidate here it will synchronize on
>> > + * against the special swap migration pte.
>> > + */
>> > goto discard;
>> > }
>> >
>> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > * will take care of the rest.
>> > */
>> > dec_mm_counter(mm, mm_counter(page));
>> > + /* We have to invalidate as we cleared the pte */
>> > + mmu_notifier_invalidate_range(mm, address,
>> > + address + PAGE_SIZE);
>> > } else if (IS_ENABLED(CONFIG_MIGRATION) &&
>> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
>> > swp_entry_t entry;
>> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > if (pte_soft_dirty(pteval))
>> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> > set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > + /*
>> > + * No need to invalidate here it will synchronize on
>> > + * against the special swap migration pte.
>> > + */
>> > } else if (PageAnon(page)) {
>> > swp_entry_t entry = { .val = page_private(subpage) };
>> > pte_t swp_pte;
>> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > WARN_ON_ONCE(1);
>> > ret = false;
>> > /* We have to invalidate as we cleared the pte */
>> > + mmu_notifier_invalidate_range(mm, address,
>> > + address + PAGE_SIZE);
>> > page_vma_mapped_walk_done(&pvmw);
>> > break;
>> > }
>> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > /* MADV_FREE page check */
>> > if (!PageSwapBacked(page)) {
>> > if (!PageDirty(page)) {
>> > + /* Invalidate as we cleared the pte */
>> > + mmu_notifier_invalidate_range(mm,
>> > + address, address + PAGE_SIZE);
>> > dec_mm_counter(mm, MM_ANONPAGES);
>> > goto discard;
>> > }
>> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > if (pte_soft_dirty(pteval))
>> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> > set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > - } else
>> > + /* Invalidate as we cleared the pte */
>> > +
>> > + mmu_notifier_invalidate_range(mm, address,
>> > + address + PAGE_SIZE);
>> > + } else {
>> > + /*
>> > + * We should not need to notify here as we reach this
>> > + * case only from freeze_page() itself only call from
>> > + * split_huge_page_to_list() so everything below must
>> > + * be true:
>> > + * - page is not anonymous
>> > + * - page is locked
>> > + *
>> > + * So as it is a locked file back page thus it can not
>> > + * be remove from the page cache and replace by a new
>> > + * page before mmu_notifier_invalidate_range_end so no
>> > + * concurrent thread might update its page table to
>> > + * point at new page while a device still is using this
>> > + * page.
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > dec_mm_counter(mm, mm_counter_file(page));
>> > + }
>> > discard:
>> > + /*
>> > + * No need to call mmu_notifier_invalidate_range() it has be
>> > + * done above for all cases requiring it to happen under page
>> > + * table lock before mmu_notifier_invalidate_range_end()
>> > + *
>> > + * See Documentation/vm/mmu_notifier.txt
>> > + */
>> > page_remove_rmap(subpage, PageHuge(page));
>> > put_page(page);
>> > - mmu_notifier_invalidate_range(mm, address,
>> > - address + PAGE_SIZE);
>> > }
>> >
>> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>
>> Looking at the patchset, I understand the efficiency, but I am concerned
>> with correctness.
>
> I am fine in holding this off from reaching Linus but only way to flush this
> issues out if any is to have this patch in linux-next or somewhere were they
> get a chance of being tested.
>

Yep, I would like to see some additional testing around npu and get Alistair
Popple to comment as well

> Note that the second patch is always safe.
> I agree that this one might
> not be if hardware implementation is idiotic (well that would be my
> opinion and any opinion/point of view can be challenge :))

You mean the only_end variant that avoids the shootdown after pmd/pte
changes, i.e. the paths that avoid the _start/_end pair and have just the
only_end variant? That seemed reasonable to me, but I have not tested it or
evaluated it in depth.

Balbir Singh.
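
To spell out what the only_end variant changes, here is a minimal userspace model in C. It is a sketch under assumptions, not the patch itself: struct notifier_ops_sketch, range_end_sketch() and the counting callbacks are hypothetical names. The model assumes, per the cover letter, that the regular range_end path also calls invalidate_range, while the only_end path skips it because the caller already invalidated right after clearing the entry under the page table lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two callbacks discussed in this thread. */
struct notifier_ops_sketch {
	void (*invalidate_range)(void *ctx, unsigned long start,
				 unsigned long end);
	void (*invalidate_range_end)(void *ctx, unsigned long start,
				     unsigned long end);
};

/* Counts how often each callback ran, so the difference is observable. */
struct call_counts {
	int range;
	int range_end;
};

static void count_range(void *ctx, unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	((struct call_counts *)ctx)->range++;
}

static void count_range_end(void *ctx, unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	((struct call_counts *)ctx)->range_end++;
}

/*
 * Model of the end-of-range notification: with only_end set, the
 * invalidate_range callback is skipped and only invalidate_range_end runs.
 */
static void range_end_sketch(const struct notifier_ops_sketch *ops, void *ctx,
			     unsigned long start, unsigned long end,
			     bool only_end)
{
	if (!only_end && ops->invalidate_range != NULL)
		ops->invalidate_range(ctx, start, end);
	if (ops->invalidate_range_end != NULL)
		ops->invalidate_range_end(ctx, start, end);
}
```

Calling range_end_sketch() once with only_end false and once with it true, using the counting callbacks, shows invalidate_range running a single time while invalidate_range_end runs twice. That skipped second call is the saving the second patch is after.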