Subject: Re: [PATCH v2] mm: Fix checking unmapped holes for mbind
From: Li Xinhai
To: Naoya Horiguchi, Li Xinhai
Cc: linux-mm@kvack.org, akpm, Vlastimil Babka, Michal Hocko,
 linux-kernel@vger.kernel.org, Linux API, Hugh Dickins
Date: Thu, 31 Oct 2019 20:32:01 +0800
Message-ID: <696e5cdb-5292-c5fb-43f4-32c91ae82607@126.com>
In-Reply-To: <20191031090621.GA8196@hori.linux.bs1.fc.nec.co.jp>
References: <201910291756045288126@gmail.com>
 <2019103111150700409251@gmail.com>
 <20191031090621.GA8196@hori.linux.bs1.fc.nec.co.jp>

On 2019/10/31 17:06, Naoya Horiguchi wrote:
> On Thu, Oct 31, 2019 at 11:15:09AM +0800, Li Xinhai wrote:
>> On 2019-10-29 at 17:56 Li Xinhai wrote:
>>> queue_pages_range() checks for unmapped holes besides queueing pages
>>> for migration. The rules for checking unmapped holes are:
>>> 1 Unmapped holes at any part of the specified range must be reported as
>>>   EFAULT when mbind() is called for any non-MPOL_DEFAULT policy;
>>> 2 Unmapped holes at any part of the specified range must be ignored when
>>>   mbind() is called for MPOL_DEFAULT.
>>> Note that the second rule describes the current behaviour in all cases,
>>> which conflicts with the Linux API definition.
>>>
>>> queue_pages_test_walk() is fixed by introducing new fields in struct
>>> queue_pages which help to check:
>>> 1 holes at the head and tail of the specified range;
>>> 2 whether the whole range falls in a hole.
>>>
>>> Besides, queue_pages_test_walk() must update the previous-vma record
>>> regardless of whether the current vma is considered for queueing pages.
>>>
>>
>> More details about the current issue (which breaks the mbind() API
>> definition):
>> 1. In queue_pages_test_walk(), the check (!vma->vm_next && vma->vm_end
>> < end) can never succeed, because the 'end' passed from walk_page_test()
>> guarantees "end <= vma->vm_end", so a hole beyond the last vma cannot be
>> detected.
>>
>> 2. queue_pages_test_walk() is only called for vmas, and its 'start' and
>> 'end' parameters are guaranteed to lie within the vma. So if the range
>> starts or ends in an unmapped hole, queue_pages_test_walk() never gets a
>> chance to check it. In other words, the current code can detect this
>> case (range spanning a hole):
>> [  vma  ][ hole ][ vma ]
>>    [       range       ]
>> but cannot detect these cases:
>> [  vma  ][ hole ][ vma ]
>>    [  range  ]
>> [  vma  ][ hole ][  vma  ]
>>              [  range  ]
> 
> IIRC, the page table walker (walk_page_range()) is designed to handle
> these range inputs by splitting them into sub-ranges at vma boundaries,
> like below (with your notation):
> 
> [ vma  ][ hole ][ vma  ]
> [      ][      ][      ]   // for your 1st case
> [      ][      ]           // for your 2nd case
>         [      ][      ]   // for your 3rd case
> 

Yes, pagewalk works exactly as you describe.

> And I found that .pte_hole is undefined in queue_pages_walk_ops.
> So I'm guessing now that that's why hole regions are ignored and
> the EFAULT behavior defined in the manpage is violated.
> So providing a proper .pte_hole callback could be another approach
> for this issue, one which might fit the design better.

Using .pte_hole can be one option; we could stop the walk as soon as
.pte_hole is first called for a hole which lies outside any vma. But
.pte_hole would be called many times during the walk, because it is
called for holes both inside and outside vmas. .test_walk() is called
once per vma and in no other situation, so it is cheaper. A rough
sketch of the .pte_hole approach follows below.
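For concreteness, here is an untested sketch of my own (not the patch
below) of what such a callback might look like, assuming the current
pagewalk API, in which walk->vma is NULL while the walker crosses a
range that no vma covers:

/*
 * Hypothetical sketch only: report unmapped holes from .pte_hole
 * instead of .test_walk.
 */
static int queue_pages_pte_hole(unsigned long addr, unsigned long next,
				struct mm_walk *walk)
{
	struct queue_pages *qp = walk->private;

	/* Empty page tables inside a vma are not an unmapped hole. */
	if (walk->vma)
		return 0;

	/* A real gap between vmas: fail unless holes are permitted. */
	if (!(qp->flags & MPOL_MF_DISCONTIG_OK))
		return -EFAULT;

	return 0;
}

/* and in queue_pages_walk_ops:
 *	.pte_hole	= queue_pages_pte_hole,
 */

A negative return from .pte_hole aborts the walk and propagates the
error, so the walk would still stop at the first real hole; the cost
is the extra calls for empty page tables inside vmas, as noted above.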
> IOW, queue_pages_test_walk() might not be the right place to handle
> hole regions, by definition.

I guess the original design was meant to avoid walking those vmas twice,
once for testing holes and once for queueing pages. Maybe we can find a
better choice?

> 
> Thanks,
> Naoya Horiguchi
> 

>>
>> 3. A check in mbind_range() tries to recover if the hole is at the head
>> of the range, but it cannot recover if the hole is at the tail.
>>
>> - Xinhai
>>
>>> Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
>>> Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()")
>>> Fixes: 48684a65b4e3 ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)")
>>> Signed-off-by: Li Xinhai
>>> ---
>>> Changes in v2:
>>>   - Fix the unmapped holes checking in queue_pages_test_walk() instead
>>>     of mbind_range().
>>>
>>>  mm/mempolicy.c | 44 +++++++++++++++++++++++++++++---------------
>>>  1 file changed, 29 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index 4ae967bcf954..24087dfa4dcd 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -411,6 +411,9 @@ struct queue_pages {
>>>  	unsigned long flags;
>>>  	nodemask_t *nmask;
>>>  	struct vm_area_struct *prev;
>>> +	unsigned long start;
>>> +	unsigned long end;
>>> +	int in_hole;
>>>  };
>>>  
>>>  /*
>>> @@ -618,28 +621,31 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
>>>  	unsigned long endvma = vma->vm_end;
>>>  	unsigned long flags = qp->flags;
>>>  
>>> -	/*
>>> -	 * Need check MPOL_MF_STRICT to return -EIO if possible
>>> -	 * regardless of vma_migratable
>>> -	 */
>>> -	if (!vma_migratable(vma) &&
>>> -	    !(flags & MPOL_MF_STRICT))
>>> -		return 1;
>>> -
>>> +	/* range check first */
>>>  	if (endvma > end)
>>>  		endvma = end;
>>> -	if (vma->vm_start > start)
>>> -		start = vma->vm_start;
>>> +	BUG_ON((vma->vm_start > start) || (vma->vm_end < end));
>>>  
>>> +	qp->in_hole = 0;
>>>  	if (!(flags & MPOL_MF_DISCONTIG_OK)) {
>>> -		if (!vma->vm_next && vma->vm_end < end)
>>> +		if ((!vma->vm_next && vma->vm_end < qp->end) ||
>>> +		    (vma->vm_next && qp->end < vma->vm_next->vm_start))
>>>  			return -EFAULT;
>>> -		if (qp->prev && qp->prev->vm_end < vma->vm_start)
>>> +		if ((qp->prev && qp->prev->vm_end < vma->vm_start) ||
>>> +		    (!qp->prev && qp->start < vma->vm_start))
>>>  			return -EFAULT;
>>>  	}
>>>  
>>>  	qp->prev = vma;
>>>  
>>> +	/*
>>> +	 * Need check MPOL_MF_STRICT to return -EIO if possible
>>> +	 * regardless of vma_migratable
>>> +	 */
>>> +	if (!vma_migratable(vma) &&
>>> +	    !(flags & MPOL_MF_STRICT))
>>> +		return 1;
>>> +
>>>  	if (flags & MPOL_MF_LAZY) {
>>>  		/* Similar to task_numa_work, skip inaccessible VMAs */
>>>  		if (!is_vm_hugetlb_page(vma) &&
>>> @@ -679,14 +685,23 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
>>>  		nodemask_t *nodes, unsigned long flags,
>>>  		struct list_head *pagelist)
>>>  {
>>> +	int err;
>>>  	struct queue_pages qp = {
>>>  		.pagelist = pagelist,
>>>  		.flags = flags,
>>>  		.nmask = nodes,
>>>  		.prev = NULL,
>>> +		.start = start,
>>> +		.end = end,
>>> +		.in_hole = 1,
>>>  	};
>>>  
>>> -	return walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp);
>>> +	err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp);
>>> +	/* whole range in unmapped hole */
>>> +	if (qp.in_hole && !(flags & MPOL_MF_DISCONTIG_OK))
>>> +		err = -EFAULT;
>>> +
>>> +	return err;
>>>  }
>>>  
>>>  /*
>>> @@ -738,8 +753,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
>>>  	unsigned long vmend;
>>>  
>>>  	vma = find_vma(mm, start);
>>> -	if (!vma || vma->vm_start > start)
>>> -		return -EFAULT;
>>> +	BUG_ON(!vma);
>>>  
>>>  	prev = vma->vm_prev;
>>>  	if (start > vma->vm_start)
>>> -- 
>>> 2.22.0
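As a footnote, here is a minimal, untested userspace sketch of the
behaviour under discussion (my own illustration, not part of the patch;
it assumes a NUMA-enabled kernel with a node 0). Per the manpage,
mbind() with a non-MPOL_DEFAULT policy must fail with EFAULT when the
range contains an unmapped hole:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_BIND */

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	unsigned long nodemask = 1;	/* bind to node 0 */

	/* Three pages, then punch out the middle: [ vma ][ hole ][ vma ] */
	char *p = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED || munmap(p + pg, pg))
		return 1;

	/* Range spans the hole: EFAULT is the documented result. */
	if (syscall(SYS_mbind, p, 3 * pg, MPOL_BIND, &nodemask,
		    8 * sizeof(nodemask), 0) == -1)
		printf("mbind failed: %s\n", strerror(errno));
	else
		printf("mbind succeeded: hole was not detected\n");
	return 0;
}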