References: <20230613102905.2808371-1-usama.anjum@collabora.com> <20230613102905.2808371-3-usama.anjum@collabora.com> <0db01d90-09d6-08a4-bbb8-70670d3baa94@collabora.com> <34203acf-7270-7ade-a60e-ae0f729dcf70@collabora.com> <96b7cc00-d213-ad7d-1b48-b27f75b04d22@collabora.com>
In-Reply-To: <96b7cc00-d213-ad7d-1b48-b27f75b04d22@collabora.com>
From: Michał Mirosław
Date: Thu, 15 Jun 2023 16:52:40 +0200
Subject: Re: [PATCH v18 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
To: Muhammad Usama Anjum
Cc: Peter Xu, David Hildenbrand, Andrew Morton, Andrei Vagin, Danylo Mocherniuk, Paul Gofman, Cyrill Gorcunov, Mike Rapoport, Nadav Amit, Alexander Viro, Shuah Khan, Christian Brauner, Yang Shi, Vlastimil Babka, "Liam R . Howlett", Yun Zhou, Suren Baghdasaryan, Alex Sierra, Matthew Wilcox, Pasha Tatashin, Axel Rasmussen, "Gustavo A . R .
 Silva", Dan Williams, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH, kernel@collabora.com
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Thu, 15 Jun 2023 at 15:58, Muhammad Usama Anjum wrote:
> I'll send next revision now.
> On 6/14/23 11:00 PM, Michał Mirosław wrote:
> > (A quick reply to answer open questions in case they help the next version.)
> >
> > On Wed, 14 Jun 2023 at 19:10, Muhammad Usama Anjum wrote:
> >> On 6/14/23 8:14 PM, Michał Mirosław wrote:
> >>> On Wed, 14 Jun 2023 at 15:46, Muhammad Usama Anjum wrote:
> >>>>
> >>>> On 6/14/23 3:36 AM, Michał Mirosław wrote:
> >>>>> On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum wrote:
> > [...]
> >>>>>> +       if (cur_buf->bitmap == bitmap &&
> >>>>>> +           cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {
> >>>>>> +               cur_buf->len += n_pages;
> >>>>>> +               p->found_pages += n_pages;
> >>>>>> +       } else {
> >>>>>> +               if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len)
> >>>>>> +                       return -ENOMEM;
> >>>>>
> >>>>> Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel
> >>>>> ran out of memory when allocating, not that there is no space in a
> >>>>> user-provided buffer.
> >>>> There are 3 kinds of return values here:
> >>>> * PM_SCAN_FOUND_MAX_PAGES (1) ---> max_pages have been found. Abort the
> >>>>   page walk from next entry
> >>>> * 0 ---> continue the page walk
> >>>> * -ENOMEM --> Abort the page walk from current entry, user buffer is full
> >>>>   which is not error, but only a stop signal. This -ENOMEM is just
> >>>>   differentiator from (1). This -ENOMEM is for internal use and isn't
> >>>>   returned to user.
> >>>
> >>> But why is ENOSPC not good here? It was used before, I think.
> >> -ENOSPC is being returned in form of true error from
> >> pagemap_scan_hugetlb_entry(). So I had to remove -ENOSPC from here as it
> >> wasn't true error here, it was only a way to abort the walk immediately.
> >> I'm liking the following return code from here now:
> >>
> >> #define PM_SCAN_BUFFER_FULL (-256)
> >
> > I guess this will be reworked anyway, but I'd prefer this didn't need
> > custom errors etc. If we agree to decoupling the selection and GET
> > output, it could be:
> >
> > bool is_interesting_page(p, flags); // this one does the
> > required/anyof/excluded match
> > size_t output_range(p, start, len, flags); // this one fills the
> > output vector and returns how many pages were fit
> >
> > In this setup, `is_interesting_page() && (n_out = output_range()) <
> > n_pages` means this is the final range, no more will fit. And if
> > `n_out == 0` then no pages fit and no WP is needed (no other special
> > cases).
> Right now, pagemap_scan_output() performs the work of both of these two
> functions. The part can be broken into is_interesting_page() and we can
> leave the remaining part as it is.
>
> Saying that n_out < n_pages tells us the buffer is full covers one case.
> But there is case of maximum pages have been found and walk needs to be
> aborted.

This case is exactly what `n_out < n_pages` will cover (if scan_output
uses max_pages properly to limit n_out).
Isn't it that when the buffer is full we want to abort the scan always
(with WP if `n_out > 0`)?

> >>>>> For flags name: PM_REQUIRE_WRITE_ACCESS?
> >>>>> Or is it intended to be checked only if doing WP (as the current name
> >>>>> suggests) and so it would be redundant as WP currently requires
> >>>>> `p->required_mask = PAGE_IS_WRITTEN`?
> >>>> This is intended to indicate whether userfaultfd is needed. If
> >>>> PAGE_IS_WRITTEN is mentioned in any of the masks, we need to check if
> >>>> userfaultfd has been initialized for this memory. I'll rename to
> >>>> PM_SCAN_REQUIRE_UFFD.
> >>>
> >>> Why do we need that check? Wouldn't `is_written = false` work for vmas
> >>> not registered via uffd?
> >> UFFD_FEATURE_WP_ASYNC and UNPOPULATED need to be set on the memory region
> >> for it to report correct written values on the memory region. Without UFFD
> >> WP ASYNC and UNPOPULATED defined on the memory, we consider UFFD_WP state
> >> undefined. If user hasn't initialized memory with UFFD, he has no right to
> >> set is_written = false.
> >
> > How about calculating `is_written = is_uffd_registered() &&
> > is_uffd_wp()`? This would enable a user to apply GET+WP for the whole
> > address space of a process regardless of whether all of it is
> > registered.
> I wouldn't want to check if uffd is registered again and again. This is why
> we are doing it only once every walk in pagemap_scan_test_walk().

There is no need to do the checks repeatedly. If I understand the code
correctly, uffd registration is per-vma, so it can be communicated
from test_walk to entry/hole callbacks via a field in
pagemap_scan_private.

> >>> While here, I wonder if we really need to fail the call if there are
> >>> unknown bits in those masks set: if this bit set is expanded with
> >>> another category flags, a newer userspace run on older kernel would
> >>> get EINVAL even if the "treat unknown as 0" be what it requires.
> >>> There is no simple way in the API to discover what bits the kernel
> >>> supports. We could allow a no-op (no WP nor GET) call to help with
> >>> that and then rejecting unknown bits would make sense.
> >> I've not seen any examples of this. But I've seen examples of returning
> >> error if kernel doesn't support a feature. Each new feature comes with a
> >> kernel version, greater than this version support this feature. If user is
> >> trying to use advanced feature which isn't present in a kernel, we should
> >> return error and not proceed to confuse the user/kernel.
> >> In fact if we look at userfaultfd_api(), we return error immediately
> >> if feature has some bit set which kernel doesn't support.
> >
> > I think we should have a way of detecting the supported flags if we
> > don't want a forward compatibility policy for flags here. Maybe it
> > would be enough to allow all the no-op combinations for this purpose?
> Again I don't think UFFD is doing anything like this.

If it's cheap and easy to provide a user with a way to detect the
supported features - why not do it?

Best Regards
Michał Mirosław
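
P.S. A rough user-space sketch of the is_interesting_page() /
output_range() split discussed above, in case it helps. This is only an
illustration under assumptions, not code from the patch: struct
scan_private, struct page_region, their fields, SKETCH_PAGE_SIZE, the
"max_pages == 0 means no limit" convention and the exact
required/anyof/excluded semantics are all made up here. The point it
tries to show is that a short count from output_range() can stand in
for both "vector full" and "max_pages reached", so no custom error code
is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size, illustration only */

/* Hypothetical stand-ins, not the structures from the patch. */
struct page_region {
	uint64_t start;		/* start address of the region */
	uint64_t len;		/* length in pages; 0 means the slot is unused */
	uint64_t bitmap;	/* category flags shared by all pages in it */
};

struct scan_private {
	uint64_t required_mask, anyof_mask, excluded_mask;
	struct page_region *vec;	/* output vector, must start zeroed */
	size_t vec_len;			/* capacity of vec */
	size_t vec_index;		/* slot currently being filled */
	size_t found_pages;		/* pages reported so far */
	size_t max_pages;		/* assumed: 0 means no limit */
};

/* Selection only: all required bits, any of anyof (if set), none excluded. */
static bool is_interesting_page(const struct scan_private *p, uint64_t flags)
{
	if ((flags & p->required_mask) != p->required_mask)
		return false;
	if (p->anyof_mask && !(flags & p->anyof_mask))
		return false;
	return !(flags & p->excluded_mask);
}

/*
 * Output only: record up to n_pages starting at start and return how many
 * pages fit. A result smaller than n_pages means the walk should stop.
 */
static size_t output_range(struct scan_private *p, uint64_t start,
			   size_t n_pages, uint64_t flags)
{
	struct page_region *cur;

	if (p->max_pages) {
		assert(p->found_pages <= p->max_pages);
		if (n_pages > p->max_pages - p->found_pages)
			n_pages = p->max_pages - p->found_pages;
	}
	if (!n_pages || !p->vec_len)
		return 0;

	cur = &p->vec[p->vec_index];
	if (cur->len && cur->bitmap == flags &&
	    cur->start + cur->len * SKETCH_PAGE_SIZE == start) {
		cur->len += n_pages;	/* contiguous, same flags: extend */
	} else {
		if (cur->len) {		/* slot in use: move to the next one */
			if (p->vec_index + 1 >= p->vec_len)
				return 0;	/* vector full, nothing fit */
			cur = &p->vec[++p->vec_index];
		}
		cur->start = start;
		cur->len = n_pages;
		cur->bitmap = flags;
	}
	p->found_pages += n_pages;
	return n_pages;
}
```

A caller would then do roughly: if is_interesting_page() is true, call
output_range(); WP the first n_out pages if n_out > 0; and stop the walk
whenever n_out < n_pages.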