Subject: Re: [PATCH v4] mm: shmem: implement POSIX_FADV_[WILL|DONT]NEED for shmem
From: Charan Teja Kalla <quic_charante@quicinc.com>
Date: Mon, 21 Feb 2022 20:59:45 +0530
Message-ID: <00f23e1c-aef3-e95c-3f0c-0dd9e21e8d1b@quicinc.com>
In-Reply-To: <1644572051-24091-1-git-send-email-quic_charante@quicinc.com>

Just a gentle ping. Please help in providing your inputs here.

Thanks,

On 2/11/2022 3:04 PM, Charan Teja Kalla wrote:
> Currently fadvise(2) is supported only for files that are not
> associated with noop_backing_dev_info; for other files, such as shmem
> files, fadvise results in a NOP. However, file_operations->fadvise()
> lets a file system provide its own fadvise implementation. Use that
> hook to implement some of the POSIX_FADV_XXX functionality for shmem
> files.
>
> This patch implements the POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
> advice for shmem files, which can be helpful for drivers that want to
> manage the shmem pages of files created through
> shmem_file_setup[_with_mnt](). As an example use case, a driver can
> create a shmem file of the size it requires, map the pages for DMA,
> and then pass the fd to user space. The user, who knows how these
> pages are used, can push them to swap through DONTNEED when they are
> not required, freeing up memory well in advance rather than relying on
> reclaim, and use WILLNEED when it decides they will be useful in the
> near future. IOW, it lets clients free up or read back the memory when
> they want to. Another use case is that GEM objects, which are
> currently allocated and managed through shmem files, can issue
> vfs_fadvise(DONT|WILLNEED) on the shmem fd when the driver learns (for
> example, through hints from user space) that the objects will not be
> used, or will be needed, in the near future.
>
> Some questions asked while reviewing this patch:
>
> Q) Can the same thing be achieved by mapping the fd to user space and
> using madvise?
> A) Not all drivers map all their shmem fds to user space; some want to
> manage them within the kernel. For example, shmem memory can be mapped
> into one subsystem, which fills in the data and hands it to another
> subsystem for further processing, with no user mapping required at
> all. A simple example is memory given to the GPU subsystem, which can
> be filled directly and then handed to the display subsystem. The
> respective drivers know best when to keep that memory in RAM or in
> swap, based on, say, user activity.
>
> Q) Should we add a section to the manual pages?
> A) Whatever the man page [1] says for fadvise() also applies to shmem
> files, so it does not seem right to document shmem files separately.
> [1] https://linux.die.net/man/2/fadvise
>
> Q) The proposed semantics of POSIX_FADV_DONTNEED are actually similar
> to MADV_PAGEOUT and different from MADV_DONTNEED. This is a
> user-facing API; won't this difference cause confusion?
> A) The man page [1] says that "POSIX_FADV_DONTNEED attempts to free
> cached pages associated with the specified region," i.e. on issuing
> this advice the file cache pages are expected to be freed. It is
> implementation-defined whether writeback of dirty pages is attempted,
> and unwritten dirty pages will not be freed. So FADV_DONTNEED also
> covers the semantics of MADV_PAGEOUT for file pages, and there is no
> purpose for PAGEOUT on file pages.
> [1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html
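
[ Adding a short user-space sketch of the intended usage here for
reviewers; it is illustrative only and not part of the patch.
memfd_create() merely stands in for an fd that a driver would hand
out, and all names and sizes below are made up. ]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const off_t len = 64 << 20;	/* 64 MiB shmem-backed buffer */
	int fd = memfd_create("example-buf", 0);
	int err;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;

	/* ... fill the buffer, let the device consume it, ... */

	/* Buffer idle for a while: ask the kernel to push it to swap. */
	err = posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "DONTNEED: %s\n", strerror(err));

	/* Buffer needed again soon: bring the pages back in. */
	err = posix_fadvise(fd, 0, len, POSIX_FADV_WILLNEED);
	if (err)
		fprintf(stderr, "WILLNEED: %s\n", strerror(err));

	close(fd);
	return 0;
}

With the patch applied, the DONTNEED call pushes the pages of the given
range towards swap and frees them, and the WILLNEED call reads
swapped-out pages back in; without it, both calls are no-ops for shmem.
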
>
> Signed-off-by: Charan Teja Kalla
> ---
> Changes in V4:
> -- Changed the code to use reclaim_pages() to write the shmem pages
>    out to swap and then reclaim them.
> -- Addressed comments from Mark Hemment and Matthew.
> -- fadvise() on a shmem file may now even unmap a page.
>
> Changes in V3:
> -- Considered THP pages while doing FADVISE_[DONT|WILL]NEED, as
>    identified by Matthew.
> -- xarray used properly, as identified by Matthew.
> -- Excluded mapped pages, as they require unmapping and the fadvise
>    man page doesn't talk about them.
> -- RESEND: Fixed the compilation issue when CONFIG_TMPFS is not defined.
> -- https://patchwork.kernel.org/project/linux-mm/patch/1641488717-13865-1-git-send-email-quic_charante@quicinc.com/
>
> Changes in V2:
> -- Rearranged the code so as not to sleep with the rcu lock held while
>    using the xas_() functionality.
> -- Addressed the comments from Suren.
> -- https://patchwork.kernel.org/project/linux-mm/patch/1638442253-1591-1-git-send-email-quic_charante@quicinc.com/
>
> Changes in V1:
> -- Created the interface for fadvise(2) to work on shmem files.
> -- https://patchwork.kernel.org/project/linux-mm/patch/1633701982-22302-1-git-send-email-charante@codeaurora.org/
>
>  mm/shmem.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 129 insertions(+)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 18f93c2..fe475af 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -39,6 +39,9 @@
>  #include
>  #include
>  #include
> +#include
> +#include
> +#include
>
>  static struct vfsmount *shm_mnt;
>
> @@ -2275,6 +2278,131 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>  	return 0;
>  }
>
> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
> +			loff_t end, struct list_head *list)
> +{
> +	XA_STATE(xas, &mapping->i_pages, start);
> +	struct page *page;
> +
> +	rcu_read_lock();
> +	xas_for_each(&xas, page, end) {
> +		if (xas_retry(&xas, page))
> +			continue;
> +		if (xa_is_value(page))
> +			continue;
> +
> +		if (!get_page_unless_zero(page))
> +			continue;
> +		if (isolate_lru_page(page)) {
> +			put_page(page);
> +			continue;
> +		}
> +		put_page(page);
> +
> +		if (PageUnevictable(page) || page_mapcount(page) > 1) {
> +			putback_lru_page(page);
> +			continue;
> +		}
> +
> +		/*
> +		 * Prepare the page to be passed to reclaim_pages().
> +		 * The VM can't reclaim the page unless we clear PG_young.
> +		 * Also, to ensure that the pages are written back before
> +		 * reclaiming, the page is set dirty.
> +		 * Since we are not clearing pte_young in the mapped page
> +		 * ptes, reclaim of those pages may not be attempted.
> +		 */
> +		ClearPageReferenced(page);
> +		test_and_clear_page_young(page);
> +		SetPageDirty(page);
> +		list_add(&page->lru, list);
> +		if (need_resched()) {
> +			xas_pause(&xas);
> +			cond_resched_rcu();
> +		}
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
> +				loff_t end)
> +{
> +	LIST_HEAD(list);
> +
> +	if (!shmem_mapping(mapping))
> +		return -EINVAL;
> +
> +	if (!total_swap_pages)
> +		return 0;
> +
> +	lru_add_drain();
> +	shmem_isolate_pages_range(mapping, start, end, &list);
> +	reclaim_pages(&list);
> +
> +	return 0;
> +}
> +
> +static int shmem_fadvise_willneed(struct address_space *mapping,
> +				pgoff_t start, pgoff_t end)
> +{
> +	struct page *page;
> +	pgoff_t index;
> +
> +	xa_for_each_range(&mapping->i_pages, index, page, start, end) {
> +		if (!xa_is_value(page))
> +			continue;
> +		page = shmem_read_mapping_page(mapping, index);
> +		if (!IS_ERR(page))
> +			put_page(page);
> +	}
> +
> +	return 0;
> +}
> +
> +static int shmem_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> +{
> +	loff_t endbyte;
> +	pgoff_t start_index;
> +	pgoff_t end_index;
> +	struct address_space *mapping;
> +	int ret = 0;
> +
> +	mapping = file->f_mapping;
> +	if (!mapping || len < 0)
> +		return -EINVAL;
> +
> +	endbyte = (u64)offset + (u64)len;
> +	if (!len || endbyte < len)
> +		endbyte = -1;
> +	else
> +		endbyte--;
> +
> +	start_index = offset >> PAGE_SHIFT;
> +	end_index = endbyte >> PAGE_SHIFT;
> +	switch (advice) {
> +	case POSIX_FADV_DONTNEED:
> +		ret = shmem_fadvise_dontneed(mapping, start_index, end_index);
> +		break;
> +	case POSIX_FADV_WILLNEED:
> +		ret = shmem_fadvise_willneed(mapping, start_index, end_index);
> +		break;
> +	case POSIX_FADV_NORMAL:
> +	case POSIX_FADV_RANDOM:
> +	case POSIX_FADV_SEQUENTIAL:
> +	case POSIX_FADV_NOREUSE:
> +		/*
> +		 * No bad return value, but ignore advice. May have to
> +		 * implement in future.
> +		 */
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
>  static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
>  			umode_t mode, dev_t dev, unsigned long flags)
>  {
> @@ -3777,6 +3905,7 @@ static const struct file_operations shmem_file_operations = {
>  	.splice_write	= iter_file_splice_write,
>  	.fallocate	= shmem_fallocate,
>  #endif
> +	.fadvise	= shmem_fadvise,
>  };
>
>  static const struct inode_operations shmem_inode_operations = {
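
[ For the in-kernel users mentioned above (GEM and similar drivers that
keep the shmem file private), the call site would look roughly like the
sketch below. It is only an illustration of how the new f_op would be
reached, not code from this series; the helper names and parameters are
made up. ]

#include <linux/fadvise.h>
#include <linux/fs.h>
#include <linux/shmem_fs.h>

/* Hypothetical driver helper: push a whole idle shmem buffer to swap. */
static int example_buf_mark_idle(struct file *shmem_file, loff_t size)
{
	/* shmem_file was created earlier with shmem_file_setup(). */
	return vfs_fadvise(shmem_file, 0, size, POSIX_FADV_DONTNEED);
}

/* Hypothetical driver helper: bring the buffer back before reuse. */
static int example_buf_mark_active(struct file *shmem_file, loff_t size)
{
	return vfs_fadvise(shmem_file, 0, size, POSIX_FADV_WILLNEED);
}

vfs_fadvise() already dispatches to file->f_op->fadvise() when it is
set, so no further plumbing should be needed on the driver side once
shmem provides the handler.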