From mboxrd@z Thu Jan 1 00:00:00 1970
From: NeilBrown
To: Steven Whitehouse, Michal Hocko, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org, linux-ext4@vger.kernel.org, Theodore Ts'o, linux-ntfs-dev@lists.sourceforge.net, LKML, Dave Chinner, reiserfs-devel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, logfs@logfs.org, cluster-devel@redhat.com, Chris Mason, linux-mtd@lists.infradead.org, Jan Kara, Andrew Morton, xfs@oss.sgi.com, ceph-devel@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-afs@lists.infradead.org, cluster-devel
Subject: Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
Date: Sun, 01 May 2016 07:17:56 +1000
Message-ID: <87wpneu77f.fsf@notabene.neil.brown.name>
In-Reply-To: <57233571.50509@redhat.com>
References: <1461671772-1269-1-git-send-email-mhocko@kernel.org> <8737q5ugcx.fsf@notabene.neil.brown.name> <57233571.50509@redhat.com>

On Fri, Apr 29 2016, Steven Whitehouse wrote:

> Hi,
>
> On 29/04/16 06:35, NeilBrown wrote:
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
> evict() is an issue, but moving it into kswapd would be a potential
> problem for GFS2. We already have a memory allocation issue when
> recovery is taking place and memory is short. The code path is as follows:
>
>  1. Inode is scheduled for eviction (which requires deallocation)
>  2. The glock is required in order to perform the deallocation, which
>     implies getting a DLM lock
>  3. Another node in the cluster fails, so needs recovery
>  4. When the DLM lock is requested, it gets blocked until recovery is
>     complete (for the failed node)
>  5. Recovery is performed using a userland fencing utility
>  6. Fencing requires memory and then blocks on the eviction
>  7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on
>     DLM lock, DLM lock waiting on fencing)

You even have user-space in the loop there - impressive!  You can't
really pass GFP_NOFS to a user-space thread, can you :-?

>
> It doesn't happen often, but we've been looking at the best place to
> break that cycle, and one of the things we've been wondering is whether
> we could avoid deallocation evictions from memory related contexts, or
> at least make it async somehow.

I think "async" is definitely the answer and I think
evict()/evict_inode() is the best place to focus attention.

I can see now (thanks) that just moving the evict() call to kswapd isn't
really a solution as it will just serve to block kswapd and so lots of
other freeing of memory won't happen.

I'm now imagining giving ->evict_inode() a "don't sleep" flag and
allowing it to return -EAGAIN.  In that case evict() would queue the
inode to kswapd (or maybe another thread) for periodic retry.

The flag would only get set when prune_icache_sb() calls dispose_list()
to call evict().  Other uses (e.g. unmount, iput) would still be
synchronous.

How difficult would it be to change GFS2's evict_inode() to optionally
never block?

For this to work we would need to add a way for
deactivate_locked_super() to wait for all the async evictions to
complete.  Currently prune_icache_sb() is called under s_umount.  If we
moved part of the eviction out of that lock some other synchronization
would be needed.  Possibly a per-superblock list of "inodes being
evicted" would suffice.

Thanks,
NeilBrown

>
> The issue is that it is not possible to know in advance whether an
> eviction will result in merely writing things back to disk (because the
> inode is being dropped from cache, but still resides on disk) which is
> easy to do, or whether it requires a full deallocation (n_link==0) which
> may require significant resources and time,
>
> Steve.
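The "don't sleep" proposal in the message above can be illustrated with a toy user-space model. This is only a sketch in Python, not kernel code: the classes `Inode` and `SuperBlock`, the `nosleep` parameter, the `lock_available` field, and the helpers `retry_deferred()` and `wait_for_evictions()` are all hypothetical names invented for illustration. Only the names `evict()`, `evict_inode()`, `dispose_list()` and the -EAGAIN return come from the discussion, and their signatures here are assumptions.

```python
import errno
from collections import deque


class Inode:
    """Toy stand-in for a kernel inode (hypothetical, illustration only)."""

    def __init__(self, ino, needs_cluster_lock=False):
        self.ino = ino
        self.needs_cluster_lock = needs_cluster_lock


class SuperBlock:
    """Toy superblock carrying a per-superblock list of inodes being evicted."""

    def __init__(self, lock_available=True):
        self.lock_available = lock_available  # models "the DLM lock can be had now"
        self.being_evicted = deque()          # inodes whose eviction was deferred
        self.evicted = []                     # inode numbers fully evicted, in order

    def evict_inode(self, inode, nosleep):
        """Filesystem hook: 0 on success, -EAGAIN if it would have to block."""
        if inode.needs_cluster_lock and not self.lock_available:
            if nosleep:
                return -errno.EAGAIN
            # A synchronous caller (unmount, iput) would sleep here in the
            # kernel; the toy model simply pretends the lock then arrives.
            self.lock_available = True
        self.evicted.append(inode.ino)
        return 0


def evict(sb, inode, nosleep=False):
    """Generic eviction: queue the inode for retry when the hook says -EAGAIN."""
    ret = sb.evict_inode(inode, nosleep)
    if ret == -errno.EAGAIN:
        sb.being_evicted.append(inode)
    return ret


def dispose_list(sb, inodes):
    """Reclaim path: the only caller that sets the "don't sleep" flag."""
    for inode in inodes:
        evict(sb, inode, nosleep=True)


def retry_deferred(sb):
    """Periodic retry by a background thread; one pass, still non-blocking."""
    for _ in range(len(sb.being_evicted)):
        evict(sb, sb.being_evicted.popleft(), nosleep=True)


def wait_for_evictions(sb):
    """Unmount path: drain the deferred list synchronously before teardown."""
    while sb.being_evicted:
        evict(sb, sb.being_evicted.popleft(), nosleep=False)
```

In this model the reclaim path never waits on the cluster lock: an inode that cannot be evicted immediately is parked on the per-superblock list, and only the unmount path is allowed to block while draining it. Whether a real filesystem could make its eviction path non-blocking in this way is exactly the open question the message asks.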