From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-qk1-f180.google.com (mail-qk1-f180.google.com [209.85.222.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 70C1D14F9D6 for ; Fri, 12 Apr 2024 20:07:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.222.180 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712952422; cv=none; b=spAyW7xowQFcWGnLG39RvQsjnr+nLrl10qHQXG2v5+u5uqdUwfnylQIuP08UC6yQlxc2R6h6FOysfz/tygSsQ1yXe029ozALNQujADH7okJiRZMP9GUhJ+t8/BxAVJ0kfo7VG1RfYYqGWwVLXhV6vkyJS6tJM+vh86wYkcx72Rg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712952422; c=relaxed/simple; bh=wNGPd6zmxsSOZvB1Y9+rixXvra8fmaco8/6xNltH2Z0=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=Qncs1do89GGqDyWUBZWl0Jj0nXuiHnjqDTIF+E5p1gIvjhVPbAko3emySQLST+eA3aR2yHH5COuixoWUkuS9wcSz7teXL7+wW7TKi9nsd4AYMnuW8tT3K0fmSHbQH5/tb6wSTHEck8Z46ETNk2JgwRhnaTOYnrd+CKTGPyd35I0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=toxicpanda.com; spf=none smtp.mailfrom=toxicpanda.com; dkim=pass (2048-bit key) header.d=toxicpanda-com.20230601.gappssmtp.com header.i=@toxicpanda-com.20230601.gappssmtp.com header.b=rVhPfGXr; arc=none smtp.client-ip=209.85.222.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=toxicpanda.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=toxicpanda.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=toxicpanda-com.20230601.gappssmtp.com header.i=@toxicpanda-com.20230601.gappssmtp.com header.b="rVhPfGXr" Received: by mail-qk1-f180.google.com with SMTP id af79cd13be357-78d5751901bso88143385a.1 for ; Fri, 12 Apr 2024 13:07:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20230601.gappssmtp.com; s=20230601; t=1712952419; x=1713557219; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=2adXF7SakD5Sp9oQK7Fg5NWHJFJq783c1g6ZtzWtXWs=; b=rVhPfGXreer+M4kMlhQlpR68Miwdt+6DPkl+c14/LQr2uk3h7oSXlH3pbetW/wFuhC F4vcrmVoW4m+OpVe4+MK3SzZU88o5aVXBne5Yk8oHfXvhdfSZqhHnPJC+V6KvqNshbLM +RqZqnE/C1HHwkOqIkCUCeR7TyzvqV/vh0PU9brJjoRvP6fEoEathjA4ZVo4LaV5VwW+ 7fZSnCu+cPSGPhlxdU8VPA/3rVfn/9ZIVMvT8B/3brF56CcZM60irGWK6VBYo+OZS0K8 q9x/D76wANsQyqUV01P2p8EVDsiMuEMtncHMEFY8VbYFZXzq42Rk5o1jdw2Al1thPhbk p+Xw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712952419; x=1713557219; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=2adXF7SakD5Sp9oQK7Fg5NWHJFJq783c1g6ZtzWtXWs=; b=r0fxPfk5yebyxjNYJe4SUbKo9kHC24IxA1+ZVh3M4JTKwzbaC5qq3OHkYDhyjJ6oab dX0yRQR85fmWHTruTWQNbgPYpxUWLPYelz/Br3MPjDdiIv6YICDqNIdYDwg74FYv/lta KLBihVwtppxMWgqX/D5LRdCeY19GiWUsjAPDKO5V/DWLYtMAmWgVm6CI77z2h6dNE4rR Ihbx0xFSXi6NdjvkrXgzcTLu9T3xD8I1Ek3d6EqbisYgNU2jG6MgtsHB2tzInbjn4OW4 xGzsk3zjieOsRPrYX0BrBMLmqib7Sta0cZNLlbXXF3tIvuWYr46PaIAm1lYjMiEIbxFQ YzxQ== X-Gm-Message-State: AOJu0YwiC5rZOVWHBJ/uyc2WGGsAmGSoDLoPG/zt+3ZFNiG+dCvxQaXj 7q295mes6R3tiH58nJqOPcp8mXwYssfExn7W48LEM0Q9BdvDFJ7aWithWuNArT7ODd2uPmVs+m9 q X-Google-Smtp-Source: AGHT+IGYmtb8GHFoZCf1Fx2xF3uJbgSNmtHlw1Kd/8BAkVciqts5hJbOiDKDcKbQYOflUo6R0mgZNA== X-Received: by 2002:ae9:e201:0:b0:78d:61e8:7421 with SMTP id c1-20020ae9e201000000b0078d61e87421mr4055886qkc.5.1712952419297; Fri, 12 Apr 2024 13:06:59 -0700 (PDT) Received: from localhost (076-182-020-124.res.spectrum.com. [76.182.20.124]) by smtp.gmail.com with ESMTPSA id m3-20020ae9e003000000b0078d4bca760asm2818046qkk.34.2024.04.12.13.06.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 12 Apr 2024 13:06:58 -0700 (PDT) Date: Fri, 12 Apr 2024 16:06:57 -0400 From: Josef Bacik To: fdmanana@kernel.org Cc: linux-btrfs@vger.kernel.org Subject: Re: [PATCH v2 13/15] btrfs: add a shrinker for extent maps Message-ID: <20240412200657.GA1222511@perftesting> References: <1cb649870b6cad4411da7998735ab1141bb9f2f0.1712837044.git.fdmanana@suse.com> Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1cb649870b6cad4411da7998735ab1141bb9f2f0.1712837044.git.fdmanana@suse.com> On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana > > Extent maps are used either to represent existing file extent items, or to > represent new extents that are going to be written and the respective file > extent items are created when the ordered extent completes. > > We currently don't have any limit for how many extent maps we can have, > neither per inode nor globally. Most of the time this not too noticeable > because extent maps are removed in the following situations: > > 1) When evicting an inode; > > 2) When releasing folios (pages) through the btrfs_release_folio() address > space operation callback. > > However we won't release extent maps in the folio range if the folio is > either dirty or under writeback or if the inode's i_size is less than > or equals to 16M (see try_release_extent_mapping(). This 16M i_size > constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: > extent_io and extent_state optimizations"), but there's no explanation > about why we have it or why the 16M value. > > This means that for buffered IO we can reach an OOM situation due to too > many extent maps if either of the following happens: > > 1) There's a set of tasks constantly doing IO on many files with a size > not larger than 16M, specially if they keep the files open for very > long periods, therefore preventing inode eviction. > > This requires a really high number of such files, and having many non > mergeable extent maps (due to random 4K writes for example) and a > machine with very little memory; > > 2) There's a set tasks constantly doing random write IO (therefore > creating many non mergeable extent maps) on files and keeping them > open for long periods of time, so inode eviction doesn't happen and > there's always a lot of dirty pages or pages under writeback, > preventing btrfs_release_folio() from releasing the respective extent > maps. > > This second case was actually reported in the thread pointed by the Link > tag below, and it requires a very large file under heavy IO and a machine > with very little amount of RAM, which is probably hard to happen in > practice in a real world use case. > > However when using direct IO this is not so hard to happen, because the > page cache is not used, and therefore btrfs_release_folio() is never > called. Which means extent maps are dropped only when evicting the inode, > and that means that if we have tasks that keep a file descriptor open and > keep doing IO on a very large file (or files), we can exhaust memory due > to an unbounded amount of extent maps. This is especially easy to happen > if we have a huge file with millions of small extents and their extent > maps are not mergeable (non contiguous offsets and disk locations). > This was reported in that thread with the following fio test: > > $ cat test.sh > #!/bin/bash > > DEV=/dev/sdj > MNT=/mnt/sdj > MOUNT_OPTIONS="-o ssd" > MKFS_OPTIONS="" > > cat < /tmp/fio-job.ini > [global] > name=fio-rand-write > filename=$MNT/fio-rand-write > rw=randwrite > bs=4K > direct=1 > numjobs=16 > fallocate=none > time_based > runtime=90000 > > [file1] > size=300G > ioengine=libaio > iodepth=16 > > EOF > > umount $MNT &> /dev/null > mkfs.btrfs -f $MKFS_OPTIONS $DEV > mount $MOUNT_OPTIONS $DEV $MNT > > fio /tmp/fio-job.ini > umount $MNT > > Monitoring the btrfs_extent_map slab while running the test with: > > $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ > /sys/kernel/slab/btrfs_extent_map/total_objects' > > Shows the number of active and total extent maps skyrocketing to tens of > millions, and on systems with a short amount of memory it's easy and quick > to get into an OOM situation, as reported in that thread. > > So to avoid this issue add a shrinker that will remove extents maps, as > long as they are not pinned, and takes proper care with any concurrent > fsync to avoid missing extents (setting the full sync flag while in the > middle of a fast fsync). This shrinker is similar to the one ext4 uses > for its extent_status structure, which is analogous to btrfs' extent_map > structure. > > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/ > Signed-off-by: Filipe Manana I don't like this for a few reasons 1. We're always starting with the first root and the first inode. We're just going to constantly screw that first inode over and over again. 2. I really, really hate our inode rb-tree, I want to reduce it's use, not add more users. It would be nice if we could just utilize ->s_inodes_lru instead for this, which would also give us the nice advantage of not having to think about order since it's already in LRU order. 3. We're registering our own shrinker without a proper LRU setup. I think it would make sense if we wanted to have a LRU for our extent maps, but I think that's not a great idea. We could get the same benefit by adding our own ->nr_cached_objects() and ->free_cached_objects(), I think that's a better approach no matter what other changes you make instead of registering our own shrinker. The concept I whole heartedly agree with, this just needs some tweaks to be more fair and cleaner. The rest of the code is fine, you can add Reviewed-by: Josef Bacik to the rest of it. Thanks, Josef