From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Peterson
Date: Tue, 3 Jul 2018 09:28:46 -0400 (EDT)
Subject: [Cluster-devel] [GFS2 PATCH] GFS2: Eliminate bitmap clones
In-Reply-To:
References: <913319296.47519402.1530554284901.JavaMail.zimbra@redhat.com>
Message-ID: <1706893656.47709899.1530624526353.JavaMail.zimbra@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi Steve,

----- Original Message -----
> > Do we really still need "clone bitmaps" in gfs2? If so, why?
> > I think maybe we can get rid of them. Can someone (Steve Whitehouse
> > perhaps?) think of a scenario in which they're still needed? If so,
> > please elaborate and give an example.
(snip)
> You need to ensure that the blocks cannot be reused in the same
> transaction (thats true of all metadata blocks, not just inodes) in
> order that recovery will work correctly. You cannot just eliminate the
> bitmaps without adding a mechanism to prevent this reuse,
>
> Steve.

I don't see how it's possible for a transaction to reuse the same
blocks, even when transactions are combined.

As you know, GFS2 (unlike GFS1) marks only one type of metadata in its
bitmaps, and that's for dinode blocks. Any other metadata associated
with a dinode are marked as data blocks in the bitmap, and they remain
marked as such until freed.

So if you have a process that truncates a file, for example, and
transitions its blocks from data to free, then searches, finds and
reallocates those blocks as data again, there would still only be one
copy of the bitmap buffer data in the ail lists, right? And it should
always reflect the most recent status of those bits, which is data,
right? So a journal replay will still replay the latest known version
of those bitmaps.

If a dinode references indirect blocks (marked as data) then truncates
the file to 0, the indirect blocks still remain because the metadata
for indirect blocks is never shrunk.
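
To make the truncate-and-reallocate case above concrete, here is a toy
model in Python. It is only a sketch under stated assumptions, not GFS2
code: ToyRgrp and its methods are hypothetical names, the bitmap is a
plain list, and "commit" merely returns the bitmap copy that a journal
replay would write back. It contrasts an allocator that searches a
clone (so a block freed in the current transaction stays hidden until
commit) with one that searches the real bitmap directly.

```python
FREE, DATA = 0, 1


class ToyRgrp:
    """Hypothetical toy resource group: a real bitmap plus an optional clone."""

    def __init__(self, nblocks, use_clone):
        self.bits = [FREE] * nblocks  # real bitmap (the copy that is journaled)
        self.clone = None             # created lazily on the first free
        self.use_clone = use_clone

    def alloc(self):
        # Search the clone when one exists, so blocks freed in the
        # current transaction still look "in use" to the allocator.
        search = self.clone if self.clone is not None else self.bits
        blk = search.index(FREE)      # raises ValueError if nothing is free
        self.bits[blk] = DATA
        if self.clone is not None:
            self.clone[blk] = DATA
        return blk

    def free(self, blk):
        if self.use_clone and self.clone is None:
            self.clone = list(self.bits)  # snapshot taken before the free
        self.bits[blk] = FREE             # only the real bitmap changes

    def commit(self):
        # Once the transaction is committed, the clone is resynced.
        if self.clone is not None:
            self.clone = list(self.bits)
        return list(self.bits)            # what a replay would write back


def truncate_then_realloc(use_clone):
    """Allocate, free (truncate) and reallocate within one toy transaction."""
    rg = ToyRgrp(4, use_clone)
    first = rg.alloc()
    rg.free(first)
    second = rg.alloc()
    return first, second, rg.commit()


if __name__ == "__main__":
    print("no clone:", truncate_then_realloc(False))
    print("clone:   ", truncate_then_realloc(True))
```

Without the clone the same block is handed out again and the journaled
bitmap shows it as data, which is the "latest known version" described
above. With the clone the allocator skips the just-freed block until
commit. The model says nothing about whether either outcome is safe
for recovery; it only shows which blocks each allocator would pick.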
If the dinode is unlinked rather than deleted, its indirect blocks and
data blocks will all remain "data" until the inode is actually evicted.
When the inode is evicted and those blocks are actually freed, that's
all done in separate transactions as per Andreas's "shrinker" patches,
and we know those don't search for free blocks to assign.

If a dinode is unlinked, and someone goes after free blocks, they won't
find those blocks anyway because they're still not "free" until the
inode is evicted. And, of course, the only processes that search the
bitmaps for unlinked blocks are the eviction process itself (which
actually does something with them) and inplace_reserve, which just
tries to kick off a potential eviction (but never actually does an
eviction itself).

It's a little bit different with directories, because the hash table is
kind of data and kind of metadata, but even so, we don't ever shrink
the directory hash tables nor free leaf blocks or leaf continuation
blocks, as per bz#223783 (which suggests we might want to in the
future.)

The clones today cost us file fragmentation, file system fragmentation,
the performance overhead of the kmallocs/kfrees, and twice as much work
setting and clearing bits, so I question whether the savings from
shrinking hash tables or freeing unused continuation leaves outweigh
the potential savings we might get by eliminating the bitmap clones.

Again, I don't see a scenario that can get us into trouble, even with
journal replay. Perhaps I should be worried about extended attributes
that are freed and reused? I'll look into that.

Bob