Date: Mon, 19 Aug 2019 10:57:59 +0200
From: Jan Kara
To: linux-ext4@vger.kernel.org
Subject: JBD2 transaction running out of space
Message-ID: <20190819085759.GB2491@quack2.suse.cz>

Hello,

I've recently got a bug report where a JBD2 assertion failed due to transaction commit running out of journal space.
After closer inspection of the crash dump, it seems the problem is that there were too many journal descriptor blocks (more than the (max_transaction_size >> 5) + 32 we estimate in jbd2_log_space_left()) due to descriptor blocks with revoke records. In fact, the estimate of the number of descriptor blocks looks pretty arbitrary, and many more descriptor blocks can be needed for revoke records. We need one revoke record for every metadata block freed. So in the worst case (1k blocksize, 64-bit journal feature enabled, checksumming enabled) we fit 125 revoke records in one descriptor block. In common cases it's about 500 revoke records per descriptor block. Now when we free large directories or large files with data journalling enabled, we can have *lots* of blocks to revoke - with extent mapped files easily millions in a single transaction, which can mean 10k descriptor blocks - clearly more than the estimate of 128 descriptor blocks per transaction ;)

Now users clearly don't hit this problem frequently, so this is not a common case, but it is still possible and a malicious user could use it to DoS the machine, so I think we need to get even the weird corner cases fixed. The question is how, because as sketched above the worst case is too bad to account for in the common case. I have considered three options:

1) Count the number of revoke records currently in the transaction and add the needed revoke descriptor blocks to the expected transaction size. This is easy enough but does not solve all the corner cases - a single handle can add a lot of revoke blocks, which may overflow the space we reserve for descriptor blocks.

2) Add an argument to jbd2_journal_start() telling how many metadata blocks we are going to free, and account the necessary revoke descriptor blocks into the reserved credits.
This could work; we would generally need to pass inode->i_blocks / blocksize as the estimate of metadata blocks to free (for inodes to which this applies), as we don't have a better estimate, but I guess that's bearable. It would require some changes on the ext4 side, but nothing too intrusive.

3) Use the fact that we need to revoke only blocks that are currently in the journal. Thus the number of revoke records we really may need to store is well bounded (by the journal size). What is a bit painful is tracking which blocks are journalled. We could use a variant of counting Bloom filters to store that information with low memory consumption (say 64k of memory in the common case) and high-enough accuracy, but that will still be some work to write. On the plus side, it would reduce the number of revoke records we have to store even in the common case.

Overall I'm probably leaning towards 2) but I'm happy to hear more opinions or ideas :)

								Honza
-- 
Jan Kara
SUSE Labs, CR
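
For illustration, the descriptor-block arithmetic in the message above can be sketched in C. This is a minimal sketch, not kernel code: the 16-byte revoke header and the 4-byte checksum tail are assumptions about the on-disk layout that the mail does not state, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk sizes (not stated in the mail): the revoke block
 * header, and the checksum tail present when checksumming is enabled. */
#define REVOKE_HEADER_BYTES 16u
#define CSUM_TAIL_BYTES      4u

/* Revoke records that fit in one descriptor block (hypothetical helper). */
static uint32_t revoke_records_per_block(uint32_t blocksize, int is_64bit,
					 int has_csum)
{
	/* Each record is one block number: 8 bytes with 64-bit, else 4. */
	uint32_t record_bytes = is_64bit ? 8u : 4u;
	uint32_t overhead = REVOKE_HEADER_BYTES +
			    (has_csum ? CSUM_TAIL_BYTES : 0u);

	assert(blocksize > overhead);
	return (blocksize - overhead) / record_bytes;
}

/* Descriptor blocks needed to hold nr_revokes records, rounded up. */
static uint64_t revoke_descriptor_blocks(uint64_t nr_revokes,
					 uint32_t per_block)
{
	assert(per_block > 0);
	return (nr_revokes + per_block - 1) / per_block;
}

/* The control-block estimate quoted in the mail: (size >> 5) + 32. */
static uint64_t control_block_estimate(uint64_t max_transaction_size)
{
	return (max_transaction_size >> 5) + 32;
}
```

Under these assumptions, revoke_records_per_block(1024, 1, 1) gives 125 and revoke_records_per_block(4096, 1, 1) gives 509, in line with the "125" and "about 500" figures above. Five million revokes at 509 records per block need 9824 descriptor blocks, which is the "10k" order of magnitude and far above what control_block_estimate() returns for any realistic transaction size.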