Date: Wed, 12 Dec 2018 10:48:32 +0100
From: Michal Hocko
To: "Kirill A. Shutemov"
Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
    Johannes Weiner, Vladimir Davydov, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, LKML, Hugh Dickins
Subject: Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
Message-ID: <20181212094832.GN1286@dhcp22.suse.cz>
References: <20181211132645.31053-1-mhocko@kernel.org>
    <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1>
In-Reply-To: <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1>

On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko
> >
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [] wait_on_page_bit+0x82/0xa0
> > [] shrink_page_list+0x907/0x960
> > [] shrink_inactive_list+0x2c7/0x680
> > [] shrink_node_memcg+0x404/0x830
> > [] shrink_node+0xd8/0x300
> > [] do_try_to_free_pages+0x10d/0x330
> > [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [] try_charge+0x14d/0x720
> > [] memcg_kmem_charge_memcg+0x3c/0xa0
> > [] memcg_kmem_charge+0x7e/0xd0
> > [] __alloc_pages_nodemask+0x178/0x260
> > [] alloc_pages_current+0x95/0x140
> > [] pte_alloc_one+0x17/0x40
> > [] __pte_alloc+0x1e/0x110
> > [] alloc_set_pte+0x5fe/0xc20
> > [] do_fault+0x103/0x970
> > [] handle_mm_fault+0x61e/0xd10
> > [] __do_page_fault+0x252/0x4d0
> > [] do_page_fault+0x30/0x80
> > [] page_fault+0x28/0x30
> > [] 0xffffffffffffffff
> >
> > task2:
> > [] __lock_page+0x86/0xa0
> > [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [] ext4_writepages+0x479/0xd60
> > [] do_writepages+0x1e/0x30
> > [] __writeback_single_inode+0x45/0x320
> > [] writeback_sb_inodes+0x272/0x600
> > [] __writeback_inodes_wb+0x92/0xc0
> > [] wb_writeback+0x268/0x300
> > [] wb_workfn+0xb4/0x390
> > [] process_one_work+0x189/0x420
> > [] worker_thread+0x4e/0x4b0
> > [] kthread+0xe6/0x100
> > [] ret_from_fork+0x41/0x50
> > [] 0xffffffffffffffff
> >
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> >
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
>
> Side node:
>
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?

I am not aware of any.

> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.
>
> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
>
> Maybe the problem is within task2 codepath after all?

Jack and David have explained that this is due to an optimization that
multiple filesystems do. They lock and set writeback on multiple pages
and then send a larger IO at once. So in this case we have the following
pattern (writeback on the left, page fault on the right):

    lock_page(B)
    SetPageWriteback(B)
    unlock_page(B)
    lock_page(A)
                                lock_page(A)
                                pte_alloc_one
                                  shrink_page_list
                                    wait_on_page_writeback(B)
    SetPageWriteback(A)
    unlock_page(A)

    # flush A, B to clear the writeback

-- 
Michal Hocko
SUSE Labs
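
For anybody who wants to see the dependency spelled out, the pattern above
can be sketched as a tiny wait-for graph. This is a userspace toy model and
not kernel code: the names FLUSHER, FAULTER, has_wait_cycle() and
fault_vs_flusher_deadlocks() are made up for illustration, and the "reclaim
does not wait on writeback" case is a hypothetical used only to show that
the cycle disappears once that wait is gone.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_TASKS 2
#define NO_TASK  (-1)

/* Hypothetical task ids: the writeback worker and the faulting task. */
enum { FLUSHER = 0, FAULTER = 1 };

/*
 * waits_for[t] is the task that has to make progress before t can
 * continue, or NO_TASK if t is not blocked on anybody.
 */
static bool has_wait_cycle(const int waits_for[], int nr_tasks)
{
	assert(nr_tasks > 0);

	for (int start = 0; start < nr_tasks; start++) {
		int cur = start;

		/* Still on a task after nr_tasks hops means we looped. */
		for (int hops = 0; hops < nr_tasks && cur != NO_TASK; hops++)
			cur = waits_for[cur];

		if (cur != NO_TASK)
			return true;
	}
	return false;
}

static bool fault_vs_flusher_deadlocks(bool reclaim_waits_on_writeback)
{
	int waits_for[NR_TASKS] = { NO_TASK, NO_TASK };

	/*
	 * Flusher: B already has writeback set but the IO is not
	 * submitted yet; it blocks in lock_page(A), held by the faulter.
	 */
	waits_for[FLUSHER] = FAULTER;

	/*
	 * Faulter: holds A locked, allocates a page table and ends up in
	 * memcg reclaim. If reclaim does wait_on_page_writeback(B), it
	 * depends on the flusher submitting the IO.
	 */
	if (reclaim_waits_on_writeback)
		waits_for[FAULTER] = FLUSHER;

	return has_wait_cycle(waits_for, NR_TASKS);
}
```

Under these assumptions fault_vs_flusher_deadlocks(true) reports the ABBA
cycle from the traces, while fault_vs_flusher_deadlocks(false) does not,
because the flusher is then only waiting for a page lock that the faulter
will eventually drop.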