References: <20201203220242.158165-1-mike.kravetz@oracle.com>
In-Reply-To: <20201203220242.158165-1-mike.kravetz@oracle.com>
From: Shakeel Butt
Date: Thu, 3 Dec 2020 14:11:49 -0800
Subject: Re: [PATCH] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
To: Mike Kravetz
Cc: Linux MM, LKML, Cgroups, Mina Almasry, David Rientjes, Greg Thelen, Sandipan Das, Shuah Khan, Adrian Moreno, Andrew Morton, stable@vger.kernel.org

On Thu, Dec 3, 2020 at 2:04 PM Mike Kravetz wrote:
>
> Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
> using hugetlbfs. In this environment the issue is reproduced by:
> 1 - Start a simple pod that uses the recently added HugePages medium
>     feature (pod yaml attached)
> 2 - Start a DPDK app. It doesn't need to run successfully (as in transfer
>     packets) nor interact with real hardware. It seems just initializing
>     the EAL layer (which handles hugepage reservation and locking) is
>     enough to trigger the issue
> 3 - Delete the Pod (or let it "Complete").
>
> This would result in a kworker thread going into a tight loop (top output):
> 1425 root      20   0       0      0      0 R  99.7  0.0   5:22.45
>      kworker/28:7+cgroup_destroy
>
> 'perf top -g' reports:
> - 63.28% 0.01% [kernel] [k] worker_thread
>    - 49.97% worker_thread
>       - 52.64% process_one_work
>          - 62.08% css_killed_work_fn
>             - hugetlb_cgroup_css_offline
>                  41.52% _raw_spin_lock
>                - 2.82% _cond_resched
>                     rcu_all_qs
>                  2.66% PageHuge
>             - 0.57% schedule
>                - 0.57% __schedule
>
> We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
> Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
> infinitely spinning. Little else can be done on the system, as the
> cgroup_mutex cannot be acquired.
>
> Note that the issue can also be reproduced by simply offlining a hugetlb
> cgroup containing pages with reservation counts.
>
> The loop in hugetlb_cgroup_css_offline is moving page counts from the
> cgroup being offlined to the parent cgroup. This is done for each hstate,
> and is repeated until hugetlb_cgroup_have_usage returns false. The routine
> moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts.
> The routine hugetlb_cgroup_have_usage is checking for both 'usage' and
> 'reservation' counts. What to do with reservation counts when reparenting
> was discussed here:
>
> https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
>
> The decision was made to leave a zombie cgroup with reservation
> counts. Unfortunately, the code checking reservation counts was
> incorrectly added to hugetlb_cgroup_have_usage.
>
> To fix the issue, simply remove the check for reservation counts. While
> fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed:
> the hstate index is not reinitialized each time through the do-while loop.
> Fix this as well.
>
> Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
> Cc: stable@vger.kernel.org
> Reported-by: Adrian Moreno
> Tested-by: Adrian Moreno
> Signed-off-by: Mike Kravetz

Reviewed-by: Shakeel Butt