From: Pavel Tatashin
Date: Fri, 20 Nov 2020 15:27:46 -0500
Subject: Pinning ZONE_MOVABLE pages
To: linux-mm, Andrew Morton, Vlastimil Babka, LKML, Michal Hocko,
 David Hildenbrand, Oscar Salvador, Dan Williams, Sasha Levin,
 Tyler Hicks, Joonsoo Kim, sthemmin@microsoft.com

Recently, I encountered a hang during a memory hot-remove operation. It
turns out that the hang is caused by pinned user pages in ZONE_MOVABLE.

The kernel expects that all pages in ZONE_MOVABLE can be migrated, but
this is not the case if a user application, for example one using the
dpdk libraries, has pinned them via vfio dma map. The kernel keeps
trying to hot-remove them, but the refcount never gets to zero, so we
loop until the hardware watchdog kicks in.

We cannot do dma unmaps before the hot-remove, because hot-remove is a
slow operation, and we have thousands of network flows handled by dpdk
that we simply cannot suspend for the duration of the hot-remove
operation.

The solution would be for dpdk to allocate pages from a zone below
ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not currently
possible: there is no user interface that allows applications to select
which zone the memory should come from.

I've spoken with Stephen Hemminger, and he said that DPDK is moving in
the direction of using transparent huge pages instead of HugeTLB, which
means that we need to allow at least anonymous pages, and anonymous
transparent huge pages, to come from non-movable zones on demand.

Here is what I am proposing:

1. Add a new flag that is passed through pin_user_pages_* down to the
fault handlers, and allow the fault handlers to allocate from a
non-movable zone.

A sample call stack through which this information needs to be passed
is:

pin_user_pages_remote(gup_flags)
 __get_user_pages_remote(gup_flags)
  __gup_longterm_locked(gup_flags)
   __get_user_pages_locked(gup_flags)
    __get_user_pages(gup_flags)
     faultin_page(gup_flags)
      Convert gup_flags into fault_flags
      handle_mm_fault(fault_flags)

From handle_mm_fault(), the stack diverges into the various fault
paths; examples include:

Transparent Huge Page
handle_mm_fault(fault_flags)
 __handle_mm_fault(fault_flags)
  Create: struct vm_fault vmf, use fault_flags to specify the correct gfp_mask
  create_huge_pmd(vmf);
   do_huge_pmd_anonymous_page(vmf);
    mm_get_huge_zero_page(vma->vm_mm);
     -> the flag is lost here, so the flag from vmf.gfp_mask should be
        passed down as well.

There are several other similar paths for transparent huge pages, and
there is also a named (file-backed) path where the allocation is done
by the filesystem; the flag should be honored there as well, but it
does not have to be added at the same time.

Regular Pages
handle_mm_fault(fault_flags)
 __handle_mm_fault(fault_flags)
  Create: struct vm_fault vmf, use fault_flags to specify the correct gfp_mask
  handle_pte_fault(vmf)
   do_anonymous_page(vmf);
    page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
     -> replace this call according to gfp_mask.
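To make this more concrete, here is a rough, untested sketch of what
the flag plumbing could look like. The names FOLL_NO_MOVABLE and
FAULT_FLAG_NO_MOVABLE are placeholders I am using for illustration
only; they do not exist today, and the actual bit values would have to
be chosen so they do not collide with existing flags:

/* Hypothetical flags, names and values for illustration only. */
#define FOLL_NO_MOVABLE       0x1000000 /* long-term pin: avoid ZONE_MOVABLE */
#define FAULT_FLAG_NO_MOVABLE 0x1000

/* In faultin_page(): translate the gup flag into a fault flag. */
if (*flags & FOLL_NO_MOVABLE)
	fault_flags |= FAULT_FLAG_NO_MOVABLE;
ret = handle_mm_fault(vma, address, fault_flags, NULL);

/* In do_anonymous_page(): honor the fault flag when allocating,
 * instead of unconditionally using the movable variant. */
gfp_t gfp = (vmf->flags & FAULT_FLAG_NO_MOVABLE) ?
		GFP_HIGHUSER : GFP_HIGHUSER_MOVABLE;
page = alloc_page_vma(gfp | __GFP_ZERO, vma, vmf->address);

The THP paths would similarly have to consult vmf->flags (or a
gfp_mask derived from it in __handle_mm_fault()) when picking the
gfp_mask for do_huge_pmd_anonymous_page().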
The above only takes care of the case where the user application
faults on the page at pinning time, but there are also cases where the
pages already exist.

2. Add an internal move_pages_zone(), similar to the move_pages()
syscall, but instead of migrating pages to a different NUMA node,
migrate them from ZONE_MOVABLE to another zone. Call move_pages_zone()
on demand prior to pinning pages, from vfio_pin_map_dma() for instance
(a rough sketch of what I have in mind is appended after the
sign-off).

3. Perhaps it also makes sense to add an madvise() flag to allocate
pages from a non-movable zone. When a user application knows that it
will do DMA mapping and pin pages for a long time, the memory that it
allocates should never be migrated or hot-removed, so we should make
sure it comes from the appropriate place. The benefit of adding an
madvise() flag is that we won't have to deal with slow page migration
at pin time; the disadvantage is that we would need to change the user
interface.

Before I start working on the above approaches, I would like to get
the community's opinion on an appropriate path forward for this
problem: whether what I described sounds reasonable, or whether there
are other ideas on how to address the problem that I am seeing.

Thank you,
Pasha
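P.S. Here is the rough, untested sketch of the internal
move_pages_zone() mentioned in (2). The function name, signature, and
the way it reuses the existing migration machinery are my assumptions
for illustration only; the isolation loop is omitted:

/*
 * Hypothetical helper, sketch only: migrate the pages backing the
 * user range [start, start + nr_pages * PAGE_SIZE) in 'mm' out of
 * ZONE_MOVABLE before they are long-term pinned.
 */
static struct page *alloc_nonmovable_target(struct page *page,
					    unsigned long private)
{
	/* Allocate the destination without __GFP_MOVABLE, so the new
	 * page lands in a non-movable zone. (Simplified: real code
	 * would preserve the node and handle huge pages.) */
	return alloc_page(GFP_HIGHUSER);
}

int move_pages_zone(struct mm_struct *mm, unsigned long start,
		    unsigned long nr_pages)
{
	LIST_HEAD(pagelist);

	/* 1. Walk the range and isolate every page that currently
	 *    sits in ZONE_MOVABLE onto 'pagelist' (follow_page() +
	 *    isolate_lru_page() loop, omitted here). */

	/* 2. Migrate the isolated pages to non-movable destinations. */
	return migrate_pages(&pagelist, alloc_nonmovable_target, NULL, 0,
			     MIGRATE_SYNC, MR_SYSCALL);
}

vfio_pin_map_dma() would then call move_pages_zone() on the range
right before taking the long-term pin, so that already-faulted
ZONE_MOVABLE pages are moved out of the way first.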