From patchwork Wed Aug 18 06:30:57 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443379 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C5C62C432BE for ; Wed, 18 Aug 2021 06:31:14 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 6EFEF60720 for ; Wed, 18 Aug 2021 06:31:14 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 6EFEF60720 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 0AD846B0074; Wed, 18 Aug 2021 02:31:14 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E3F6F6B0075; Wed, 18 Aug 2021 02:31:13 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C6A8E6B0078; Wed, 18 Aug 2021 02:31:13 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0107.hostedemail.com [216.40.44.107]) by kanga.kvack.org (Postfix) with ESMTP id AAB506B0074 for ; Wed, 18 Aug 2021 02:31:13 -0400 (EDT) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 5FE438249980 for ; Wed, 18 Aug 2021 06:31:13 +0000 (UTC) X-FDA: 78487229226.28.BBB4431 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf18.hostedemail.com (Postfix) with ESMTP id 1839D4001EBB for ; Wed, 18 Aug 2021 06:31:12 +0000 (UTC) Received: by mail-yb1-f202.google.com with SMTP id p71-20020a25424a0000b029056092741626so1764846yba.19 for ; Tue, 17 Aug 2021 23:31:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=axDz8FmbUkmwoP5Po1n0WrbURWOp6eK+Sy9KgBfPnX8=; b=tMJdGo0VdmwhtsmfaZ+8rkd4E9lkxdQ44NTW3d4/TNpCinw/fAFDzywEKib15cClPi W29Wg6lpdN+bZgNq9Vd3h4q/1uzr+5hqEfenQ6HyCLa6QbL33fZSCMLzbTULgjUGwpI+ BXmZg8SRzvZbXVtkQBlaLzEKm8WV77UfxsQ8EG6tQhxw6/Eh23ojg4fHfZC7QgWEo1ji DDtk9ZcuKvr1BNn+esg0iIEsM4XejbvsJ+161wy6H40w1DQU0zhTeR6Qc0qXfc3jhHfB OSV1hDsLPUfb0gbAc/64uz9OCgvP0IMwvygx6lIZRwZoCsDlkU53OLv5tpYI+ITQhfO/ zGEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=axDz8FmbUkmwoP5Po1n0WrbURWOp6eK+Sy9KgBfPnX8=; b=HEyHpRA7YkfffNvslIvW6HuN/6lnGvyQzgi2vghhPPVeVN1+ZUdzybOhUmp8V1aIsd R6fuYvHIpvNaRU142y3ulk0Hu9MLPLhdMWDHgdWfAbIZbPHtXgIshRoGehpK84rKwtDo 5tSnPqV8YuFY1LsWdBgNyDHlrWBWFvpwRw77Zp/UM4h4INljVWXA58iykQGq6VBngklo rsRicm6hIe9fEMon2Pnmx2zr8mAqhb8NSrlKRey0Bl7gmRdM8+cbn9EYrklokoXvCtKk lW64Y3rnhPEoczcQygRAiBGYyJ05hDiyz3qZ18SGS7sYMABsjNr0g83Bu9xDoPa91RVC MSkw== X-Gm-Message-State: AOAM533RJ4Wj2uAqtXyGyNOhUUTEKfFVtgLHkrIUi6WCw6BsLiZ0OUjk DRJPW3BODZlhqj+fBg4YowtzkJpwM9h0gx2aA6A/hNtOB2FHNaK+tsi74IrNsA92Ev/+6euQD6R MDW0nY//owiss6a0Zn+KAhFxRVqnkJyheQOyWmPpErefzhtBP/L5D1Mmw X-Google-Smtp-Source: ABdhPJz9k/2W2LmgADoTsZbfpB5d6btee0nxlX6DF/x7Ki85TCn69a1MVMx/WICufFas9KT4dBOAj2dLDVU= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:cf8a:: with SMTP id f132mr9779192ybg.387.1629268272384; Tue, 17 Aug 2021 23:31:12 -0700 (PDT) Date: Wed, 18 Aug 2021 00:30:57 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-2-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young() From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=tMJdGo0V; spf=pass (imf18.hostedemail.com: domain of 3MKkcYQYKCAU3z4mftlttlqj.htrqnsz2-rrp0fhp.twl@flex--yuzhao.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3MKkcYQYKCAU3z4mftlttlqj.htrqnsz2-rrp0fhp.twl@flex--yuzhao.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 1839D4001EBB X-Stat-Signature: ierg7zqi1iutxm8fjk9smhyogpegi6rn X-HE-Tag: 1629268272-765849 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Some architectures set the accessed bit in PTEs automatically, e.g., x86, and arm64 v8.2 and later. On architectures that do not have this capability, clearing the accessed bit in a PTE triggers a page fault following the TLB miss. Being aware of this capability can help make better decisions, i.e., whether to limit the size of each batch of PTEs and the burst of batches when clearing the accessed bit. Signed-off-by: Yu Zhao --- arch/arm64/include/asm/cpufeature.h | 19 ++++++------------- arch/arm64/include/asm/pgtable.h | 10 ++++------ arch/arm64/kernel/cpufeature.c | 19 +++++++++++++++++++ arch/arm64/mm/proc.S | 12 ------------ arch/arm64/tools/cpucaps | 1 + arch/x86/include/asm/pgtable.h | 6 +++--- include/linux/pgtable.h | 12 ++++++++++++ mm/memory.c | 14 +------------- 8 files changed, 46 insertions(+), 47 deletions(-) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 9bb9d11750d7..2020b9e818c8 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -776,6 +776,12 @@ static inline bool system_supports_tlb_range(void) cpus_have_const_cap(ARM64_HAS_TLB_RANGE); } +/* Check whether hardware update of the Access flag is supported. */ +static inline bool system_has_hw_af(void) +{ + return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF); +} + extern int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt); static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) @@ -799,19 +805,6 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) } } -/* Check whether hardware update of the Access flag is supported */ -static inline bool cpu_has_hw_af(void) -{ - u64 mmfr1; - - if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) - return false; - - mmfr1 = read_cpuid(ID_AA64MMFR1_EL1); - return cpuid_feature_extract_unsigned_field(mmfr1, - ID_AA64MMFR1_HADBS_SHIFT); -} - static inline bool cpu_has_pan(void) { u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1); diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index f09bf5c02891..b63a6a7b62ee 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -993,13 +993,11 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, * page after fork() + CoW for pfn mappings. We don't always have a * hardware-managed access flag on arm64. */ -static inline bool arch_faults_on_old_pte(void) +static inline bool arch_has_hw_pte_young(void) { - WARN_ON(preemptible()); - - return !cpu_has_hw_af(); + return system_has_hw_af(); } -#define arch_faults_on_old_pte arch_faults_on_old_pte +#define arch_has_hw_pte_young arch_has_hw_pte_young /* * Experimentally, it's cheap to set the access flag in hardware and we @@ -1007,7 +1005,7 @@ static inline bool arch_faults_on_old_pte(void) */ static inline bool arch_wants_old_prefaulted_pte(void) { - return !arch_faults_on_old_pte(); + return arch_has_hw_pte_young(); } #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 0ead8bfedf20..d05de77626f5 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1650,6 +1650,14 @@ static bool has_hw_dbm(const struct arm64_cpu_capabilities *cap, return true; } +static void cpu_enable_hw_af(struct arm64_cpu_capabilities const *cap) +{ + u64 val = read_sysreg(tcr_el1); + + write_sysreg(val | TCR_HA, tcr_el1); + isb(); + local_flush_tlb_all(); +} #endif #ifdef CONFIG_ARM64_AMU_EXTN @@ -2126,6 +2134,17 @@ static const struct arm64_cpu_capabilities arm64_features[] = { .matches = has_hw_dbm, .cpu_enable = cpu_enable_hw_dbm, }, + { + .desc = "Hardware update of the Access flag", + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .capability = ARM64_HW_AF, + .sys_reg = SYS_ID_AA64MMFR1_EL1, + .sign = FTR_UNSIGNED, + .field_pos = ID_AA64MMFR1_HADBS_SHIFT, + .min_field_value = 1, + .matches = has_cpuid_feature, + .cpu_enable = cpu_enable_hw_af, + }, #endif { .desc = "CRC32 instructions", diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S index 35936c5ae1ce..b066d5712e3d 100644 --- a/arch/arm64/mm/proc.S +++ b/arch/arm64/mm/proc.S @@ -478,18 +478,6 @@ SYM_FUNC_START(__cpu_setup) * Set the IPS bits in TCR_EL1. */ tcr_compute_pa_size tcr, #TCR_IPS_SHIFT, x5, x6 -#ifdef CONFIG_ARM64_HW_AFDBM - /* - * Enable hardware update of the Access Flags bit. - * Hardware dirty bit management is enabled later, - * via capabilities. - */ - mrs x9, ID_AA64MMFR1_EL1 - and x9, x9, #0xf - cbz x9, 1f - orr tcr, tcr, #TCR_HA // hardware Access flag update -1: -#endif /* CONFIG_ARM64_HW_AFDBM */ msr mair_el1, mair msr tcr_el1, tcr /* diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps index 49305c2e6dfd..d52f50671e60 100644 --- a/arch/arm64/tools/cpucaps +++ b/arch/arm64/tools/cpucaps @@ -35,6 +35,7 @@ HAS_STAGE2_FWB HAS_SYSREG_GIC_CPUIF HAS_TLB_RANGE HAS_VIRT_HOST_EXTN +HW_AF HW_DBM KVM_PROTECTED_MODE MISMATCHED_CACHE_TYPE diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 448cd01eb3ec..3908780fc408 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_check(void) return boot_cpu_has_bug(X86_BUG_L1TF); } -#define arch_faults_on_old_pte arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) +#define arch_has_hw_pte_young arch_has_hw_pte_young +static inline bool arch_has_hw_pte_young(void) { - return false; + return true; } #endif /* __ASSEMBLY__ */ diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e24d2c992b11..3a8221fa2c76 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -258,6 +258,18 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif +#ifndef arch_has_hw_pte_young +static inline bool arch_has_hw_pte_young(void) +{ + /* + * Those arches which have hw access flag feature need to implement + * their own helper. By default, "false" means pagefault will be hit + * on old pte. + */ + return false; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index 25fc46e87214..2f96179db219 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -121,18 +121,6 @@ int randomize_va_space __read_mostly = 2; #endif -#ifndef arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) -{ - /* - * Those arches which don't have hw access flag feature need to - * implement their own helper. By default, "true" means pagefault - * will be hit on old pte. - */ - return true; -} -#endif - #ifndef arch_wants_old_prefaulted_pte static inline bool arch_wants_old_prefaulted_pte(void) { @@ -2769,7 +2757,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src, * On architectures with software "accessed" bits, we would * take a double page fault, so mark it accessed here. */ - if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { pte_t entry; vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); From patchwork Wed Aug 18 06:30:58 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443381 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AF8BCC4338F for ; Wed, 18 Aug 2021 06:31:16 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 4766360720 for ; Wed, 18 Aug 2021 06:31:16 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 4766360720 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 95F966B0075; Wed, 18 Aug 2021 02:31:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8E8E16B0078; Wed, 18 Aug 2021 02:31:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 789176B007B; Wed, 18 Aug 2021 02:31:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0176.hostedemail.com [216.40.44.176]) by kanga.kvack.org (Postfix) with ESMTP id 5F1446B0075 for ; Wed, 18 Aug 2021 02:31:15 -0400 (EDT) Received: from smtpin38.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id E31B4184138FF for ; Wed, 18 Aug 2021 06:31:14 +0000 (UTC) X-FDA: 78487229268.38.72C5F94 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) by imf12.hostedemail.com (Postfix) with ESMTP id 9C9951004EDF for ; Wed, 18 Aug 2021 06:31:14 +0000 (UTC) Received: by mail-yb1-f201.google.com with SMTP id w201-20020a25dfd2000000b00594695384d1so1788511ybg.20 for ; Tue, 17 Aug 2021 23:31:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=r28OOl9H0j68087MxXenLvNFF4q/L5LmucJPA1P1rF0=; b=rVacPqv+ksAvDz+9AzHa3UA8YFJnchxgAhITykWT0kxNmHdqCt/DAMGYBXtFWBzdAw UzvzekE5eXZhtYXiInhdds4rbazWL2e8SFPVHUTGBWlGoWlFjQjsFG7H8fdm3sVFRrQW CFVuV2fjqGRF2ixE+rJiDYBWDDGU/m+XKJeE5lrzOOrs34yZw1Ln+xkk4ovZWFoIZoZa qCrHFEneIQhdNP2RXcj/F4CpHXU0rbR7/8YAGVM5R73tsv7EDGpvPGvc4a+NpC+R39BQ Ur9JQcBxVNokNwE8GV6vwh3SSZ/rfbxBm3LxCp2vVHVrgTnlp/1EuNSgPFpyUabNGaju +HbA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=r28OOl9H0j68087MxXenLvNFF4q/L5LmucJPA1P1rF0=; b=rhGqOrpzKCVP1kqrXzQr7nNlykZjoqhG7Aq1WDZC1vtx5VJ50/IhsIwiM5AFT6svC9 4fRnksFFREBF45DFYDPdJro4p2VtkMglxu/jN5Zps+jwHG3n2Yew7LIBG1Tm5cS9f6XG VfNl2V007FhQOTpn10z3qc9TT36kUxTMxlqmkzD4L9MbliQnXgkXV534Wr6T0QoiLWEa DjpXuG+KWc/6CUwaamMcAqIV6sSs6ybQ74nQVc72+S2IpAcNNoH/I9o7347+LucMk2kT twwmAkCcS6MSQuKxsszpUsvXofF+giBWH9PF8uGuzt8kVyVtjq4wSbhvQTaLvVLCcxdk mqhw== X-Gm-Message-State: AOAM533tFDVqNULti6Tdyly25YgIWTiusXKk7azjs60a3jRmYck14xKV azSMcKZnE2+7zk87soK22ZDyrs5BXK3F5NaIEFQWZX8ieeX7nyAj4Hf1cyolRZO8lorC6fQok14 di+2QdQEk9Va3eNj0TtEC/70S6aJ3NByCxJWMTJW3yBdmYFxAyRSLt2/f X-Google-Smtp-Source: ABdhPJxKa+6gSOMXKC9+9yKBARFPL7honqMsWyRtIkHW0A5Nhi7abOtfmKUl4eee1vwoKxR4P+tGRKOMAyI= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:2688:: with SMTP id m130mr9020507ybm.146.1629268273873; Tue, 17 Aug 2021 23:31:13 -0700 (PDT) Date: Wed, 18 Aug 2021 00:30:58 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-3-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 02/11] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=rVacPqv+; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf12.hostedemail.com: domain of 3MakcYQYKCAY405ngumuumrk.iusrot03-ssq1giq.uxm@flex--yuzhao.bounces.google.com designates 209.85.219.201 as permitted sender) smtp.mailfrom=3MakcYQYKCAY405ngumuumrk.iusrot03-ssq1giq.uxm@flex--yuzhao.bounces.google.com X-Stat-Signature: 9q7crf4ydtcm4irdjb66ijaq6wsee4g7 X-Rspamd-Queue-Id: 9C9951004EDF X-Rspamd-Server: rspam05 X-HE-Tag: 1629268274-142487 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Some architectures support the accessed bit on non-leaf PMD entries, e.g., x86_64 sets the accessed bit on a non-leaf PMD entry when using it as part of linear address translation [1]. As an optimization, page table walkers who are interested in the accessed bit can skip the PTEs under a non-leaf PMD entry if the accessed bit is cleared on this non-leaf PMD entry. Although an inline function may be preferable, this capability is added as a configuration option to look consistent when used with the existing macros. [1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (October 2019), section 4.8 Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- arch/Kconfig | 9 +++++++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 3 ++- arch/x86/mm/pgtable.c | 5 ++++- include/linux/pgtable.h | 4 ++-- 5 files changed, 18 insertions(+), 4 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 129df498a8e1..5b6b4f95372f 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1282,6 +1282,15 @@ config ARCH_SPLIT_ARG64 config ARCH_HAS_ELFCORE_COMPAT bool +config ARCH_HAS_NONLEAF_PMD_YOUNG + bool + depends on PGTABLE_LEVELS > 2 + help + Architectures that select this are able to set the accessed bit on + non-leaf PMD entries in addition to leaf PTE entries where pages are + mapped. For them, page table walkers that clear the accessed bit may + stop at non-leaf PMD entries if they do not see the accessed bit. + source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 88fb922c23a0..36a81d31f711 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -84,6 +84,7 @@ config X86 select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL + select ARCH_HAS_NONLEAF_PMD_YOUNG if X86_64 select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_SET_MEMORY diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 3908780fc408..01a1763123ff 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -817,7 +817,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) static inline int pmd_bad(pmd_t pmd) { - return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; + return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) != + (_KERNPG_TABLE & ~_PAGE_ACCESSED); } static inline unsigned long pages_to_mb(unsigned long npg) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 3481b35cb4ec..a224193d84bf 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma, return ret; } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) { @@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma, return ret; } +#endif + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE int pudp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pud_t *pudp) { diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 3a8221fa2c76..483d5ff7a33e 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -211,7 +211,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) @@ -232,7 +232,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, BUILD_BUG(); return 0; } -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ #endif #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH From patchwork Wed Aug 18 06:30:59 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443383 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A70C6C432BE for ; Wed, 18 Aug 2021 06:31:18 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 53AEF60720 for ; Wed, 18 Aug 2021 06:31:18 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 53AEF60720 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 010A56B0078; Wed, 18 Aug 2021 02:31:17 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EDAD36B007B; Wed, 18 Aug 2021 02:31:16 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D07326B007D; Wed, 18 Aug 2021 02:31:16 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0187.hostedemail.com [216.40.44.187]) by kanga.kvack.org (Postfix) with ESMTP id B423C6B0078 for ; Wed, 18 Aug 2021 02:31:16 -0400 (EDT) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 55BB620BF4 for ; Wed, 18 Aug 2021 06:31:16 +0000 (UTC) X-FDA: 78487229352.25.08BFBEA Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) by imf25.hostedemail.com (Postfix) with ESMTP id 189DBB000298 for ; Wed, 18 Aug 2021 06:31:15 +0000 (UTC) Received: by mail-yb1-f201.google.com with SMTP id f3-20020a25cf030000b029055a2303fc2dso1806506ybg.11 for ; Tue, 17 Aug 2021 23:31:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=IbNlCcnWCb+KjCwpSHdD59CG4potZ0Ug8pgSOhUgoys=; b=XDyq5Z6SoXVLKLMxSUBELU3/JYB+XigRWxLmHl+mLVjgRyxzm26UQCeVaYop0jlk8Q /j3euucaicnXQLnrC9vi8lKXjBup6gx5AJkPyNKArxfWHKlWgPq9vlAt4zuxjNiwgemL Z2I3+TjwsltWgI++n6c2qaGMKlifmtnYBUI+VvL25BYzfpP2AkXI/s1WUdnXgGEc13Li Hy37gyZ+oYRHjqbMj3AL08iB3J+yX1CN2BCMOp7sjdwdce3uEAXIkkrp5qc5hemlY8nh jm2ZLuv0bnfOtB7jLWewZP2vlOyubRZD7CM7jL7+jk6KgOx+CtLcBKkSFg1JcPNbrOOr j0ug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=IbNlCcnWCb+KjCwpSHdD59CG4potZ0Ug8pgSOhUgoys=; b=jZPnPE0jlqRbi22Deu5gHa1uKexbIyxs+qLUoz108hwcNg2p4fkRksxOKcF+t3Hu7J cZwi4LZpWP3Q1saFiAhDBxy5+39OQXWUZAqq3o9opVIsIkfVxFoIGAuCeW/gjlhNWOW8 CSk55NNDESHOChrepRAkN1ETnNQnZsuGEh4AMoCzRMKAelXsVVmFV42ID0ie0IZVCw3O jIJvVyveRLRQ/cooePoIqonuqh8Tski/BHX41c67cC9d6exL+C6wbjo+k3CBOlvCBgT8 qHpf8oj8cCNtz0Gp8WJW4PLA3+F174ogs2XL9JMH7FpmGXZjNB6zuZOvoeqlp6864KS4 H5Dw== X-Gm-Message-State: AOAM533xoeOvffOGFMmpCR9kqnN4I1lzVBqSzp2fV0KytdcnlrcXAP7x /VtlqfrSGaCkxWS0oLRRnb57MyUd5m8A/5CCQcNmpylSBNPBPWZlXUmZAF4IATlGaDlGtAdA9AE S5RHs+i9x+BtXZgMSQrsJLDB/ptRZzJOL/7DXB4SIxVaikMUxobKT+QoB X-Google-Smtp-Source: ABdhPJx0xmbEVoVaj90boaH3OUfKySCnDfxk00twIs2YN0dnUnaA3Xv1FOpcTPviSrYSArDiy17HMf1TW/4= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:b5ce:: with SMTP id d14mr9544572ybg.415.1629268275365; Tue, 17 Aug 2021 23:31:15 -0700 (PDT) Date: Wed, 18 Aug 2021 00:30:59 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-4-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 03/11] mm/vmscan.c: refactor shrink_node() From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=XDyq5Z6S; spf=pass (imf25.hostedemail.com: domain of 3M6kcYQYKCAg627piwowwotm.kwutqv25-uus3iks.wzo@flex--yuzhao.bounces.google.com designates 209.85.219.201 as permitted sender) smtp.mailfrom=3M6kcYQYKCAg627piwowwotm.kwutqv25-uus3iks.wzo@flex--yuzhao.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 189DBB000298 X-Stat-Signature: sidfqz1997bcoo1mjmyswmcg3wwau9xb X-HE-Tag: 1629268275-683964 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch refactors shrink_node(). This will make the upcoming changes to mm/vmscan.c more readable. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- mm/vmscan.c | 186 +++++++++++++++++++++++++++------------------------- 1 file changed, 98 insertions(+), 88 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 4620df62f0ff..b6d14880bd76 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2437,6 +2437,103 @@ enum scan_balance { SCAN_FILE, }; +static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) +{ + unsigned long file; + struct lruvec *target_lruvec; + + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); + + /* + * Determine the scan balance between anon and file LRUs. + */ + spin_lock_irq(&target_lruvec->lru_lock); + sc->anon_cost = target_lruvec->anon_cost; + sc->file_cost = target_lruvec->file_cost; + spin_unlock_irq(&target_lruvec->lru_lock); + + /* + * Target desirable inactive:active list ratios for the anon + * and file LRU lists. + */ + if (!sc->force_deactivate) { + unsigned long refaults; + + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_ANON); + if (refaults != target_lruvec->refaults[0] || + inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) + sc->may_deactivate |= DEACTIVATE_ANON; + else + sc->may_deactivate &= ~DEACTIVATE_ANON; + + /* + * When refaults are being observed, it means a new + * workingset is being established. Deactivate to get + * rid of any stale active pages quickly. + */ + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_FILE); + if (refaults != target_lruvec->refaults[1] || + inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) + sc->may_deactivate |= DEACTIVATE_FILE; + else + sc->may_deactivate &= ~DEACTIVATE_FILE; + } else + sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; + + /* + * If we have plenty of inactive file pages that aren't + * thrashing, try to reclaim those first before touching + * anonymous pages. + */ + file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); + if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) + sc->cache_trim_mode = 1; + else + sc->cache_trim_mode = 0; + + /* + * Prevent the reclaimer from falling into the cache trap: as + * cache pages start out inactive, every cache fault will tip + * the scan balance towards the file LRU. And as the file LRU + * shrinks, so does the window for rotation from references. + * This means we have a runaway feedback loop where a tiny + * thrashing file LRU becomes infinitely more attractive than + * anon pages. Try to detect this based on file LRU size. + */ + if (!cgroup_reclaim(sc)) { + unsigned long total_high_wmark = 0; + unsigned long free, anon; + int z; + + free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); + file = node_page_state(pgdat, NR_ACTIVE_FILE) + + node_page_state(pgdat, NR_INACTIVE_FILE); + + for (z = 0; z < MAX_NR_ZONES; z++) { + struct zone *zone = &pgdat->node_zones[z]; + + if (!managed_zone(zone)) + continue; + + total_high_wmark += high_wmark_pages(zone); + } + + /* + * Consider anon: if that's low too, this isn't a + * runaway file reclaim problem, but rather just + * extreme pressure. Reclaim as per usual then. + */ + anon = node_page_state(pgdat, NR_INACTIVE_ANON); + + sc->file_is_tiny = + file + free <= total_high_wmark && + !(sc->may_deactivate & DEACTIVATE_ANON) && + anon >> sc->priority; + } +} + /* * Determine how aggressively the anon and file LRU lists should be * scanned. The relative value of each set of LRU lists is determined @@ -2882,7 +2979,6 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) unsigned long nr_reclaimed, nr_scanned; struct lruvec *target_lruvec; bool reclaimable = false; - unsigned long file; target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); @@ -2892,93 +2988,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) nr_reclaimed = sc->nr_reclaimed; nr_scanned = sc->nr_scanned; - /* - * Determine the scan balance between anon and file LRUs. - */ - spin_lock_irq(&target_lruvec->lru_lock); - sc->anon_cost = target_lruvec->anon_cost; - sc->file_cost = target_lruvec->file_cost; - spin_unlock_irq(&target_lruvec->lru_lock); - - /* - * Target desirable inactive:active list ratios for the anon - * and file LRU lists. - */ - if (!sc->force_deactivate) { - unsigned long refaults; - - refaults = lruvec_page_state(target_lruvec, - WORKINGSET_ACTIVATE_ANON); - if (refaults != target_lruvec->refaults[0] || - inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) - sc->may_deactivate |= DEACTIVATE_ANON; - else - sc->may_deactivate &= ~DEACTIVATE_ANON; - - /* - * When refaults are being observed, it means a new - * workingset is being established. Deactivate to get - * rid of any stale active pages quickly. - */ - refaults = lruvec_page_state(target_lruvec, - WORKINGSET_ACTIVATE_FILE); - if (refaults != target_lruvec->refaults[1] || - inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) - sc->may_deactivate |= DEACTIVATE_FILE; - else - sc->may_deactivate &= ~DEACTIVATE_FILE; - } else - sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; - - /* - * If we have plenty of inactive file pages that aren't - * thrashing, try to reclaim those first before touching - * anonymous pages. - */ - file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); - if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) - sc->cache_trim_mode = 1; - else - sc->cache_trim_mode = 0; - - /* - * Prevent the reclaimer from falling into the cache trap: as - * cache pages start out inactive, every cache fault will tip - * the scan balance towards the file LRU. And as the file LRU - * shrinks, so does the window for rotation from references. - * This means we have a runaway feedback loop where a tiny - * thrashing file LRU becomes infinitely more attractive than - * anon pages. Try to detect this based on file LRU size. - */ - if (!cgroup_reclaim(sc)) { - unsigned long total_high_wmark = 0; - unsigned long free, anon; - int z; - - free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); - file = node_page_state(pgdat, NR_ACTIVE_FILE) + - node_page_state(pgdat, NR_INACTIVE_FILE); - - for (z = 0; z < MAX_NR_ZONES; z++) { - struct zone *zone = &pgdat->node_zones[z]; - if (!managed_zone(zone)) - continue; - - total_high_wmark += high_wmark_pages(zone); - } - - /* - * Consider anon: if that's low too, this isn't a - * runaway file reclaim problem, but rather just - * extreme pressure. Reclaim as per usual then. - */ - anon = node_page_state(pgdat, NR_INACTIVE_ANON); - - sc->file_is_tiny = - file + free <= total_high_wmark && - !(sc->may_deactivate & DEACTIVATE_ANON) && - anon >> sc->priority; - } + prepare_scan_count(pgdat, sc); shrink_node_memcgs(pgdat, sc); From patchwork Wed Aug 18 06:31:00 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443385 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 150A0C4320E for ; Wed, 18 Aug 2021 06:31:21 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 96E6A60720 for ; Wed, 18 Aug 2021 06:31:20 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 96E6A60720 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 5695E6B007B; Wed, 18 Aug 2021 02:31:18 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4C95A6B007D; Wed, 18 Aug 2021 02:31:18 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2CDA18D0001; Wed, 18 Aug 2021 02:31:18 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0109.hostedemail.com [216.40.44.109]) by kanga.kvack.org (Postfix) with ESMTP id 0239A6B007B for ; Wed, 18 Aug 2021 02:31:17 -0400 (EDT) Received: from smtpin37.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 9DBBA8249980 for ; Wed, 18 Aug 2021 06:31:17 +0000 (UTC) X-FDA: 78487229394.37.243B0BF Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf29.hostedemail.com (Postfix) with ESMTP id 43B37900802E for ; Wed, 18 Aug 2021 06:31:17 +0000 (UTC) Received: by mail-yb1-f202.google.com with SMTP id d69-20020a25e6480000b02904f4a117bd74so1786127ybh.17 for ; Tue, 17 Aug 2021 23:31:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=KLByEW8oqSJ6WHKGE1qm3K6ThZaMDwxGAGCFyocHr5I=; b=oSLNHRddKSi52S3NNpMxk7i0dMFCIRw/Ughyj9grAnePWCkCwCppAIjrxJ0L73mhlS +wrylPcHwcxJpGFA2mG8qU5IgizoMse5rdRTnytJrgOy4uNilAu6ADLCuhKAvZz1rGjg /fjzb0aI3dO+8gU3ssEOQHu1PTa6tNwzS3xhC9mbcQmrp8dDhoeq1VJFnoP4mxs17KTU DRRxvKj6ZVAXrBe/bxyA4S3SMK70FfzB+Vw1R4JvLOXSy0oInNR024Qri2xeQvWQSQRB QGFSOdNKyZnkhoJMzaS9/GqSLTRYg59ta/5CxuCfSkGoCUTqxE/Di/i4DFc+gOzYOcyZ 57RQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=KLByEW8oqSJ6WHKGE1qm3K6ThZaMDwxGAGCFyocHr5I=; b=JmOYqasbgRhMEog6OhY2Dz4pR+SIwaD2gVpW5IPjhL3yDp/Xa2ceSMjvQ3YfySP1CV 9AQWe6GdGfY7VDws0RNcP+GQy2XPTiaPH9B892Zhe+iGq5a/Hmi3GRxzDIpZTChc6ZVO /4FJlKyRiThwOq+44wH7sDD6x4HKl/WSSd3d/vHGpn4PjGrOPTFOwVeEd+2hR3AgM0VS fadhxfwCDf69884NHArbvtBQA3wFquVFFFLDrgRiJHeE2tlN1tEDB5fnb3VKLOQ/Q8mg fRons0RrQXenYmHptuplEyxFc7my9N/jiAxyNr5ZFkGcElMLRQA0Z+rG3Dm8k+EWhulf IHUw== X-Gm-Message-State: AOAM532doWUqf8fFpzk4OO5p6LCNKvsqoGzLw08lgrfdcjV+YUx/enTA 7V59Vo99eeN6rgDcVZEwKYo1TbLQU99OxTB8Mc+VdocDFshhq3djSgk0DXoM+0r77AqZn6/sN/V uteMn63fPsj6LoXB+jrjJsriywf3LOS+mG97Vt6itjZmUPE3iVxjp5yi9 X-Google-Smtp-Source: ABdhPJxWBteTbvBwVadJUVDS2TXnjjvk48ocUKWsqtHIFkmH4QUSDPet5WNamyiqbp6gJL2dSdW6Onc014o= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:4042:: with SMTP id n63mr8942131yba.254.1629268276531; Tue, 17 Aug 2021 23:31:16 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:00 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-5-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 04/11] mm: multigenerational lru: groundwork From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=oSLNHRdd; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf29.hostedemail.com: domain of 3NKkcYQYKCAk738qjxpxxpun.lxvurw36-vvt4jlt.x0p@flex--yuzhao.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3NKkcYQYKCAk738qjxpxxpun.lxvurw36-vvt4jlt.x0p@flex--yuzhao.bounces.google.com X-Stat-Signature: qxf8ht5ebkkgrypatqidrgos3oqwefkq X-Rspamd-Queue-Id: 43B37900802E X-Rspamd-Server: rspam05 X-HE-Tag: 1629268277-97904 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: For each lruvec, evictable pages are divided into multiple generations. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[2] separately for anon and file types as clean file pages can be evicted regardless of swap and writeback constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into page->flags. The sliding window technique is used to prevent truncated generation numbers from overlapping. Each truncated generation number is an index to lrugen->lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]. Each generation is then divided into multiple tiers. Tiers represent levels of usage from file descriptors only. Pages accessed N times via file descriptors belong to tier order_base_2(N). Each generation contains at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2 bits in page->flags. In contrast to moving across generations which requires list operations, moving across tiers only involves operations on page->flags and therefore has a negligible cost. A feedback loop modeled after the PID controller monitors refault rates of all tiers and decides when to protect pages from which tiers. The framework comprises two conceptually independent components: the aging and the eviction, which can be invoked separately from user space for the purpose of working set estimation and proactive reclaim. The aging produces young generations. Given an lruvec, the aging traverses lruvec_memcg()->mm_list and calls walk_page_range() to scan PTEs for accessed pages (a mm_struct list is maintained for each memcg). Upon finding one, the aging updates its generation number to max_seq (modulo MAX_NR_GENS). After each round of traversal, the aging increments max_seq. The aging is due when both min_seq[2] have caught up with max_seq-1. The eviction consumes old generations. Given an lruvec, the eviction scans pages on lrugen->lists indexed by anon and file min_seq[2] (modulo MAX_NR_GENS). It first tries to select a type based on the values of min_seq[2]. If they are equal, it selects the type that has a lower refault rate. The eviction sorts a page according to its updated generation number if the aging has found this page accessed. It also moves a page to the next generation if this page is from an upper tier that has a higher refault rate than the base tier. The eviction increments min_seq[2] of a selected type when it finds lrugen->lists indexed by min_seq[2] of this selected type are empty. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- fs/fuse/dev.c | 3 +- include/linux/cgroup.h | 15 +- include/linux/mm.h | 2 + include/linux/mm_inline.h | 201 ++++++++++++++++++ include/linux/mmzone.h | 92 +++++++++ include/linux/page-flags-layout.h | 19 +- include/linux/page-flags.h | 4 +- kernel/bounds.c | 3 + kernel/cgroup/cgroup-internal.h | 1 - mm/huge_memory.c | 3 +- mm/mm_init.c | 6 +- mm/mmzone.c | 2 + mm/swapfile.c | 2 + mm/vmscan.c | 329 ++++++++++++++++++++++++++++++ 14 files changed, 669 insertions(+), 13 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 1c8f79b3dd06..673d987652ee 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page) 1 << PG_active | 1 << PG_workingset | 1 << PG_reclaim | - 1 << PG_waiters))) { + 1 << PG_waiters | + LRU_GEN_MASK | LRU_USAGE_MASK))) { dump_page(page, "fuse: trying to steal weird page"); return 1; } diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 7bf60454a313..1ebc27c8fee7 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp) css_put(&cgrp->self); } +extern struct mutex cgroup_mutex; + +static inline void cgroup_lock(void) +{ + mutex_lock(&cgroup_mutex); +} + +static inline void cgroup_unlock(void) +{ + mutex_unlock(&cgroup_mutex); +} + /** * task_css_set_check - obtain a task's css_set with extra access conditions * @task: the task to obtain css_set for @@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp) * as locks used during the cgroup_subsys::attach() methods. */ #ifdef CONFIG_PROVE_RCU -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; #define task_css_set_check(task, __c) \ rcu_dereference_check((task)->cgroups, \ @@ -707,6 +718,8 @@ struct cgroup; static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } static inline void css_get(struct cgroup_subsys_state *css) {} static inline void css_put(struct cgroup_subsys_state *css) {} +static inline void cgroup_lock(void) {} +static inline void cgroup_unlock(void) {} static inline int cgroup_attach_task_all(struct task_struct *from, struct task_struct *t) { return 0; } static inline int cgroupstats_build(struct cgroupstats *stats, diff --git a/include/linux/mm.h b/include/linux/mm.h index 7ca22e6e694a..159b7c94e067 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1092,6 +1092,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) +#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH) +#define LRU_USAGE_PGOFF (LRU_GEN_PGOFF - LRU_USAGE_WIDTH) /* * Define the bit shifts to access each section. For non-existent diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 355ea1ee32bd..19e722ec7cf3 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -79,11 +79,206 @@ static __always_inline enum lru_list page_lru(struct page *page) return lru; } +#ifdef CONFIG_LRU_GEN + +#ifdef CONFIG_LRU_GEN_ENABLED +DECLARE_STATIC_KEY_TRUE(lru_gen_static_key); + +static inline bool lru_gen_enabled(void) +{ + return static_branch_likely(&lru_gen_static_key); +} +#else +DECLARE_STATIC_KEY_FALSE(lru_gen_static_key); + +static inline bool lru_gen_enabled(void) +{ + return static_branch_unlikely(&lru_gen_static_key); +} +#endif + +/* Return an index within the sliding window that tracks MAX_NR_GENS generations. */ +static inline int lru_gen_from_seq(unsigned long seq) +{ + return seq % MAX_NR_GENS; +} + +/* Return a proper index regardless whether we keep a full history of stats. */ +static inline int lru_hist_from_seq(int seq) +{ + return seq % NR_STAT_GENS; +} + +/* Convert the level of usage to a tier. See the comment on MAX_NR_TIERS. */ +static inline int lru_tier_from_usage(int usage) +{ + VM_BUG_ON(usage > BIT(LRU_USAGE_WIDTH)); + + return order_base_2(usage + 1); +} + +/* The youngest and the second youngest generations are counted as active. */ +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen) +{ + unsigned long max_seq = READ_ONCE(lruvec->evictable.max_seq); + + VM_BUG_ON(!max_seq); + VM_BUG_ON(gen >= MAX_NR_GENS); + + return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1); +} + +/* Update the sizes of the multigenerational lru lists. */ +static inline void lru_gen_update_size(struct page *page, struct lruvec *lruvec, + int old_gen, int new_gen) +{ + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + int delta = thp_nr_pages(page); + enum lru_list lru = type * LRU_FILE; + struct lrugen *lrugen = &lruvec->evictable; + + lockdep_assert_held(&lruvec->lru_lock); + VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS); + VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS); + VM_BUG_ON(old_gen == -1 && new_gen == -1); + + if (old_gen >= 0) + WRITE_ONCE(lrugen->sizes[old_gen][type][zone], + lrugen->sizes[old_gen][type][zone] - delta); + if (new_gen >= 0) + WRITE_ONCE(lrugen->sizes[new_gen][type][zone], + lrugen->sizes[new_gen][type][zone] + delta); + + if (old_gen < 0) { + if (lru_gen_is_active(lruvec, new_gen)) + lru += LRU_ACTIVE; + update_lru_size(lruvec, lru, zone, delta); + return; + } + + if (new_gen < 0) { + if (lru_gen_is_active(lruvec, old_gen)) + lru += LRU_ACTIVE; + update_lru_size(lruvec, lru, zone, -delta); + return; + } + + if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { + update_lru_size(lruvec, lru, zone, -delta); + update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); + } + + VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); +} + +/* Add a page to one of the multigenerational lru lists. Return true on success. */ +static inline bool lru_gen_add_page(struct page *page, struct lruvec *lruvec, bool reclaiming) +{ + int gen; + unsigned long old_flags, new_flags; + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + struct lrugen *lrugen = &lruvec->evictable; + + if (PageUnevictable(page) || !lrugen->enabled[type]) + return false; + /* + * If a page shouldn't be considered for eviction, i.e., a page mapped + * upon fault during which the accessed bit is set, add it to the + * youngest generation. + * + * If a page can't be evicted immediately, i.e., an anon page not in + * swap cache or a dirty page pending writeback, add it to the second + * oldest generation. + * + * If a page could be evicted immediately, e.g., a clean page, add it to + * the oldest generation. + */ + if (PageActive(page)) + gen = lru_gen_from_seq(lrugen->max_seq); + else if ((!type && !PageSwapCache(page)) || + (PageReclaim(page) && (PageDirty(page) || PageWriteback(page)))) + gen = lru_gen_from_seq(lrugen->min_seq[type] + 1); + else + gen = lru_gen_from_seq(lrugen->min_seq[type]); + + do { + new_flags = old_flags = READ_ONCE(page->flags); + VM_BUG_ON_PAGE(new_flags & LRU_GEN_MASK, page); + + new_flags &= ~(LRU_GEN_MASK | BIT(PG_active)); + new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); + + lru_gen_update_size(page, lruvec, -1, gen); + if (reclaiming) + list_add_tail(&page->lru, &lrugen->lists[gen][type][zone]); + else + list_add(&page->lru, &lrugen->lists[gen][type][zone]); + + return true; +} + +/* Delete a page from one of the multigenerational lru lists. Return true on success. */ +static inline bool lru_gen_del_page(struct page *page, struct lruvec *lruvec, bool reclaiming) +{ + int gen; + unsigned long old_flags, new_flags; + + do { + new_flags = old_flags = READ_ONCE(page->flags); + if (!(new_flags & LRU_GEN_MASK)) + return false; + + VM_BUG_ON_PAGE(PageActive(page), page); + VM_BUG_ON_PAGE(PageUnevictable(page), page); + + gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + + new_flags &= ~LRU_GEN_MASK; + if ((new_flags & LRU_TIER_FLAGS) != LRU_TIER_FLAGS) + new_flags &= ~(LRU_USAGE_MASK | LRU_TIER_FLAGS); + /* see the comment on PageReferenced()/PageReclaim() in shrink_page_list() */ + if (reclaiming) + new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim)); + else if (lru_gen_is_active(lruvec, gen)) + new_flags |= BIT(PG_active); + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); + + lru_gen_update_size(page, lruvec, gen, -1); + list_del(&page->lru); + + return true; +} + +#else /* CONFIG_LRU_GEN */ + +static inline bool lru_gen_enabled(void) +{ + return false; +} + +static inline bool lru_gen_add_page(struct page *page, struct lruvec *lruvec, bool reclaiming) +{ + return false; +} + +static inline bool lru_gen_del_page(struct page *page, struct lruvec *lruvec, bool reclaiming) +{ + return false; +} + +#endif /* CONFIG_LRU_GEN */ + static __always_inline void add_page_to_lru_list(struct page *page, struct lruvec *lruvec) { enum lru_list lru = page_lru(page); + if (lru_gen_add_page(page, lruvec, false)) + return; + update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); list_add(&page->lru, &lruvec->lists[lru]); } @@ -93,6 +288,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page, { enum lru_list lru = page_lru(page); + if (lru_gen_add_page(page, lruvec, true)) + return; + update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); list_add_tail(&page->lru, &lruvec->lists[lru]); } @@ -100,6 +298,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page, static __always_inline void del_page_from_lru_list(struct page *page, struct lruvec *lruvec) { + if (lru_gen_del_page(page, lruvec, false)) + return; + list_del(&page->lru); update_lru_size(lruvec, page_lru(page), page_zonenum(page), -thp_nr_pages(page)); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fcb535560028..d6c2c3a4ba43 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -294,6 +294,94 @@ enum lruvec_flags { */ }; +struct lruvec; + +#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) +#define LRU_USAGE_MASK ((BIT(LRU_USAGE_WIDTH) - 1) << LRU_USAGE_PGOFF) + +#ifdef CONFIG_LRU_GEN + +/* + * For each lruvec, evictable pages are divided into multiple generations. The + * youngest and the oldest generation numbers, AKA max_seq and min_seq, are + * monotonically increasing. The sliding window technique is used to track at + * most MAX_NR_GENS and at least MIN_NR_GENS generations. An offset within the + * window, AKA gen, indexes an array of per-type and per-zone lists for the + * corresponding generation. The counter in page->flags stores gen+1 while a + * page is on one of the multigenerational lru lists. Otherwise, it stores 0. + */ +#define MAX_NR_GENS ((unsigned int)CONFIG_NR_LRU_GENS) + +/* + * Each generation is then divided into multiple tiers. Tiers represent levels + * of usage from file descriptors, i.e., mark_page_accessed(). In contrast to + * moving across generations which requires the lru lock, moving across tiers + * only involves an atomic operation on page->flags and therefore has a + * negligible cost. + * + * The purposes of tiers are to: + * 1) estimate whether pages accessed multiple times via file descriptors are + * more active than pages accessed only via page tables by separating the two + * access types into upper tiers and the base tier and comparing refault rates + * across tiers. + * 2) improve buffered io performance by deferring the protection of pages + * accessed multiple times until the eviction. That is the protection happens + * in the reclaim path, not the access path. + * + * Pages accessed N times via file descriptors belong to tier order_base_2(N). + * The base tier may be marked by PageReferenced(). All upper tiers are marked + * by PageReferenced() && PageWorkingset(). Additional bits from page->flags are + * used to support more than one upper tier. + */ +#define MAX_NR_TIERS ((unsigned int)CONFIG_TIERS_PER_GEN) +#define LRU_TIER_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) + +/* Whether to keep historical stats for each generation. */ +#ifdef CONFIG_LRU_GEN_STATS +#define NR_STAT_GENS ((unsigned int)CONFIG_NR_LRU_GENS) +#else +#define NR_STAT_GENS 1U +#endif + +struct lrugen { + /* the aging increments the max generation number */ + unsigned long max_seq; + /* the eviction increments the min generation numbers */ + unsigned long min_seq[ANON_AND_FILE]; + /* the birth time of each generation in jiffies */ + unsigned long timestamps[MAX_NR_GENS]; + /* the multigenerational lru lists */ + struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* the sizes of the multigenerational lru lists in pages */ + unsigned long sizes[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* to determine which type and its tiers to evict */ + atomic_long_t refaulted[NR_STAT_GENS][ANON_AND_FILE][MAX_NR_TIERS]; + atomic_long_t evicted[NR_STAT_GENS][ANON_AND_FILE][MAX_NR_TIERS]; + /* the base tier isn't protected, hence the minus one */ + unsigned long protected[NR_STAT_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; + /* the exponential moving average of refaulted */ + unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; + /* the exponential moving average of evicted+protected */ + unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; + /* whether the multigenerational lru is enabled */ + bool enabled[ANON_AND_FILE]; +}; + +void lru_gen_init_lrugen(struct lruvec *lruvec); +void lru_gen_set_state(bool enable, bool main, bool swap); + +#else /* CONFIG_LRU_GEN */ + +static inline void lru_gen_init_lrugen(struct lruvec *lruvec) +{ +} + +static inline void lru_gen_set_state(bool enable, bool main, bool swap) +{ +} + +#endif /* CONFIG_LRU_GEN */ + struct lruvec { struct list_head lists[NR_LRU_LISTS]; /* per lruvec lru_lock for memcg */ @@ -311,6 +399,10 @@ struct lruvec { unsigned long refaults[ANON_AND_FILE]; /* Various lruvec state flags (enum lruvec_flags) */ unsigned long flags; +#ifdef CONFIG_LRU_GEN + /* unevictable pages are on LRU_UNEVICTABLE */ + struct lrugen evictable; +#endif #ifdef CONFIG_MEMCG struct pglist_data *pgdat; #endif diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index ef1e3e736e14..ce8d5732a3aa 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -26,6 +26,14 @@ #define ZONES_WIDTH ZONES_SHIFT +#ifdef CONFIG_LRU_GEN +/* LRU_GEN_WIDTH is generated from order_base_2(CONFIG_NR_LRU_GENS + 1). */ +#define LRU_USAGE_WIDTH (CONFIG_TIERS_PER_GEN - 2) +#else +#define LRU_GEN_WIDTH 0 +#define LRU_USAGE_WIDTH 0 +#endif + #ifdef CONFIG_SPARSEMEM #include #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) @@ -55,7 +63,8 @@ #define SECTIONS_WIDTH 0 #endif -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_USAGE_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \ + <= BITS_PER_LONG - NR_PAGEFLAGS #define NODES_WIDTH NODES_SHIFT #elif defined(CONFIG_SPARSEMEM_VMEMMAP) #error "Vmemmap: No space for nodes field in page flags" @@ -89,8 +98,8 @@ #define LAST_CPUPID_SHIFT 0 #endif -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \ - <= BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_USAGE_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT #else #define LAST_CPUPID_WIDTH 0 @@ -100,8 +109,8 @@ #define LAST_CPUPID_NOT_IN_PAGE_FLAGS #endif -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \ - > BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_USAGE_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS #error "Not enough bits in page flags" #endif diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 5922031ffab6..0156ac5f08f0 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -848,7 +848,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) 1UL << PG_private | 1UL << PG_private_2 | \ 1UL << PG_writeback | 1UL << PG_reserved | \ 1UL << PG_slab | 1UL << PG_active | \ - 1UL << PG_unevictable | __PG_MLOCKED) + 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK) /* * Flags checked when a page is prepped for return by the page allocator. @@ -859,7 +859,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) * alloc-free cycle to prevent from reusing the page. */ #define PAGE_FLAGS_CHECK_AT_PREP \ - (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) + ((((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_USAGE_MASK) #define PAGE_FLAGS_PRIVATE \ (1UL << PG_private | 1UL << PG_private_2) diff --git a/kernel/bounds.c b/kernel/bounds.c index 9795d75b09b2..aba13aa7336c 100644 --- a/kernel/bounds.c +++ b/kernel/bounds.c @@ -22,6 +22,9 @@ int main(void) DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS)); #endif DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); +#ifdef CONFIG_LRU_GEN + DEFINE(LRU_GEN_WIDTH, order_base_2(CONFIG_NR_LRU_GENS + 1)); +#endif /* End of constants */ return 0; diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index bfbeabc17a9d..bec59189e206 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -146,7 +146,6 @@ struct cgroup_mgctx { #define DEFINE_CGROUP_MGCTX(name) \ struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; extern struct cgroup_subsys *cgroup_subsys[]; extern struct list_head cgroup_roots; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index afff3ac87067..d5ccbfb50352 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2390,7 +2390,8 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif - (1L << PG_dirty))); + (1L << PG_dirty) | + LRU_GEN_MASK | LRU_USAGE_MASK)); /* ->mapping in first tail page is compound_mapcount */ VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, diff --git a/mm/mm_init.c b/mm/mm_init.c index 9ddaf0e1b0ab..ef0deadb90a7 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void) shift = 8 * sizeof(unsigned long); width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; + - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_USAGE_WIDTH; mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", - "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n", SECTIONS_WIDTH, NODES_WIDTH, ZONES_WIDTH, LAST_CPUPID_WIDTH, KASAN_TAG_WIDTH, + LRU_GEN_WIDTH, + LRU_USAGE_WIDTH, NR_PAGEFLAGS); mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts", "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", diff --git a/mm/mmzone.c b/mm/mmzone.c index eb89d6e018e2..2055d66a7f22 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec) for_each_lru(lru) INIT_LIST_HEAD(&lruvec->lists[lru]); + + lru_gen_init_lrugen(lruvec); } #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) diff --git a/mm/swapfile.c b/mm/swapfile.c index 1e07d1c776f2..19dacc4ae35e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2688,6 +2688,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) err = 0; atomic_inc(&proc_poll_event); wake_up_interruptible(&proc_poll_wait); + lru_gen_set_state(false, false, true); out_dput: filp_close(victim, NULL); @@ -3343,6 +3344,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) mutex_unlock(&swapon_mutex); atomic_inc(&proc_poll_event); wake_up_interruptible(&proc_poll_wait); + lru_gen_set_state(true, false, true); error = 0; goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index b6d14880bd76..a02b5ff37e31 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -49,6 +49,7 @@ #include #include #include +#include #include #include @@ -2731,6 +2732,334 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, } } +#ifdef CONFIG_LRU_GEN + +/* + * After a page is faulted in, the aging must scan it twice before the eviction + * can consider it. The first scan clears the accessed bit set during the + * initial fault. And the second scan makes sure it hasn't been used since the + * first scan. + */ +#define MIN_NR_GENS 2 + +#define MAX_BATCH_SIZE 8192 + +/****************************************************************************** + * shorthand helpers + ******************************************************************************/ + +#define DEFINE_MAX_SEQ(lruvec) \ + unsigned long max_seq = READ_ONCE((lruvec)->evictable.max_seq) + +#define DEFINE_MIN_SEQ(lruvec) \ + unsigned long min_seq[ANON_AND_FILE] = { \ + READ_ONCE((lruvec)->evictable.min_seq[0]), \ + READ_ONCE((lruvec)->evictable.min_seq[1]), \ + } + +#define for_each_type_zone(type, zone) \ + for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ + for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) + +#define for_each_gen_type_zone(gen, type, zone) \ + for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ + for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ + for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) + +static int page_lru_gen(struct page *page) +{ + unsigned long flags = READ_ONCE(page->flags); + + return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; +} + +static int page_lru_tier(struct page *page) +{ + int usage; + unsigned long flags = READ_ONCE(page->flags); + + usage = (flags & LRU_TIER_FLAGS) == LRU_TIER_FLAGS ? + ((flags & LRU_USAGE_MASK) >> LRU_USAGE_PGOFF) + 1 : 0; + + return lru_tier_from_usage(usage); +} + +static int get_lo_wmark(unsigned long max_seq, unsigned long *min_seq, int swappiness) +{ + return max_seq - max(min_seq[!swappiness], min_seq[1]) + 1; +} + +static int get_hi_wmark(unsigned long max_seq, unsigned long *min_seq, int swappiness) +{ + return max_seq - min(min_seq[!swappiness], min_seq[1]) + 1; +} + +static int get_nr_gens(struct lruvec *lruvec, int type) +{ + return lruvec->evictable.max_seq - lruvec->evictable.min_seq[type] + 1; +} + +static int get_swappiness(struct mem_cgroup *memcg) +{ + return mem_cgroup_get_nr_swap_pages(memcg) >= (long)SWAP_CLUSTER_MAX ? + mem_cgroup_swappiness(memcg) : 0; +} + +static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) +{ + return get_nr_gens(lruvec, 0) >= MIN_NR_GENS && + get_nr_gens(lruvec, 0) <= MAX_NR_GENS && + get_nr_gens(lruvec, 1) >= MIN_NR_GENS && + get_nr_gens(lruvec, 1) <= MAX_NR_GENS; +} + +/****************************************************************************** + * state change + ******************************************************************************/ + +#ifdef CONFIG_LRU_GEN_ENABLED +DEFINE_STATIC_KEY_TRUE(lru_gen_static_key); +#else +DEFINE_STATIC_KEY_FALSE(lru_gen_static_key); +#endif + +static DEFINE_MUTEX(lru_gen_state_mutex); +static int lru_gen_nr_swapfiles; + +static bool __maybe_unused state_is_valid(struct lruvec *lruvec) +{ + int gen, type, zone; + enum lru_list lru; + struct lrugen *lrugen = &lruvec->evictable; + + for_each_evictable_lru(lru) { + type = is_file_lru(lru); + + if (lrugen->enabled[type] && !list_empty(&lruvec->lists[lru])) + return false; + } + + for_each_gen_type_zone(gen, type, zone) { + if (!lrugen->enabled[type] && !list_empty(&lrugen->lists[gen][type][zone])) + return false; + + VM_WARN_ON_ONCE(!lrugen->enabled[type] && lrugen->sizes[gen][type][zone]); + } + + return true; +} + +static bool fill_lists(struct lruvec *lruvec) +{ + enum lru_list lru; + int remaining = MAX_BATCH_SIZE; + + for_each_evictable_lru(lru) { + int type = is_file_lru(lru); + bool active = is_active_lru(lru); + struct list_head *head = &lruvec->lists[lru]; + + if (!lruvec->evictable.enabled[type]) + continue; + + while (!list_empty(head)) { + bool success; + struct page *page = lru_to_page(head); + + VM_BUG_ON_PAGE(PageTail(page), page); + VM_BUG_ON_PAGE(PageUnevictable(page), page); + VM_BUG_ON_PAGE(PageActive(page) != active, page); + VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page); + VM_BUG_ON_PAGE(page_lru_gen(page) >= 0, page); + + prefetchw_prev_lru_page(page, head, flags); + + del_page_from_lru_list(page, lruvec); + success = lru_gen_add_page(page, lruvec, false); + VM_BUG_ON(!success); + + if (!--remaining) + return false; + } + } + + return true; +} + +static bool drain_lists(struct lruvec *lruvec) +{ + int gen, type, zone; + int remaining = MAX_BATCH_SIZE; + + for_each_gen_type_zone(gen, type, zone) { + struct list_head *head = &lruvec->evictable.lists[gen][type][zone]; + + if (lruvec->evictable.enabled[type]) + continue; + + while (!list_empty(head)) { + bool success; + struct page *page = lru_to_page(head); + + VM_BUG_ON_PAGE(PageTail(page), page); + VM_BUG_ON_PAGE(PageUnevictable(page), page); + VM_BUG_ON_PAGE(PageActive(page), page); + VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page); + VM_BUG_ON_PAGE(page_zonenum(page) != zone, page); + + prefetchw_prev_lru_page(page, head, flags); + + success = lru_gen_del_page(page, lruvec, false); + VM_BUG_ON(!success); + add_page_to_lru_list(page, lruvec); + + if (!--remaining) + return false; + } + } + + return true; +} + +/* + * For file page tracking, we enable/disable it according to the main switch. + * For anon page tracking, we only enabled it when the main switch is on and + * there is at least one swapfile; we disable it when there are no swapfiles + * regardless of the value of the main switch. Otherwise, we will eventually + * reach the max size of the sliding window and have to call inc_min_seq(). + */ +void lru_gen_set_state(bool enable, bool main, bool swap) +{ + struct mem_cgroup *memcg; + + mem_hotplug_begin(); + mutex_lock(&lru_gen_state_mutex); + cgroup_lock(); + + if (swap) { + if (enable) + swap = !lru_gen_nr_swapfiles++; + else + swap = !--lru_gen_nr_swapfiles; + } + + if (main && enable != lru_gen_enabled()) { + if (enable) + static_branch_enable(&lru_gen_static_key); + else + static_branch_disable(&lru_gen_static_key); + } else if (!swap || !lru_gen_enabled()) + goto unlock; + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node_state(nid, N_MEMORY) { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + struct lrugen *lrugen = &lruvec->evictable; + + spin_lock_irq(&lruvec->lru_lock); + + VM_BUG_ON(!seq_is_valid(lruvec)); + VM_BUG_ON(!state_is_valid(lruvec)); + + lrugen->enabled[0] = lru_gen_enabled() && lru_gen_nr_swapfiles; + lrugen->enabled[1] = lru_gen_enabled(); + + while (!(enable ? fill_lists(lruvec) : drain_lists(lruvec))) { + spin_unlock_irq(&lruvec->lru_lock); + cond_resched(); + spin_lock_irq(&lruvec->lru_lock); + } + + spin_unlock_irq(&lruvec->lru_lock); + } + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); +unlock: + cgroup_unlock(); + mutex_unlock(&lru_gen_state_mutex); + mem_hotplug_done(); +} + +static int __meminit __maybe_unused mem_notifier(struct notifier_block *self, + unsigned long action, void *arg) +{ + struct mem_cgroup *memcg; + struct pglist_data *pgdat; + struct memory_notify *mn = arg; + int nid = mn->status_change_nid; + + if (nid == NUMA_NO_NODE) + return NOTIFY_DONE; + + pgdat = NODE_DATA(nid); + + if (action != MEM_GOING_ONLINE) + return NOTIFY_DONE; + + mutex_lock(&lru_gen_state_mutex); + cgroup_lock(); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + struct lrugen *lrugen = &lruvec->evictable; + + VM_BUG_ON(!seq_is_valid(lruvec)); + VM_BUG_ON(!state_is_valid(lruvec)); + + lrugen->enabled[0] = lru_gen_enabled() && lru_gen_nr_swapfiles; + lrugen->enabled[1] = lru_gen_enabled(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + cgroup_unlock(); + mutex_unlock(&lru_gen_state_mutex); + + return NOTIFY_DONE; +} + +/****************************************************************************** + * initialization + ******************************************************************************/ + +void lru_gen_init_lrugen(struct lruvec *lruvec) +{ + int i; + int gen, type, zone; + struct lrugen *lrugen = &lruvec->evictable; + + lrugen->max_seq = MIN_NR_GENS + 1; + lrugen->enabled[0] = lru_gen_enabled() && lru_gen_nr_swapfiles; + lrugen->enabled[1] = lru_gen_enabled(); + + for (i = 0; i <= MIN_NR_GENS + 1; i++) + lrugen->timestamps[i] = jiffies; + + for_each_gen_type_zone(gen, type, zone) + INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); +} + +static int __init init_lru_gen(void) +{ + BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); + BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); + + if (hotplug_memory_notifier(mem_notifier, 0)) + pr_err("lru_gen: failed to subscribe hotplug notifications\n"); + + return 0; +}; +/* + * We want to run as early as possible because debug code may call mm_alloc() + * and mmput(). Our only dependency mm_kobj is initialized one stage earlier. + */ +arch_initcall(init_lru_gen); + +#endif /* CONFIG_LRU_GEN */ + static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; From patchwork Wed Aug 18 06:31:01 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443387 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 92D07C4338F for ; Wed, 18 Aug 2021 06:31:23 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3BCCD60720 for ; Wed, 18 Aug 2021 06:31:23 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 3BCCD60720 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 8EA5B6B007D; Wed, 18 Aug 2021 02:31:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 873738D0001; Wed, 18 Aug 2021 02:31:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6C5676B0080; Wed, 18 Aug 2021 02:31:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0086.hostedemail.com [216.40.44.86]) by kanga.kvack.org (Postfix) with ESMTP id 4C3D66B007D for ; Wed, 18 Aug 2021 02:31:19 -0400 (EDT) Received: from smtpin38.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 01DD522AC6 for ; Wed, 18 Aug 2021 06:31:19 +0000 (UTC) X-FDA: 78487229478.38.4FCB738 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) by imf11.hostedemail.com (Postfix) with ESMTP id A3FD6F0058B7 for ; Wed, 18 Aug 2021 06:31:18 +0000 (UTC) Received: by mail-yb1-f201.google.com with SMTP id a62-20020a254d410000b0290592f360b0ccso1779558ybb.14 for ; Tue, 17 Aug 2021 23:31:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=jfrjqXsmv61MIW7XVF0D5wfNnO+ULVoUNMa8niOlBGQ=; b=kEmCqE0fGoy+x3LZ6IN9OQroYs/M7PGfxceaaMiw0N9yv+Yhl2PxOGZNo4Z1kuK4RA oiDWUZPK+/InA0lpXD+L+PPIhmNjqLYh1Xk3UbIroWi5bpN8eAcHNrxeFGhRx2aBycXj VvsILW8bwMP5asnncGzE95IcZRx0XNv4BsR2/n8Z3vWqguhsgzHtbujMCLFCHdCV1tJN RWwoUL6R7nIm79J3XUy5ybYI0x4GV2eBXG8ogWZhhaMawVOtogbCVzfl7DYCiYJrpwxx OWEiJrDJ0yD9o4gcd+H8rAfxuBiT4Nmxp3WRk5uxqqAZrqU7xHfnluKQbBQx7CF0stvo Gcnw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=jfrjqXsmv61MIW7XVF0D5wfNnO+ULVoUNMa8niOlBGQ=; b=YuM6WAHkV6i5n+DFfdw4JW1PcazQZmmw09fr99EEwT4uo56N38/qxOYB6EivzV3Yca LIjwHiJkI409RTARgEXNkmss984Cvajjpgyb+kUFCGxb3sV5VQGsK+p5lUtVI1CeTyEB /qFHPbpHR0ORA4hEH1W9YLQgJAOVf6Z5hR60gvoKuCT+XxOCmRl9ElgW0KERbAUqqyzr xyLEJvDURuxcMwIUIMZmKr2y2oghZBJBS5HXujFM5kQC2y34XUOVcm5dmrRpQtUtLW84 gVbijejCCMfwsz3h7B43zRnMeP6wKP+5ZNXnN3Vr3xvI8vK9qZjXyragXP5+r7jD8M0n rR4w== X-Gm-Message-State: AOAM531PtWv8ev2ddQvJWS3updl/wCzKANw4xc9A6vNLCfWKQjxR98zh Z56/cGM0XJxLif31xAumwtHuRErxlQytFwwu57wOziMFRIFBGCEJnptI6Zo7y+zKmXMXhbDW4uH 5jYT/bBml4jX2CneGWgZkqMhwI4wcJnUh8V/Z1IoOMF+CzK+G38utAG6m X-Google-Smtp-Source: ABdhPJxVcQalP9BDTyAjtd3xQFG6qKv0PFi+YwMJh1S+798Q+5pcLqa0LKr8Xp9guTw6VHKvQ9x4/v7ys9A= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:38d1:: with SMTP id f200mr9696686yba.183.1629268277914; Tue, 17 Aug 2021 23:31:17 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:01 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-6-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 05/11] mm: multigenerational lru: protection From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=kEmCqE0f; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf11.hostedemail.com: domain of 3NakcYQYKCAo849rkyqyyqvo.mywvsx47-wwu5kmu.y1q@flex--yuzhao.bounces.google.com designates 209.85.219.201 as permitted sender) smtp.mailfrom=3NakcYQYKCAo849rkyqyyqvo.mywvsx47-wwu5kmu.y1q@flex--yuzhao.bounces.google.com X-Stat-Signature: imnj7fk4dgu1uknzz41wugrph3mjm1xd X-Rspamd-Queue-Id: A3FD6F0058B7 X-Rspamd-Server: rspam05 X-HE-Tag: 1629268278-394211 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The protection is based on page access types and patterns. There are two access types: one via page tables and the other via file descriptors. The protection of the former type is by design stronger because: 1) The uncertainty in determining the access patterns of the former type is higher due to the coalesced nature of the accessed bit. 2) The cost of evicting the former type is higher due to the TLB flushes required and the likelihood of involving I/O. 3) The penalty of under-protecting the former type is higher because applications usually do not prepare themselves for major faults like they do for blocked I/O. For example, client applications commonly dedicate blocked I/O to separate threads to avoid UI janks that negatively affect user experience. There are also two access patterns: one with temporal locality and the other without. The latter pattern, e.g., random and sequential, needs to be explicitly excluded to avoid weakening the protection of the former pattern. Generally the former type follows the former pattern unless MADV_SEQUENTIAL is specified and the latter type follows the latter pattern unless outlying refaults have been observed. Upon faulting, a page is added to the youngest generation, which provides the strongest protection as the eviction will not consider this page before the aging has scanned it at least twice. The first scan clears the accessed bit set during the initial fault. And the second scan makes sure this page has not been used since the first scan. A page from any other generations is brought back to the youngest generation whenever the aging finds the accessed bit set on any of the PTEs mapping this page. Unmapped pages are initially added to the oldest generation and then conditionally protected by tiers. Pages accessed N times via file descriptors belong to tier order_base_2(N). Each tier keeps track of how many pages from it have refaulted. Tier 0 is the base tier and pages from it are evicted unconditionally because there are no better candidates. Pages from an upper tier are either evicted or moved to the next generation, depending on whether this upper tier has a higher refault rate than the base tier. This model has the following advantages: 1) It removes the cost in the buffered access path and reduces the overall cost of protection because pages are conditionally protected in the reclaim path. 2) It takes mapped pages into account and avoids overprotecting pages accessed multiple times via file descriptors. 3 Additional tiers improve the protection of pages accessed more than twice. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- include/linux/mm.h | 32 ++++++++++++ include/linux/sched.h | 3 ++ mm/memory.c | 7 +++ mm/swap.c | 51 +++++++++++++++++- mm/vmscan.c | 91 +++++++++++++++++++++++++++++++- mm/workingset.c | 119 +++++++++++++++++++++++++++++++++++++++++- 6 files changed, 298 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 159b7c94e067..7a91518792ba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1778,6 +1778,25 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); + +static inline void task_enter_nonseq_fault(void) +{ + WARN_ON(current->in_nonseq_fault); + + current->in_nonseq_fault = 1; +} + +static inline void task_exit_nonseq_fault(void) +{ + WARN_ON(!current->in_nonseq_fault); + + current->in_nonseq_fault = 0; +} + +static inline bool task_in_nonseq_fault(void) +{ + return current->in_nonseq_fault; +} #else static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, @@ -1799,6 +1818,19 @@ static inline void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows) { } static inline void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows) { } + +static inline void task_enter_nonseq_fault(void) +{ +} + +static inline void task_exit_nonseq_fault(void) +{ +} + +static inline bool task_in_nonseq_fault(void) +{ + return false; +} #endif static inline void unmap_shared_mapping_range(struct address_space *mapping, diff --git a/include/linux/sched.h b/include/linux/sched.h index ec8d07d88641..fd41c9c86cd1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -843,6 +843,9 @@ struct task_struct { #ifdef CONFIG_MEMCG unsigned in_user_fault:1; #endif +#ifdef CONFIG_MMU + unsigned in_nonseq_fault:1; +#endif #ifdef CONFIG_COMPAT_BRK unsigned brk_randomized:1; #endif diff --git a/mm/memory.c b/mm/memory.c index 2f96179db219..fa40a5b7a7a7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4752,6 +4752,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, struct pt_regs *regs) { vm_fault_t ret; + bool nonseq_fault = !(vma->vm_flags & VM_SEQ_READ); __set_current_state(TASK_RUNNING); @@ -4773,11 +4774,17 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, if (flags & FAULT_FLAG_USER) mem_cgroup_enter_user_fault(); + if (nonseq_fault) + task_enter_nonseq_fault(); + if (unlikely(is_vm_hugetlb_page(vma))) ret = hugetlb_fault(vma->vm_mm, vma, address, flags); else ret = __handle_mm_fault(vma, address, flags); + if (nonseq_fault) + task_exit_nonseq_fault(); + if (flags & FAULT_FLAG_USER) { mem_cgroup_exit_user_fault(); /* diff --git a/mm/swap.c b/mm/swap.c index 19600430e536..0d3fb2ee3fd6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -411,6 +411,43 @@ static void __lru_cache_activate_page(struct page *page) local_unlock(&lru_pvecs.lock); } +#ifdef CONFIG_LRU_GEN +static void page_inc_usage(struct page *page) +{ + unsigned long usage; + unsigned long old_flags, new_flags; + + if (PageUnevictable(page)) + return; + + /* see the comment on MAX_NR_TIERS */ + do { + new_flags = old_flags = READ_ONCE(page->flags); + + if (!(new_flags & BIT(PG_referenced))) { + new_flags |= BIT(PG_referenced); + continue; + } + + if (!(new_flags & BIT(PG_workingset))) { + new_flags |= BIT(PG_workingset); + continue; + } + + usage = new_flags & LRU_USAGE_MASK; + usage = min(usage + BIT(LRU_USAGE_PGOFF), LRU_USAGE_MASK); + + new_flags &= ~LRU_USAGE_MASK; + new_flags |= usage; + } while (new_flags != old_flags && + cmpxchg(&page->flags, old_flags, new_flags) != old_flags); +} +#else +static void page_inc_usage(struct page *page) +{ +} +#endif /* CONFIG_LRU_GEN */ + /* * Mark a page as having seen activity. * @@ -425,6 +462,11 @@ void mark_page_accessed(struct page *page) { page = compound_head(page); + if (lru_gen_enabled()) { + page_inc_usage(page); + return; + } + if (!PageReferenced(page)) { SetPageReferenced(page); } else if (PageUnevictable(page)) { @@ -468,6 +510,11 @@ void lru_cache_add(struct page *page) VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page); VM_BUG_ON_PAGE(PageLRU(page), page); + /* see the comment in lru_gen_add_page() */ + if (lru_gen_enabled() && !PageUnevictable(page) && + task_in_nonseq_fault() && !(current->flags & PF_MEMALLOC)) + SetPageActive(page); + get_page(page); local_lock(&lru_pvecs.lock); pvec = this_cpu_ptr(&lru_pvecs.lru_add); @@ -569,7 +616,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) { - if (PageActive(page) && !PageUnevictable(page)) { + if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) { int nr_pages = thp_nr_pages(page); del_page_from_lru_list(page, lruvec); @@ -684,7 +731,7 @@ void deactivate_file_page(struct page *page) */ void deactivate_page(struct page *page) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { + if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) { struct pagevec *pvec; local_lock(&lru_pvecs.lock); diff --git a/mm/vmscan.c b/mm/vmscan.c index a02b5ff37e31..788b4d1ce149 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1094,9 +1094,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; - mem_cgroup_swapout(page, swap); + + /* get a shadow entry before page_memcg() is cleared */ if (reclaimed && !mapping_exiting(mapping)) shadow = workingset_eviction(page, target_memcg); + mem_cgroup_swapout(page, swap); __delete_from_swap_cache(page, swap, shadow); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); @@ -2813,6 +2815,93 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) get_nr_gens(lruvec, 1) <= MAX_NR_GENS; } +/****************************************************************************** + * refault feedback loop + ******************************************************************************/ + +/* + * A feedback loop modeled after the PID controller. Currently supports the + * proportional (P) and the integral (I) terms; the derivative (D) term can be + * added if necessary. The setpoint (SP) is the desired position; the process + * variable (PV) is the measured position. The error is the difference between + * the SP and the PV. A positive error results in a positive control output + * correction, which, in our case, is to allow eviction. + * + * The P term is the refault rate of the current generation being evicted. The I + * term is the exponential moving average of the refault rates of the previous + * generations, using the smoothing factor 1/2. + * + * Our goal is to make sure upper tiers have similar refault rates as the base + * tier. That is we try to be fair to all tiers by maintaining similar refault + * rates across them. + */ +struct controller_pos { + unsigned long refaulted; + unsigned long total; + int gain; +}; + +static void read_controller_pos(struct controller_pos *pos, struct lruvec *lruvec, + int type, int tier, int gain) +{ + struct lrugen *lrugen = &lruvec->evictable; + int hist = lru_hist_from_seq(lrugen->min_seq[type]); + + pos->refaulted = lrugen->avg_refaulted[type][tier] + + atomic_long_read(&lrugen->refaulted[hist][type][tier]); + pos->total = lrugen->avg_total[type][tier] + + atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + pos->total += lrugen->protected[hist][type][tier - 1]; + pos->gain = gain; +} + +static void reset_controller_pos(struct lruvec *lruvec, int gen, int type) +{ + int tier; + int hist = lru_hist_from_seq(gen); + struct lrugen *lrugen = &lruvec->evictable; + bool carryover = gen == lru_gen_from_seq(lrugen->min_seq[type]); + + if (!carryover && NR_STAT_GENS == 1) + return; + + for (tier = 0; tier < MAX_NR_TIERS; tier++) { + if (carryover) { + unsigned long sum; + + sum = lrugen->avg_refaulted[type][tier] + + atomic_long_read(&lrugen->refaulted[hist][type][tier]); + WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2); + + sum = lrugen->avg_total[type][tier] + + atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + sum += lrugen->protected[hist][type][tier - 1]; + WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2); + + if (NR_STAT_GENS > 1) + continue; + } + + atomic_long_set(&lrugen->refaulted[hist][type][tier], 0); + atomic_long_set(&lrugen->evicted[hist][type][tier], 0); + if (tier) + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0); + } +} + +static bool positive_ctrl_err(struct controller_pos *sp, struct controller_pos *pv) +{ + /* + * Allow eviction if the PV has a limited number of refaulted pages or a + * lower refault rate than the SP. + */ + return pv->refaulted < SWAP_CLUSTER_MAX || + pv->refaulted * max(sp->total, 1UL) * sp->gain <= + sp->refaulted * max(pv->total, 1UL) * pv->gain; +} + /****************************************************************************** * state change ******************************************************************************/ diff --git a/mm/workingset.c b/mm/workingset.c index 5ba3e42446fa..75dbfba773a6 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly; static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, bool workingset) { - eviction >>= bucket_order; eviction &= EVICTION_MASK; eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; eviction = (eviction << NODES_SHIFT) | pgdat->node_id; @@ -212,10 +211,116 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, *memcgidp = memcgid; *pgdat = NODE_DATA(nid); - *evictionp = entry << bucket_order; + *evictionp = entry; *workingsetp = workingset; } +#ifdef CONFIG_LRU_GEN + +static int page_get_usage(struct page *page) +{ + unsigned long flags = READ_ONCE(page->flags); + + BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_USAGE_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); + + /* see the comment on MAX_NR_TIERS */ + return flags & BIT(PG_workingset) ? + (flags & LRU_USAGE_MASK) >> LRU_USAGE_PGOFF : 0; +} + +/* Return a token to be stored in the shadow entry of a page being evicted. */ +static void *lru_gen_eviction(struct page *page) +{ + int hist, tier; + unsigned long token; + unsigned long min_seq; + struct lruvec *lruvec; + struct lrugen *lrugen; + int type = page_is_file_lru(page); + int usage = page_get_usage(page); + bool workingset = PageWorkingset(page); + struct mem_cgroup *memcg = page_memcg(page); + struct pglist_data *pgdat = page_pgdat(page); + + lruvec = mem_cgroup_lruvec(memcg, pgdat); + lrugen = &lruvec->evictable; + min_seq = READ_ONCE(lrugen->min_seq[type]); + token = (min_seq << LRU_USAGE_WIDTH) | usage; + + hist = lru_hist_from_seq(min_seq); + tier = lru_tier_from_usage(usage + workingset); + atomic_long_add(thp_nr_pages(page), &lrugen->evicted[hist][type][tier]); + + return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset); +} + +/* Count a refaulted page based on the token stored in its shadow entry. */ +static void lru_gen_refault(struct page *page, void *shadow) +{ + int hist, tier, usage; + int memcg_id; + bool workingset; + unsigned long token; + unsigned long min_seq; + struct lruvec *lruvec; + struct lrugen *lrugen; + struct mem_cgroup *memcg; + struct pglist_data *pgdat; + int type = page_is_file_lru(page); + + unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset); + if (page_pgdat(page) != pgdat) + return; + + rcu_read_lock(); + memcg = page_memcg_rcu(page); + if (mem_cgroup_id(memcg) != memcg_id) + goto unlock; + + usage = token & (BIT(LRU_USAGE_WIDTH) - 1); + if (usage && !workingset) + goto unlock; + + token >>= LRU_USAGE_WIDTH; + lruvec = mem_cgroup_lruvec(memcg, pgdat); + lrugen = &lruvec->evictable; + min_seq = READ_ONCE(lrugen->min_seq[type]); + if (token != (min_seq & (EVICTION_MASK >> LRU_USAGE_WIDTH))) + goto unlock; + + hist = lru_hist_from_seq(min_seq); + tier = lru_tier_from_usage(usage + workingset); + atomic_long_add(thp_nr_pages(page), &lrugen->refaulted[hist][type][tier]); + inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type); + + /* + * Tiers don't offer any protection to pages accessed via page tables. + * That's what generations do. Tiers can't fully protect pages after + * their usage has exceeded the max value. Conservatively count these + * two conditions as stalls even though they might not indicate any real + * memory pressure. + */ + if (task_in_nonseq_fault() || usage + workingset == BIT(LRU_USAGE_WIDTH)) { + SetPageWorkingset(page); + inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type); + } +unlock: + rcu_read_unlock(); +} + +#else /* CONFIG_LRU_GEN */ + +static void *lru_gen_eviction(struct page *page) +{ + return NULL; +} + +static void lru_gen_refault(struct page *page, void *shadow) +{ +} + +#endif /* CONFIG_LRU_GEN */ + /** * workingset_age_nonresident - age non-resident entries as LRU ages * @lruvec: the lruvec that was aged @@ -264,10 +369,14 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) VM_BUG_ON_PAGE(page_count(page), page); VM_BUG_ON_PAGE(!PageLocked(page), page); + if (lru_gen_enabled()) + return lru_gen_eviction(page); + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); /* XXX: target_memcg can be NULL, go through lruvec */ memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); eviction = atomic_long_read(&lruvec->nonresident_age); + eviction >>= bucket_order; workingset_age_nonresident(lruvec, thp_nr_pages(page)); return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); } @@ -296,7 +405,13 @@ void workingset_refault(struct page *page, void *shadow) bool workingset; int memcgid; + if (lru_gen_enabled()) { + lru_gen_refault(page, shadow); + return; + } + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); + eviction <<= bucket_order; rcu_read_lock(); /* From patchwork Wed Aug 18 06:31:02 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443389 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3DA1FC432BE for ; Wed, 18 Aug 2021 06:31:26 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C9E496103A for ; Wed, 18 Aug 2021 06:31:25 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org C9E496103A Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 454C36B007E; Wed, 18 Aug 2021 02:31:21 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 3DEF56B0080; Wed, 18 Aug 2021 02:31:21 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 27F2E6B0081; Wed, 18 Aug 2021 02:31:21 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0226.hostedemail.com [216.40.44.226]) by kanga.kvack.org (Postfix) with ESMTP id 0669D6B007E for ; Wed, 18 Aug 2021 02:31:21 -0400 (EDT) Received: from smtpin11.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id A3C7E1841C1EC for ; Wed, 18 Aug 2021 06:31:20 +0000 (UTC) X-FDA: 78487229520.11.54AB6F9 Received: from mail-il1-f202.google.com (mail-il1-f202.google.com [209.85.166.202]) by imf08.hostedemail.com (Postfix) with ESMTP id 5ACDF3004A58 for ; Wed, 18 Aug 2021 06:31:20 +0000 (UTC) Received: by mail-il1-f202.google.com with SMTP id c20-20020a9294140000b02902141528bc7cso698606ili.3 for ; Tue, 17 Aug 2021 23:31:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=aBs1pFrO53Rncja9fuL1uL4T+2CTQbWp/Gm5wFcNO54=; b=LGz2Nl/Mf512SIdop9E9gI9OF+TdKt1NO0NGoDEE8mnBztbmkJs12hsmhp4vglamuo kwiKQ2zz1tryY/kWMhRJ2lAkN2VVlWng+y8I9UKlUdW8DveB4he27WkolMLyEiki7AKE 8kIj9ed9E8Nq/nNhZODdHzd0qRHAyGCLDhIv52y4Aqv92izbx46So60qHY3aHT9gRQPa PBrNjB56MimjDIVp+AmFqcR1COdjH8QX+Tb8NHCz2zzq23z0QAkIsjS6YbANgjDjdMlJ O5MnS1rvZWXtyQmrdePmMfGff2dNs2P74rnHoGqeBiwdv/53DWQDwlEqF1rr+oXEnHIL Er7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=aBs1pFrO53Rncja9fuL1uL4T+2CTQbWp/Gm5wFcNO54=; b=ccQnJrqjc+Px59ahn30tVqxlwLL+/WUyTuUaS/iX3sNScM/Ko3BLid2e4srXqC0ERq JvydTCtrqIHV58lNPdUvxwPaPrprpfsi3qAe0N2HHJGL2BIm+c2QOaV8HvFDwKaMVJM4 UYD+LU2pILKA9dZhGKqVKaDWxKM+7Cvk2m3bQakDddDObqQrZMp+Dvihg7sdXIFkOSNA x9kB0PYfTlZvX8nGhgQx8Evd2yWq717x+KXUS1q/h4ncNBJV46jNPqEqvpPtMpMzFzNo 789TaLUSxksUVX4bU/v5seQqYVBrT82gBNmExxUT9jwX9EW+wzbzuM/qTJ+o72lah0M8 +ARw== X-Gm-Message-State: AOAM530ZdEdZ5pAeyqqqI9Bqgh0O+0ynkQYXkuziCf/B0ajH8122k5sT gls/SCTqGymPpAciOm9cOckNCn6AyseMrpxVCnMqOhbOtimXLd9GCDZRlAhVvxfNd3G5rlx4FkU s1F4KFfqscSOcvcR/Y5hS3eshWPNVSbwWr1xzeYofLjI1n9pm1x0JM4ai X-Google-Smtp-Source: ABdhPJxkf+A7lOf48PabGNLMm+/xDsY/XQj18H+ZNhHGpzdfJ357A44QxsN1scjme8OdHZpi43TRW9e6JvA= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a6b:2bd6:: with SMTP id r205mr6000886ior.122.1629268279406; Tue, 17 Aug 2021 23:31:19 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:02 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-7-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 06/11] mm: multigenerational lru: mm_struct list From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov X-Rspamd-Queue-Id: 5ACDF3004A58 Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b="LGz2Nl/M"; spf=pass (imf08.hostedemail.com: domain of 3N6kcYQYKCAwA6Btm0s00sxq.o0yxuz69-yyw7mow.03s@flex--yuzhao.bounces.google.com designates 209.85.166.202 as permitted sender) smtp.mailfrom=3N6kcYQYKCAwA6Btm0s00sxq.o0yxuz69-yyw7mow.03s@flex--yuzhao.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam01 X-Stat-Signature: 1z7x61aak98o6q9r3h9m8dhjfzp8w8cq X-HE-Tag: 1629268280-900776 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: To scan PTEs for accessed pages, a mm_struct list is maintained for each memcg. When multiple threads traverse the same memcg->mm_list, each of them gets a unique mm_struct and therefore they can run walk_page_range() concurrently to reach page tables of all processes of this memcg. And to skip page tables of processes that have been sleeping since the last walk, the usage of mm_struct is also tracked between context switches. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- fs/exec.c | 2 + include/linux/memcontrol.h | 6 + include/linux/mm_types.h | 107 +++++++++++++ kernel/exit.c | 1 + kernel/fork.c | 10 ++ kernel/kthread.c | 1 + kernel/sched/core.c | 2 + mm/memcontrol.c | 28 ++++ mm/vmscan.c | 313 +++++++++++++++++++++++++++++++++++++ 9 files changed, 470 insertions(+) diff --git a/fs/exec.c b/fs/exec.c index 38f63451b928..7ead083bcb39 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1005,6 +1005,7 @@ static int exec_mmap(struct mm_struct *mm) active_mm = tsk->active_mm; tsk->active_mm = mm; tsk->mm = mm; + lru_gen_add_mm(mm); /* * This prevents preemption while active_mm is being loaded and * it and mm are being updated, which could cause problems for @@ -1015,6 +1016,7 @@ static int exec_mmap(struct mm_struct *mm) if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) local_irq_enable(); activate_mm(active_mm, mm); + lru_gen_switch_mm(active_mm, mm); if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) local_irq_enable(); tsk->mm->vmacache_seqnum = 0; diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index bfe5c486f4ad..5e223cecb5c2 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -230,6 +230,8 @@ struct obj_cgroup { }; }; +struct lru_gen_mm_list; + /* * The memory controller data structure. The memory controller controls both * page cache and RSS per cgroup. We would eventually like to provide @@ -349,6 +351,10 @@ struct mem_cgroup { struct deferred_split deferred_split_queue; #endif +#ifdef CONFIG_LRU_GEN + struct lru_gen_mm_list *mm_list; +#endif + struct mem_cgroup_per_node *nodeinfo[]; }; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 52bbd2b7cb46..d9a2ba150ce8 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -15,6 +15,8 @@ #include #include #include +#include +#include #include @@ -571,6 +573,22 @@ struct mm_struct { #ifdef CONFIG_IOMMU_SUPPORT u32 pasid; +#endif +#ifdef CONFIG_LRU_GEN + struct { + /* the node of a global or per-memcg mm_struct list */ + struct list_head list; +#ifdef CONFIG_MEMCG + /* points to the memcg of the owner task above */ + struct mem_cgroup *memcg; +#endif + /* whether this mm_struct has been used since the last walk */ + nodemask_t nodes; +#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + /* the number of CPUs using this mm_struct */ + atomic_t nr_cpus; +#endif + } lrugen; #endif } __randomize_layout; @@ -598,6 +616,95 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) return (struct cpumask *)&mm->cpu_bitmap; } +#ifdef CONFIG_LRU_GEN + +void lru_gen_init_mm(struct mm_struct *mm); +void lru_gen_add_mm(struct mm_struct *mm); +void lru_gen_del_mm(struct mm_struct *mm); +#ifdef CONFIG_MEMCG +int lru_gen_alloc_mm_list(struct mem_cgroup *memcg); +void lru_gen_free_mm_list(struct mem_cgroup *memcg); +void lru_gen_migrate_mm(struct mm_struct *mm); +#endif + +/* Track the usage of each mm_struct so that we can skip inactive ones. */ +static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new) +{ + /* exclude init_mm, efi_mm, etc. */ + if (!core_kernel_data((unsigned long)old)) { + VM_BUG_ON(old == &init_mm); + + nodes_setall(old->lrugen.nodes); +#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + atomic_dec(&old->lrugen.nr_cpus); + VM_BUG_ON_MM(atomic_read(&old->lrugen.nr_cpus) < 0, old); +#endif + } else + VM_BUG_ON_MM(READ_ONCE(old->lrugen.list.prev) || + READ_ONCE(old->lrugen.list.next), old); + + if (!core_kernel_data((unsigned long)new)) { + VM_BUG_ON(new == &init_mm); + +#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + atomic_inc(&new->lrugen.nr_cpus); + VM_BUG_ON_MM(atomic_read(&new->lrugen.nr_cpus) < 0, new); +#endif + } else + VM_BUG_ON_MM(READ_ONCE(new->lrugen.list.prev) || + READ_ONCE(new->lrugen.list.next), new); +} + +/* Return whether this mm_struct is being used on any CPUs. */ +static inline bool lru_gen_mm_is_active(struct mm_struct *mm) +{ +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + return !cpumask_empty(mm_cpumask(mm)); +#else + return atomic_read(&mm->lrugen.nr_cpus); +#endif +} + +#else /* CONFIG_LRU_GEN */ + +static inline void lru_gen_init_mm(struct mm_struct *mm) +{ +} + +static inline void lru_gen_add_mm(struct mm_struct *mm) +{ +} + +static inline void lru_gen_del_mm(struct mm_struct *mm) +{ +} + +#ifdef CONFIG_MEMCG +static inline int lru_gen_alloc_mm_list(struct mem_cgroup *memcg) +{ + return 0; +} + +static inline void lru_gen_free_mm_list(struct mem_cgroup *memcg) +{ +} + +static inline void lru_gen_migrate_mm(struct mm_struct *mm) +{ +} +#endif + +static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new) +{ +} + +static inline bool lru_gen_mm_is_active(struct mm_struct *mm) +{ + return false; +} + +#endif /* CONFIG_LRU_GEN */ + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm); diff --git a/kernel/exit.c b/kernel/exit.c index 9a89e7f36acb..c24d5ffae792 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -422,6 +422,7 @@ void mm_update_next_owner(struct mm_struct *mm) goto retry; } WRITE_ONCE(mm->owner, c); + lru_gen_migrate_mm(mm); task_unlock(c); put_task_struct(c); } diff --git a/kernel/fork.c b/kernel/fork.c index bc94b2cc5995..e5f5dd5ac584 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -669,6 +669,7 @@ static void check_mm(struct mm_struct *mm) #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS VM_BUG_ON_MM(mm->pmd_huge_pte, mm); #endif + VM_BUG_ON_MM(lru_gen_mm_is_active(mm), mm); } #define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL)) @@ -1066,6 +1067,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, goto fail_nocontext; mm->user_ns = get_user_ns(user_ns); + lru_gen_init_mm(mm); return mm; fail_nocontext: @@ -1108,6 +1110,7 @@ static inline void __mmput(struct mm_struct *mm) } if (mm->binfmt) module_put(mm->binfmt->module); + lru_gen_del_mm(mm); mmdrop(mm); } @@ -2530,6 +2533,13 @@ pid_t kernel_clone(struct kernel_clone_args *args) get_task_struct(p); } + if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) { + /* lock the task to synchronize with memcg migration */ + task_lock(p); + lru_gen_add_mm(p->mm); + task_unlock(p); + } + wake_up_new_task(p); /* forking complete and child started to run, tell ptracer */ diff --git a/kernel/kthread.c b/kernel/kthread.c index 5b37a8567168..fd827fdad26b 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1361,6 +1361,7 @@ void kthread_use_mm(struct mm_struct *mm) tsk->mm = mm; membarrier_update_current_mm(mm); switch_mm_irqs_off(active_mm, mm, tsk); + lru_gen_switch_mm(active_mm, mm); local_irq_enable(); task_unlock(tsk); #ifdef finish_arch_post_lock_switch diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 20ffcc044134..eea1457704ed 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4665,6 +4665,7 @@ context_switch(struct rq *rq, struct task_struct *prev, * finish_task_switch()'s mmdrop(). */ switch_mm_irqs_off(prev->active_mm, next->mm, next); + lru_gen_switch_mm(prev->active_mm, next->mm); if (!prev->mm) { // from kernel /* will mmdrop() in finish_task_switch(). */ @@ -8391,6 +8392,7 @@ void idle_task_exit(void) if (mm != &init_mm) { switch_mm(mm, &init_mm, current); + lru_gen_switch_mm(mm, &init_mm); finish_arch_post_lock_switch(); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 702a81dfe72d..8597992797d0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5172,6 +5172,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->vmstats_percpu); + lru_gen_free_mm_list(memcg); kfree(memcg); } @@ -5221,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void) if (alloc_mem_cgroup_per_node_info(memcg, node)) goto fail; + if (lru_gen_alloc_mm_list(memcg)) + goto fail; + if (memcg_wb_domain_init(memcg, GFP_KERNEL)) goto fail; @@ -6182,6 +6186,29 @@ static void mem_cgroup_move_task(void) } #endif +#ifdef CONFIG_LRU_GEN +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ + struct cgroup_subsys_state *css; + struct task_struct *task = NULL; + + cgroup_taskset_for_each_leader(task, css, tset) + break; + + if (!task) + return; + + task_lock(task); + if (task->mm && task->mm->owner == task) + lru_gen_migrate_mm(task->mm); + task_unlock(task); +} +#else +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ +} +#endif + static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) { if (value == PAGE_COUNTER_MAX) @@ -6523,6 +6550,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_reset = mem_cgroup_css_reset, .css_rstat_flush = mem_cgroup_css_rstat_flush, .can_attach = mem_cgroup_can_attach, + .attach = mem_cgroup_attach, .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task, .dfl_cftypes = memory_files, diff --git a/mm/vmscan.c b/mm/vmscan.c index 788b4d1ce149..15eadf2a135e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2902,6 +2902,312 @@ static bool positive_ctrl_err(struct controller_pos *sp, struct controller_pos * sp->refaulted * max(pv->total, 1UL) * pv->gain; } +/****************************************************************************** + * mm_struct list + ******************************************************************************/ + +enum { + MM_SCHED_ACTIVE, /* running processes */ + MM_SCHED_INACTIVE, /* sleeping processes */ + MM_LOCK_CONTENTION, /* lock contentions */ + MM_VMA_INTERVAL, /* VMAs within the range of each PUD/PMD/PTE */ + MM_LEAF_OTHER_NODE, /* entries not from the node under reclaim */ + MM_LEAF_OTHER_MEMCG, /* entries not from the memcg under reclaim */ + MM_LEAF_OLD, /* old entries */ + MM_LEAF_YOUNG, /* young entries */ + MM_LEAF_DIRTY, /* dirty entries */ + MM_LEAF_HOLE, /* non-present entries */ + MM_NONLEAF_OLD, /* old non-leaf PMD entries */ + MM_NONLEAF_YOUNG, /* young non-leaf PMD entries */ + NR_MM_STATS +}; + +/* mnemonic codes for the stats above */ +#define MM_STAT_CODES "aicvnmoydhlu" + +struct lru_gen_mm_list { + /* the head of a global or per-memcg mm_struct list */ + struct list_head head; + /* protects the list */ + spinlock_t lock; + struct { + /* set to max_seq after each round of walk */ + unsigned long cur_seq; + /* the next mm_struct on the list to walk */ + struct list_head *iter; + /* to wait for the last walker to finish */ + struct wait_queue_head wait; + /* the number of concurrent walkers */ + int nr_walkers; + /* stats for debugging */ + unsigned long stats[NR_STAT_GENS][NR_MM_STATS]; + } nodes[0]; +}; + +static struct lru_gen_mm_list *global_mm_list; + +static struct lru_gen_mm_list *alloc_mm_list(void) +{ + int nid; + struct lru_gen_mm_list *mm_list; + + mm_list = kvzalloc(struct_size(mm_list, nodes, nr_node_ids), GFP_KERNEL); + if (!mm_list) + return NULL; + + INIT_LIST_HEAD(&mm_list->head); + spin_lock_init(&mm_list->lock); + + for_each_node(nid) { + mm_list->nodes[nid].cur_seq = MIN_NR_GENS; + mm_list->nodes[nid].iter = &mm_list->head; + init_waitqueue_head(&mm_list->nodes[nid].wait); + } + + return mm_list; +} + +static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg) +{ +#ifdef CONFIG_MEMCG + if (!mem_cgroup_disabled()) + return memcg ? memcg->mm_list : root_mem_cgroup->mm_list; +#endif + VM_BUG_ON(memcg); + + return global_mm_list; +} + +void lru_gen_init_mm(struct mm_struct *mm) +{ + INIT_LIST_HEAD(&mm->lrugen.list); +#ifdef CONFIG_MEMCG + mm->lrugen.memcg = NULL; +#endif +#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + atomic_set(&mm->lrugen.nr_cpus, 0); +#endif + nodes_clear(mm->lrugen.nodes); +} + +void lru_gen_add_mm(struct mm_struct *mm) +{ + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + + VM_BUG_ON_MM(!list_empty(&mm->lrugen.list), mm); +#ifdef CONFIG_MEMCG + VM_BUG_ON_MM(mm->lrugen.memcg, mm); + WRITE_ONCE(mm->lrugen.memcg, memcg); +#endif + spin_lock(&mm_list->lock); + list_add_tail(&mm->lrugen.list, &mm_list->head); + spin_unlock(&mm_list->lock); +} + +void lru_gen_del_mm(struct mm_struct *mm) +{ + int nid; +#ifdef CONFIG_MEMCG + struct lru_gen_mm_list *mm_list = get_mm_list(mm->lrugen.memcg); +#else + struct lru_gen_mm_list *mm_list = get_mm_list(NULL); +#endif + + spin_lock(&mm_list->lock); + + for_each_node(nid) { + if (mm_list->nodes[nid].iter != &mm->lrugen.list) + continue; + + mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next; + if (mm_list->nodes[nid].iter == &mm_list->head) + WRITE_ONCE(mm_list->nodes[nid].cur_seq, + mm_list->nodes[nid].cur_seq + 1); + } + + list_del_init(&mm->lrugen.list); + + spin_unlock(&mm_list->lock); + +#ifdef CONFIG_MEMCG + mem_cgroup_put(mm->lrugen.memcg); + WRITE_ONCE(mm->lrugen.memcg, NULL); +#endif +} + +#ifdef CONFIG_MEMCG +int lru_gen_alloc_mm_list(struct mem_cgroup *memcg) +{ + if (mem_cgroup_disabled()) + return 0; + + memcg->mm_list = alloc_mm_list(); + + return memcg->mm_list ? 0 : -ENOMEM; +} + +void lru_gen_free_mm_list(struct mem_cgroup *memcg) +{ + kvfree(memcg->mm_list); + memcg->mm_list = NULL; +} + +void lru_gen_migrate_mm(struct mm_struct *mm) +{ + struct mem_cgroup *memcg; + + lockdep_assert_held(&mm->owner->alloc_lock); + + if (mem_cgroup_disabled()) + return; + + rcu_read_lock(); + memcg = mem_cgroup_from_task(mm->owner); + rcu_read_unlock(); + if (memcg == mm->lrugen.memcg) + return; + + VM_BUG_ON_MM(!mm->lrugen.memcg, mm); + VM_BUG_ON_MM(list_empty(&mm->lrugen.list), mm); + + lru_gen_del_mm(mm); + lru_gen_add_mm(mm); +} + +static bool mm_is_migrated(struct mm_struct *mm, struct mem_cgroup *memcg) +{ + return READ_ONCE(mm->lrugen.memcg) != memcg; +} +#else +static bool mm_is_migrated(struct mm_struct *mm, struct mem_cgroup *memcg) +{ + return false; +} +#endif + +struct mm_walk_args { + struct mem_cgroup *memcg; + unsigned long max_seq; + unsigned long start_pfn; + unsigned long end_pfn; + unsigned long next_addr; + int node_id; + int swappiness; + int batch_size; + int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + int mm_stats[NR_MM_STATS]; + unsigned long bitmap[0]; +}; + +static void reset_mm_stats(struct lru_gen_mm_list *mm_list, bool last, + struct mm_walk_args *args) +{ + int i; + int nid = args->node_id; + int hist = lru_hist_from_seq(args->max_seq); + + lockdep_assert_held(&mm_list->lock); + + for (i = 0; i < NR_MM_STATS; i++) { + WRITE_ONCE(mm_list->nodes[nid].stats[hist][i], + mm_list->nodes[nid].stats[hist][i] + args->mm_stats[i]); + args->mm_stats[i] = 0; + } + + if (!last || NR_STAT_GENS == 1) + return; + + hist = lru_hist_from_seq(args->max_seq + 1); + for (i = 0; i < NR_MM_STATS; i++) + WRITE_ONCE(mm_list->nodes[nid].stats[hist][i], 0); +} + +static bool should_skip_mm(struct mm_struct *mm, struct mm_walk_args *args) +{ + int type; + unsigned long size = 0; + + if (!lru_gen_mm_is_active(mm) && !node_isset(args->node_id, mm->lrugen.nodes)) + return true; + + if (mm_is_oom_victim(mm)) + return true; + + for (type = !args->swappiness; type < ANON_AND_FILE; type++) { + size += type ? get_mm_counter(mm, MM_FILEPAGES) : + get_mm_counter(mm, MM_ANONPAGES) + + get_mm_counter(mm, MM_SHMEMPAGES); + } + + /* leave the legwork to the rmap if the mappings are too sparse */ + if (size < max(SWAP_CLUSTER_MAX, mm_pgtables_bytes(mm) / PAGE_SIZE)) + return true; + + return !mmget_not_zero(mm); +} + +/* To support multiple walkers that concurrently walk an mm_struct list. */ +static bool get_next_mm(struct mm_walk_args *args, struct mm_struct **iter) +{ + bool last = true; + struct mm_struct *mm = NULL; + int nid = args->node_id; + struct lru_gen_mm_list *mm_list = get_mm_list(args->memcg); + + if (*iter) + mmput_async(*iter); + else if (args->max_seq <= READ_ONCE(mm_list->nodes[nid].cur_seq)) + return false; + + spin_lock(&mm_list->lock); + + VM_BUG_ON(args->max_seq > mm_list->nodes[nid].cur_seq + 1); + VM_BUG_ON(*iter && args->max_seq < mm_list->nodes[nid].cur_seq); + VM_BUG_ON(*iter && !mm_list->nodes[nid].nr_walkers); + + if (args->max_seq <= mm_list->nodes[nid].cur_seq) { + last = *iter; + goto done; + } + + if (mm_list->nodes[nid].iter == &mm_list->head) { + VM_BUG_ON(*iter || mm_list->nodes[nid].nr_walkers); + mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next; + } + + while (!mm && mm_list->nodes[nid].iter != &mm_list->head) { + mm = list_entry(mm_list->nodes[nid].iter, struct mm_struct, lrugen.list); + mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next; + if (should_skip_mm(mm, args)) + mm = NULL; + + args->mm_stats[mm ? MM_SCHED_ACTIVE : MM_SCHED_INACTIVE]++; + } + + if (mm_list->nodes[nid].iter == &mm_list->head) + WRITE_ONCE(mm_list->nodes[nid].cur_seq, + mm_list->nodes[nid].cur_seq + 1); +done: + if (*iter && !mm) + mm_list->nodes[nid].nr_walkers--; + if (!*iter && mm) + mm_list->nodes[nid].nr_walkers++; + + last = last && !mm_list->nodes[nid].nr_walkers && + mm_list->nodes[nid].iter == &mm_list->head; + + reset_mm_stats(mm_list, last, args); + + spin_unlock(&mm_list->lock); + + *iter = mm; + if (mm) + node_clear(nid, mm->lrugen.nodes); + + return last; +} + /****************************************************************************** * state change ******************************************************************************/ @@ -3135,6 +3441,13 @@ static int __init init_lru_gen(void) { BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); + BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1); + + if (mem_cgroup_disabled()) { + global_mm_list = alloc_mm_list(); + if (!global_mm_list) + return -ENOMEM; + } if (hotplug_memory_notifier(mem_notifier, 0)) pr_err("lru_gen: failed to subscribe hotplug notifications\n"); From patchwork Wed Aug 18 06:31:03 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443391 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 358D8C4338F for ; Wed, 18 Aug 2021 06:31:29 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id B57026103A for ; Wed, 18 Aug 2021 06:31:28 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org B57026103A Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 604356B0080; Wed, 18 Aug 2021 02:31:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 53D226B0081; Wed, 18 Aug 2021 02:31:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2F5E56B0082; Wed, 18 Aug 2021 02:31:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0192.hostedemail.com [216.40.44.192]) by kanga.kvack.org (Postfix) with ESMTP id 09DB66B0080 for ; Wed, 18 Aug 2021 02:31:22 -0400 (EDT) Received: from smtpin04.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id B0BD61F04B for ; Wed, 18 Aug 2021 06:31:21 +0000 (UTC) X-FDA: 78487229562.04.D5BDBDC Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf03.hostedemail.com (Postfix) with ESMTP id 4A61C300E545 for ; Wed, 18 Aug 2021 06:31:21 +0000 (UTC) Received: by mail-yb1-f202.google.com with SMTP id j9-20020a2581490000b02905897d81c63fso1816607ybm.8 for ; Tue, 17 Aug 2021 23:31:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=z2kO3gETVLe3f6E2AsDubYJZRvsLJPaS2Ipw7IApwm0=; b=svUvVh0WR+PW4aMv/cjqN6UwuRRwX5PC1tZd6xXvNOy/dF8Chqt5qRRRjq0SnfJ2AF +cwdt9BhWmrjGBD/4Fg+vAciPLfsdMV2PLYq+fvQL3t4x04q1W3W8Oa3J5KSYyE3ZAV0 zR2Je0z2v+xfTvKPdP1h7Arw9sUZ7JdCwjjrDtrdnhxUez63ZVgIYusXupoB1KoCMu2V CqqVorApKEyBzmf5aj6oqIrJJJDqFjg9AvTIUKhgaVXO6alLlC0TJmIIV+05Jhg1JL0z AFIzfA1ROm9zCHUY5HHshQPJK4IA3ryYYMnXTfCFDbz350Tih+ZbrT0n96XqVqf7rZrr rNKA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=z2kO3gETVLe3f6E2AsDubYJZRvsLJPaS2Ipw7IApwm0=; b=Bm9copLbl8nQeW5UjmgKA65VYui/y0YN+MeQ0siGnhr2HqGIBpOHROqxb+dbATYMTy SKfEZ/X1SvXo4dV2NdGJXly2uGs5UwdhdjI9lpIZvRjW1dr64ml3gKeCvuQkESwMrmpY YCrFUpMqh+7dD61RLLjxvvjXveBPyuf3+k5UkbVpzEPF8NlEVpBC3oAnk8ol6aOem8AR LbS1mdhDAjqg+dCKsi13Qr0pvDvBbZnFrwxj9Ya4t5FelOE8qEE6lSRjtYARlQQNxsQ8 muu5k8yOk3Ja11wC9vHGLYrBTb+eTmwbxAEqgZqZ7fDmBeGqmSEXfigjjrw1jiOd+Vgb A/lg== X-Gm-Message-State: AOAM532agMBYw9DwnasD/ShdtM0nmWSlv15ktwVJVj4dytE/Mr3pdY+N b8yfDuHXgBaMDddf8V+W77el6NiJZfAx2k/gchr24XfsH/JQfO0gcJmH/S90IBvtPBvv1it9o7d jt6ekkXqn3AOmUJPQ/w11c7rUYQbpDCjBjNKt9ZNGlTwqegZSHNpMqCHt X-Google-Smtp-Source: ABdhPJwWgmQ2RqBHX6PIwvutRAQrPQjtw1s0Rd6t0slktT/1MnRcEEur7wEpXJENz+97f/aiiyb14mlPB5U= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:db82:: with SMTP id g124mr9256176ybf.46.1629268280599; Tue, 17 Aug 2021 23:31:20 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:03 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-8-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 07/11] mm: multigenerational lru: aging From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 4A61C300E545 X-Stat-Signature: i8kr6aabws985fdgph5mc45o6mnd6itz Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=svUvVh0W; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf03.hostedemail.com: domain of 3OKkcYQYKCA0B7Cun1t11tyr.p1zyv07A-zzx8npx.14t@flex--yuzhao.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3OKkcYQYKCA0B7Cun1t11tyr.p1zyv07A-zzx8npx.14t@flex--yuzhao.bounces.google.com X-HE-Tag: 1629268281-80077 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The aging produces young generations. Given an lruvec, the aging traverses lruvec_memcg()->mm_list and calls walk_page_range() to scan PTEs for accessed pages. Upon finding one, the aging updates its generation number to max_seq (modulo MAX_NR_GENS). After each round of traversal, the aging increments max_seq. The aging is due when both min_seq[2] have caught up with max_seq-1. The aging uses the following optimizations when walking page tables: 1) It skips page tables of processes that have been sleeping since the last walk. 2) It skips non-leaf PMD entries that have the accessed bit cleared when CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. 3) It does not zigzag between a PGD table and the same PMD or PTE table spanning multiple VMAs. In other words, it finishes all the VMAs within the range of the same PMD or PTE table before it returns to this PGD table. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- include/linux/memcontrol.h | 3 + include/linux/mmzone.h | 11 + include/linux/oom.h | 16 + include/linux/swap.h | 1 + mm/oom_kill.c | 4 +- mm/rmap.c | 7 + mm/swap.c | 4 +- mm/vmscan.c | 903 +++++++++++++++++++++++++++++++++++++ 8 files changed, 945 insertions(+), 4 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5e223cecb5c2..657d94344dfc 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1346,10 +1346,13 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) static inline void lock_page_memcg(struct page *page) { + /* to match page_memcg_rcu() */ + rcu_read_lock(); } static inline void unlock_page_memcg(struct page *page) { + rcu_read_unlock(); } static inline void mem_cgroup_handle_over_high(void) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d6c2c3a4ba43..b6005e881862 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -295,6 +295,7 @@ enum lruvec_flags { }; struct lruvec; +struct page_vma_mapped_walk; #define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) #define LRU_USAGE_MASK ((BIT(LRU_USAGE_WIDTH) - 1) << LRU_USAGE_PGOFF) @@ -369,6 +370,7 @@ struct lrugen { void lru_gen_init_lrugen(struct lruvec *lruvec); void lru_gen_set_state(bool enable, bool main, bool swap); +void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw); #else /* CONFIG_LRU_GEN */ @@ -380,6 +382,10 @@ static inline void lru_gen_set_state(bool enable, bool main, bool swap) { } +static inline void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw) +{ +} + #endif /* CONFIG_LRU_GEN */ struct lruvec { @@ -874,6 +880,8 @@ struct deferred_split { }; #endif +struct mm_walk_args; + /* * On NUMA machines, each NUMA node would have a pg_data_t to describe * it's memory layout. On UMA machines there is a single pglist_data which @@ -979,6 +987,9 @@ typedef struct pglist_data { unsigned long flags; +#ifdef CONFIG_LRU_GEN + struct mm_walk_args *mm_walk_args; +#endif ZONE_PADDING(_pad2_) /* Per-node vmstats */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 2db9a1432511..c4c8c7e71099 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -57,6 +57,22 @@ struct oom_control { extern struct mutex oom_lock; extern struct mutex oom_adj_mutex; +#ifdef CONFIG_MMU +extern struct task_struct *oom_reaper_list; +extern struct wait_queue_head oom_reaper_wait; + +static inline bool oom_reaping_in_progress(void) +{ + /* racy check to see if oom reaping could be in progress */ + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait); +} +#else +static inline bool oom_reaping_in_progress(void) +{ + return false; +} +#endif + static inline void set_current_oom_origin(void) { current->signal->oom_flag_origin = true; diff --git a/include/linux/swap.h b/include/linux/swap.h index 6f5a43251593..c838e67dfa3a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -368,6 +368,7 @@ extern void lru_add_drain_all(void); extern void rotate_reclaimable_page(struct page *page); extern void deactivate_file_page(struct page *page); extern void deactivate_page(struct page *page); +extern void activate_page(struct page *page); extern void mark_page_lazyfree(struct page *page); extern void swap_setup(void); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index c729a4c4a1ac..eca484ee3a3d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -507,8 +507,8 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) * victim (if that is possible) to help the OOM killer to move on. */ static struct task_struct *oom_reaper_th; -static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); -static struct task_struct *oom_reaper_list; +DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); +struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); bool __oom_reap_task_mm(struct mm_struct *mm) diff --git a/mm/rmap.c b/mm/rmap.c index b9eb5c12f3fe..f4963d60ff68 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -72,6 +72,7 @@ #include #include #include +#include #include @@ -789,6 +790,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, } if (pvmw.pte) { + /* the multigenerational lru exploits the spatial locality */ + if (lru_gen_enabled() && pte_young(*pvmw.pte) && + !(vma->vm_flags & VM_SEQ_READ)) { + lru_gen_scan_around(&pvmw); + referenced++; + } if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) { /* diff --git a/mm/swap.c b/mm/swap.c index 0d3fb2ee3fd6..0315cfa9fa41 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -347,7 +347,7 @@ static bool need_activate_page_drain(int cpu) return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0; } -static void activate_page(struct page *page) +void activate_page(struct page *page) { page = compound_head(page); if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { @@ -367,7 +367,7 @@ static inline void activate_page_drain(int cpu) { } -static void activate_page(struct page *page) +void activate_page(struct page *page) { struct lruvec *lruvec; diff --git a/mm/vmscan.c b/mm/vmscan.c index 15eadf2a135e..757ba4f415cc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -50,6 +50,8 @@ #include #include #include +#include +#include #include #include @@ -3208,6 +3210,883 @@ static bool get_next_mm(struct mm_walk_args *args, struct mm_struct **iter) return last; } +/****************************************************************************** + * the aging + ******************************************************************************/ + +static int page_update_gen(struct page *page, int gen) +{ + unsigned long old_flags, new_flags; + + VM_BUG_ON(gen >= MAX_NR_GENS); + + do { + new_flags = old_flags = READ_ONCE(page->flags); + + if (!(new_flags & LRU_GEN_MASK)) { + new_flags |= BIT(PG_referenced); + continue; + } + + new_flags &= ~(LRU_GEN_MASK | LRU_USAGE_MASK | LRU_TIER_FLAGS); + new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; + } while (new_flags != old_flags && + cmpxchg(&page->flags, old_flags, new_flags) != old_flags); + + return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; +} + +static void page_inc_gen(struct page *page, struct lruvec *lruvec, bool reclaiming) +{ + int old_gen, new_gen; + unsigned long old_flags, new_flags; + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + struct lrugen *lrugen = &lruvec->evictable; + + old_gen = lru_gen_from_seq(lrugen->min_seq[type]); + + do { + new_flags = old_flags = READ_ONCE(page->flags); + VM_BUG_ON_PAGE(!(new_flags & LRU_GEN_MASK), page); + + new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + /* page_update_gen() has updated this page? */ + if (new_gen >= 0 && new_gen != old_gen) { + list_move(&page->lru, &lrugen->lists[new_gen][type][zone]); + return; + } + + new_gen = (old_gen + 1) % MAX_NR_GENS; + + new_flags &= ~(LRU_GEN_MASK | LRU_USAGE_MASK | LRU_TIER_FLAGS); + new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; + /* for rotate_reclaimable_page() */ + if (reclaiming) + new_flags |= BIT(PG_reclaim); + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); + + lru_gen_update_size(page, lruvec, old_gen, new_gen); + if (reclaiming) + list_move(&page->lru, &lrugen->lists[new_gen][type][zone]); + else + list_move_tail(&page->lru, &lrugen->lists[new_gen][type][zone]); +} + +static void update_batch_size(struct page *page, int old_gen, int new_gen, + struct mm_walk_args *args) +{ + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + int delta = thp_nr_pages(page); + + VM_BUG_ON(old_gen >= MAX_NR_GENS); + VM_BUG_ON(new_gen >= MAX_NR_GENS); + + args->batch_size++; + + args->nr_pages[old_gen][type][zone] -= delta; + args->nr_pages[new_gen][type][zone] += delta; +} + +static void reset_batch_size(struct lruvec *lruvec, struct mm_walk_args *args) +{ + int gen, type, zone; + struct lrugen *lrugen = &lruvec->evictable; + + if (!args->batch_size) + return; + + args->batch_size = 0; + + spin_lock_irq(&lruvec->lru_lock); + + for_each_gen_type_zone(gen, type, zone) { + enum lru_list lru = type * LRU_FILE; + int total = args->nr_pages[gen][type][zone]; + + if (!total) + continue; + + args->nr_pages[gen][type][zone] = 0; + WRITE_ONCE(lrugen->sizes[gen][type][zone], + lrugen->sizes[gen][type][zone] + total); + + if (lru_gen_is_active(lruvec, gen)) + lru += LRU_ACTIVE; + update_lru_size(lruvec, lru, zone, total); + } + + spin_unlock_irq(&lruvec->lru_lock); +} + +static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk) +{ + struct address_space *mapping; + struct vm_area_struct *vma = walk->vma; + struct mm_walk_args *args = walk->private; + + if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) || + (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ))) + return true; + + if (vma_is_anonymous(vma)) + return !args->swappiness; + + if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping)) + return true; + + mapping = vma->vm_file->f_mapping; + if (!mapping->a_ops->writepage) + return true; + + return (shmem_mapping(mapping) && !args->swappiness) || mapping_unevictable(mapping); +} + +/* + * Some userspace memory allocators create many single-page VMAs. So instead of + * returning back to the PGD table for each of such VMAs, we finish at least an + * entire PMD table and therefore avoid many zigzags. + */ +static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size, + unsigned long *start, unsigned long *end) +{ + unsigned long next = round_up(*end, size); + struct mm_walk_args *args = walk->private; + + VM_BUG_ON(mask & size); + VM_BUG_ON(*start >= *end); + VM_BUG_ON((next & mask) != (*start & mask)); + + while (walk->vma) { + if (next >= walk->vma->vm_end) { + walk->vma = walk->vma->vm_next; + continue; + } + + if ((next & mask) != (walk->vma->vm_start & mask)) + return false; + + if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) { + walk->vma = walk->vma->vm_next; + continue; + } + + *start = max(next, walk->vma->vm_start); + next = (next | ~mask) + 1; + /* rounded-up boundaries can wrap to 0 */ + *end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end; + + args->mm_stats[MM_VMA_INTERVAL]++; + + return true; + } + + return false; +} + +static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int i; + pte_t *pte; + spinlock_t *ptl; + unsigned long addr; + int remote = 0; + struct mm_walk_args *args = walk->private; + int old_gen, new_gen = lru_gen_from_seq(args->max_seq); + + VM_BUG_ON(pmd_leaf(*pmd)); + + pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl); + arch_enter_lazy_mmu_mode(); +restart: + for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { + struct page *page; + unsigned long pfn = pte_pfn(pte[i]); + + if (!pte_present(pte[i]) || is_zero_pfn(pfn)) { + args->mm_stats[MM_LEAF_HOLE]++; + continue; + } + + if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i]))) + continue; + + if (!pte_young(pte[i])) { + args->mm_stats[MM_LEAF_OLD]++; + continue; + } + + VM_BUG_ON(!pfn_valid(pfn)); + if (pfn < args->start_pfn || pfn >= args->end_pfn) { + args->mm_stats[MM_LEAF_OTHER_NODE]++; + remote++; + continue; + } + + page = compound_head(pfn_to_page(pfn)); + if (page_to_nid(page) != args->node_id) { + args->mm_stats[MM_LEAF_OTHER_NODE]++; + remote++; + continue; + } + + if (page_memcg_rcu(page) != args->memcg) { + args->mm_stats[MM_LEAF_OTHER_MEMCG]++; + continue; + } + + VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end); + if (!ptep_test_and_clear_young(walk->vma, addr, pte + i)) + continue; + + if (pte_dirty(pte[i]) && !PageDirty(page) && + !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page))) { + set_page_dirty(page); + args->mm_stats[MM_LEAF_DIRTY]++; + } + + old_gen = page_update_gen(page, new_gen); + if (old_gen >= 0 && old_gen != new_gen) + update_batch_size(page, old_gen, new_gen, args); + args->mm_stats[MM_LEAF_YOUNG]++; + } + + if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end)) + goto restart; + + arch_leave_lazy_mmu_mode(); + pte_unmap_unlock(pte, ptl); + + return IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && !remote; +} + +/* + * We scan PMD entries in two passes. The first pass reaches to PTE tables and + * doesn't take the PMD lock. The second pass clears the accessed bit on PMD + * entries and needs to take the PMD lock. + */ +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) +static void walk_pmd_range_locked(pud_t *pud, unsigned long start, + struct vm_area_struct *vma, struct mm_walk *walk) +{ + int i; + pmd_t *pmd; + spinlock_t *ptl; + struct mm_walk_args *args = walk->private; + int old_gen, new_gen = lru_gen_from_seq(args->max_seq); + + VM_BUG_ON(pud_leaf(*pud)); + + start &= PUD_MASK; + pmd = pmd_offset(pud, start); + ptl = pmd_lock(walk->mm, pmd); + arch_enter_lazy_mmu_mode(); + + for_each_set_bit(i, args->bitmap, PTRS_PER_PMD) { + struct page *page; + unsigned long pfn = pmd_pfn(pmd[i]); + unsigned long addr = start + i * PMD_SIZE; + + if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i])) { + args->mm_stats[MM_LEAF_HOLE]++; + continue; + } + + if (WARN_ON_ONCE(pmd_devmap(pmd[i]))) + continue; + + if (!pmd_young(pmd[i])) { + args->mm_stats[MM_LEAF_OLD]++; + continue; + } + + if (!pmd_trans_huge(pmd[i])) { + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && + pmdp_test_and_clear_young(vma, addr, pmd + i)) + args->mm_stats[MM_NONLEAF_YOUNG]++; + continue; + } + + VM_BUG_ON(!pfn_valid(pfn)); + if (pfn < args->start_pfn || pfn >= args->end_pfn) { + args->mm_stats[MM_LEAF_OTHER_NODE]++; + continue; + } + + page = pfn_to_page(pfn); + VM_BUG_ON_PAGE(PageTail(page), page); + if (page_to_nid(page) != args->node_id) { + args->mm_stats[MM_LEAF_OTHER_NODE]++; + continue; + } + + if (page_memcg_rcu(page) != args->memcg) { + args->mm_stats[MM_LEAF_OTHER_MEMCG]++; + continue; + } + + VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end); + if (!pmdp_test_and_clear_young(vma, addr, pmd + i)) + continue; + + if (pmd_dirty(pmd[i]) && !PageDirty(page) && + !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page))) { + set_page_dirty(page); + args->mm_stats[MM_LEAF_DIRTY]++; + } + + old_gen = page_update_gen(page, new_gen); + if (old_gen >= 0 && old_gen != new_gen) + update_batch_size(page, old_gen, new_gen, args); + args->mm_stats[MM_LEAF_YOUNG]++; + } + + arch_leave_lazy_mmu_mode(); + spin_unlock(ptl); + + bitmap_zero(args->bitmap, PTRS_PER_PMD); +} +#else +static void walk_pmd_range_locked(pud_t *pud, unsigned long start, + struct vm_area_struct *vma, struct mm_walk *walk) +{ +} +#endif + +static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int i; + pmd_t *pmd; + unsigned long next; + unsigned long addr; + struct vm_area_struct *vma; + int leaf = 0; + int nonleaf = 0; + struct mm_walk_args *args = walk->private; + + VM_BUG_ON(pud_leaf(*pud)); + + pmd = pmd_offset(pud, start & PUD_MASK); +restart: + vma = walk->vma; + for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) { + pmd_t val = pmd_read_atomic(pmd + i); + + /* for pmd_read_atomic() */ + barrier(); + + next = pmd_addr_end(addr, end); + + if (!pmd_present(val)) { + args->mm_stats[MM_LEAF_HOLE]++; + continue; + } + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (pmd_trans_huge(val)) { + unsigned long pfn = pmd_pfn(val); + + if (is_huge_zero_pmd(val)) { + args->mm_stats[MM_LEAF_HOLE]++; + continue; + } + + if (!pmd_young(val)) { + args->mm_stats[MM_LEAF_OLD]++; + continue; + } + + if (pfn < args->start_pfn || pfn >= args->end_pfn) { + args->mm_stats[MM_LEAF_OTHER_NODE]++; + continue; + } + + __set_bit(i, args->bitmap); + leaf++; + continue; + } +#endif + +#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG + if (!pmd_young(val)) { + args->mm_stats[MM_NONLEAF_OLD]++; + continue; + } +#endif + if (walk_pte_range(&val, addr, next, walk)) { + __set_bit(i, args->bitmap); + nonleaf++; + } + } + + if (leaf) { + walk_pmd_range_locked(pud, start, vma, walk); + leaf = nonleaf = 0; + } + + if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end)) + goto restart; + + if (nonleaf) + walk_pmd_range_locked(pud, start, vma, walk); +} + +static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int i; + pud_t *pud; + unsigned long addr; + unsigned long next; + struct mm_walk_args *args = walk->private; + + VM_BUG_ON(p4d_leaf(*p4d)); + + pud = pud_offset(p4d, start & P4D_MASK); +restart: + for (i = pud_index(start), addr = start; addr != end; i++, addr = next) { + pud_t val = READ_ONCE(pud[i]); + + next = pud_addr_end(addr, end); + + if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val))) + continue; + + walk_pmd_range(&val, addr, next, walk); + + if (args->batch_size >= MAX_BATCH_SIZE) { + end = (addr | ~PUD_MASK) + 1; + goto done; + } + } + + if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end)) + goto restart; + + end = round_up(end, P4D_SIZE); +done: + /* rounded-up boundaries can wrap to 0 */ + args->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0; + + return -EAGAIN; +} + +static void walk_mm(struct mm_walk_args *args, struct mm_struct *mm) +{ + static const struct mm_walk_ops mm_walk_ops = { + .test_walk = should_skip_vma, + .p4d_entry = walk_pud_range, + }; + + int err; + struct mem_cgroup *memcg = args->memcg; + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(args->node_id)); + + args->next_addr = FIRST_USER_ADDRESS; + + do { + unsigned long start = args->next_addr; + unsigned long end = mm->highest_vm_end; + + err = -EBUSY; + + rcu_read_lock(); +#ifdef CONFIG_MEMCG + if (memcg && atomic_read(&memcg->moving_account)) { + args->mm_stats[MM_LOCK_CONTENTION]++; + goto contended; + } +#endif + if (!mmap_read_trylock(mm)) { + args->mm_stats[MM_LOCK_CONTENTION]++; + goto contended; + } + + err = walk_page_range(mm, start, end, &mm_walk_ops, args); + + mmap_read_unlock(mm); + + reset_batch_size(lruvec, args); +contended: + rcu_read_unlock(); + + cond_resched(); + } while (err == -EAGAIN && args->next_addr && + !mm_is_oom_victim(mm) && !mm_is_migrated(mm, memcg)); +} + +static struct mm_walk_args *alloc_mm_walk_args(int nid) +{ + struct pglist_data *pgdat; + int size = sizeof(struct mm_walk_args); + + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || + IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)) + size += sizeof(unsigned long) * BITS_TO_LONGS(PTRS_PER_PMD); + + if (!current_is_kswapd()) + return kvzalloc_node(size, GFP_KERNEL, nid); + + VM_BUG_ON(nid == NUMA_NO_NODE); + + pgdat = NODE_DATA(nid); + if (!pgdat->mm_walk_args) + pgdat->mm_walk_args = kvzalloc_node(size, GFP_KERNEL, nid); + + return pgdat->mm_walk_args; +} + +static void free_mm_walk_args(struct mm_walk_args *args) +{ + if (!current_is_kswapd()) + kvfree(args); +} + +static bool inc_min_seq(struct lruvec *lruvec, int type) +{ + int gen, zone; + int remaining = MAX_BATCH_SIZE; + struct lrugen *lrugen = &lruvec->evictable; + + VM_BUG_ON(!seq_is_valid(lruvec)); + + if (get_nr_gens(lruvec, type) != MAX_NR_GENS) + return true; + + gen = lru_gen_from_seq(lrugen->min_seq[type]); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + struct list_head *head = &lrugen->lists[gen][type][zone]; + + while (!list_empty(head)) { + struct page *page = lru_to_page(head); + + VM_BUG_ON_PAGE(PageTail(page), page); + VM_BUG_ON_PAGE(PageUnevictable(page), page); + VM_BUG_ON_PAGE(PageActive(page), page); + VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page); + VM_BUG_ON_PAGE(page_zonenum(page) != zone, page); + + prefetchw_prev_lru_page(page, head, flags); + + page_inc_gen(page, lruvec, false); + + if (!--remaining) + return false; + } + + VM_BUG_ON(lrugen->sizes[gen][type][zone]); + } + + reset_controller_pos(lruvec, gen, type); + WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); + + return true; +} + +static bool try_to_inc_min_seq(struct lruvec *lruvec, int type) +{ + int gen, zone; + bool success = false; + struct lrugen *lrugen = &lruvec->evictable; + + VM_BUG_ON(!seq_is_valid(lruvec)); + + while (get_nr_gens(lruvec, type) > MIN_NR_GENS) { + gen = lru_gen_from_seq(lrugen->min_seq[type]); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + if (!list_empty(&lrugen->lists[gen][type][zone])) + return success; + } + + reset_controller_pos(lruvec, gen, type); + WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); + + success = true; + } + + return success; +} + +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq) +{ + int gen, type, zone; + struct lrugen *lrugen = &lruvec->evictable; + + spin_lock_irq(&lruvec->lru_lock); + + VM_BUG_ON(!seq_is_valid(lruvec)); + + if (max_seq != lrugen->max_seq) + goto unlock; + + for (type = 0; type < ANON_AND_FILE; type++) { + if (try_to_inc_min_seq(lruvec, type)) + continue; + + while (!inc_min_seq(lruvec, type)) { + spin_unlock_irq(&lruvec->lru_lock); + cond_resched(); + spin_lock_irq(&lruvec->lru_lock); + } + } + + gen = lru_gen_from_seq(lrugen->max_seq - 1); + for_each_type_zone(type, zone) { + enum lru_list lru = type * LRU_FILE; + long total = lrugen->sizes[gen][type][zone]; + + if (!total) + continue; + + WARN_ON_ONCE(total != (int)total); + + update_lru_size(lruvec, lru, zone, total); + update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -total); + } + + gen = lru_gen_from_seq(lrugen->max_seq + 1); + for_each_type_zone(type, zone) { + VM_BUG_ON(lrugen->sizes[gen][type][zone]); + VM_BUG_ON(!list_empty(&lrugen->lists[gen][type][zone])); + } + + for (type = 0; type < ANON_AND_FILE; type++) + reset_controller_pos(lruvec, gen, type); + + WRITE_ONCE(lrugen->timestamps[gen], jiffies); + /* make sure all preceding modifications appear first */ + smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); +unlock: + spin_unlock_irq(&lruvec->lru_lock); +} + +/* Main function used by the foreground, the background and the user-triggered aging. */ +static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, + struct scan_control *sc, int swappiness) +{ + bool last; + struct mm_walk_args *args; + struct mm_struct *mm = NULL; + struct lrugen *lrugen = &lruvec->evictable; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + int nid = pgdat->node_id; + + VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq)); + + /* + * If we are not from run_aging() and clearing the accessed bit may + * trigger page faults, then don't proceed to clearing all accessed + * PTEs. Instead, fallback to lru_gen_scan_around(), which only clears a + * handful of accessed PTEs. This is less efficient but causes fewer + * page faults on CPUs that don't have the capability. + */ + if ((current->flags & PF_MEMALLOC) && !arch_has_hw_pte_young()) { + inc_max_seq(lruvec, max_seq); + return true; + } + + args = alloc_mm_walk_args(nid); + if (!args) + return false; + + args->memcg = memcg; + args->max_seq = max_seq; + args->start_pfn = pgdat->node_start_pfn; + args->end_pfn = pgdat_end_pfn(pgdat); + args->node_id = nid; + args->swappiness = swappiness; + + do { + last = get_next_mm(args, &mm); + if (mm) + walk_mm(args, mm); + + cond_resched(); + } while (mm); + + free_mm_walk_args(args); + + if (!last) { + /* don't wait unless we may have trouble reclaiming */ + if (!current_is_kswapd() && sc->priority < DEF_PRIORITY - 2) + wait_event_killable(mm_list->nodes[nid].wait, + max_seq < READ_ONCE(lrugen->max_seq)); + + return max_seq < READ_ONCE(lrugen->max_seq); + } + + VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq)); + + inc_max_seq(lruvec, max_seq); + /* either we see any waiters or they will see updated max_seq */ + if (wq_has_sleeper(&mm_list->nodes[nid].wait)) + wake_up_all(&mm_list->nodes[nid].wait); + + wakeup_flusher_threads(WB_REASON_VMSCAN); + + return true; +} + +/* Protect the working set accessed within the last N milliseconds. */ +static unsigned long lru_gen_min_ttl; + +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) +{ + struct mem_cgroup *memcg; + + VM_BUG_ON(!current_is_kswapd()); + + if (sc->file_is_tiny && mutex_trylock(&oom_lock)) { + struct oom_control oc = { + .gfp_mask = sc->gfp_mask, + .order = sc->order, + }; + + /* to avoid overkilling */ + if (!oom_reaping_in_progress()) + out_of_memory(&oc); + + mutex_unlock(&oom_lock); + } + + if (READ_ONCE(lru_gen_min_ttl)) + sc->file_is_tiny = 1; + + if (!mem_cgroup_disabled() && !sc->force_deactivate) { + sc->force_deactivate = 1; + return; + } + + sc->force_deactivate = 0; + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + int swappiness = get_swappiness(memcg); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (get_lo_wmark(max_seq, min_seq, swappiness) == MIN_NR_GENS) + try_to_inc_max_seq(lruvec, max_seq, sc, swappiness); + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); +} + +#define NR_TO_SCAN (SWAP_CLUSTER_MAX * 2) +#define SIZE_TO_SCAN (NR_TO_SCAN * PAGE_SIZE) + +/* Scan the vicinity of an accessed PTE when shrink_page_list() uses the rmap. */ +void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw) +{ + int i; + pte_t *pte; + struct page *page; + int old_gen, new_gen; + unsigned long start; + unsigned long end; + unsigned long addr; + struct mem_cgroup *memcg = page_memcg(pvmw->page); + struct pglist_data *pgdat = page_pgdat(pvmw->page); + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + unsigned long bitmap[BITS_TO_LONGS(NR_TO_SCAN)] = {}; + + lockdep_assert_held(pvmw->ptl); + VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page); + + start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); + end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end); + + if (end - start > SIZE_TO_SCAN) { + if (pvmw->address - start < SIZE_TO_SCAN / 2) + end = start + SIZE_TO_SCAN; + else if (end - pvmw->address < SIZE_TO_SCAN / 2) + start = end - SIZE_TO_SCAN; + else { + start = pvmw->address - SIZE_TO_SCAN / 2; + end = pvmw->address + SIZE_TO_SCAN / 2; + } + } + + pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE; + new_gen = lru_gen_from_seq(READ_ONCE(lruvec->evictable.max_seq)); + + rcu_read_lock(); + arch_enter_lazy_mmu_mode(); + + for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { + unsigned long pfn = pte_pfn(pte[i]); + + if (!pte_present(pte[i]) || is_zero_pfn(pfn)) + continue; + + if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i]))) + continue; + + if (!pte_young(pte[i])) + continue; + + VM_BUG_ON(!pfn_valid(pfn)); + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + continue; + + page = compound_head(pfn_to_page(pfn)); + if (page_to_nid(page) != pgdat->node_id) + continue; + + if (page_memcg_rcu(page) != memcg) + continue; + + VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end); + if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) + continue; + + old_gen = page_lru_gen(page); + if (old_gen < 0) + SetPageReferenced(page); + else if (old_gen != new_gen) + __set_bit(i, bitmap); + + if (pte_dirty(pte[i]) && !PageDirty(page) && + !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page))) + set_page_dirty(page); + } + + arch_leave_lazy_mmu_mode(); + rcu_read_unlock(); + + if (bitmap_weight(bitmap, NR_TO_SCAN) < PAGEVEC_SIZE) { + for_each_set_bit(i, bitmap, NR_TO_SCAN) + activate_page(pte_page(pte[i])); + return; + } + + lock_page_memcg(pvmw->page); + spin_lock_irq(&lruvec->lru_lock); + + new_gen = lru_gen_from_seq(lruvec->evictable.max_seq); + + for_each_set_bit(i, bitmap, NR_TO_SCAN) { + page = compound_head(pte_page(pte[i])); + if (page_memcg_rcu(page) != memcg) + continue; + + old_gen = page_update_gen(page, new_gen); + if (old_gen >= 0 && old_gen != new_gen) + lru_gen_update_size(page, lruvec, old_gen, new_gen); + } + + spin_unlock_irq(&lruvec->lru_lock); + unlock_page_memcg(pvmw->page); +} + /****************************************************************************** * state change ******************************************************************************/ @@ -3392,9 +4271,18 @@ static int __meminit __maybe_unused mem_notifier(struct notifier_block *self, pgdat = NODE_DATA(nid); + if (action == MEM_CANCEL_ONLINE || action == MEM_OFFLINE) { + free_mm_walk_args(pgdat->mm_walk_args); + pgdat->mm_walk_args = NULL; + return NOTIFY_DONE; + } + if (action != MEM_GOING_ONLINE) return NOTIFY_DONE; + if (!WARN_ON_ONCE(pgdat->mm_walk_args)) + pgdat->mm_walk_args = alloc_mm_walk_args(NUMA_NO_NODE); + mutex_lock(&lru_gen_state_mutex); cgroup_lock(); @@ -3443,6 +4331,10 @@ static int __init init_lru_gen(void) BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1); + VM_BUG_ON(PMD_SIZE / PAGE_SIZE != PTRS_PER_PTE); + VM_BUG_ON(PUD_SIZE / PMD_SIZE != PTRS_PER_PMD); + VM_BUG_ON(P4D_SIZE / PUD_SIZE != PTRS_PER_PUD); + if (mem_cgroup_disabled()) { global_mm_list = alloc_mm_list(); if (!global_mm_list) @@ -3460,6 +4352,12 @@ static int __init init_lru_gen(void) */ arch_initcall(init_lru_gen); +#else /* CONFIG_LRU_GEN */ + +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) +{ +} + #endif /* CONFIG_LRU_GEN */ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) @@ -4313,6 +5211,11 @@ static void age_active_anon(struct pglist_data *pgdat, struct mem_cgroup *memcg; struct lruvec *lruvec; + if (lru_gen_enabled()) { + lru_gen_age_node(pgdat, sc); + return; + } + if (!total_swap_pages) return; From patchwork Wed Aug 18 06:31:04 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443393 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D3048C4320A for ; Wed, 18 Aug 2021 06:31:31 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 777DD6109F for ; Wed, 18 Aug 2021 06:31:31 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 777DD6109F Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id DA3CF6B0081; Wed, 18 Aug 2021 02:31:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D2DD26B0082; Wed, 18 Aug 2021 02:31:23 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AE3B36B0083; Wed, 18 Aug 2021 02:31:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0107.hostedemail.com [216.40.44.107]) by kanga.kvack.org (Postfix) with ESMTP id 899016B0081 for ; Wed, 18 Aug 2021 02:31:23 -0400 (EDT) Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 4707A22873 for ; Wed, 18 Aug 2021 06:31:23 +0000 (UTC) X-FDA: 78487229646.02.8986346 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf25.hostedemail.com (Postfix) with ESMTP id DC39CB0025A3 for ; Wed, 18 Aug 2021 06:31:22 +0000 (UTC) Received: by mail-yb1-f202.google.com with SMTP id f8-20020a2585480000b02905937897e3daso1836069ybn.2 for ; Tue, 17 Aug 2021 23:31:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=VNwpW6WIdnuG3AYWvj+4AM1ncawHlKx5Ng/ARryIXuE=; b=i8Y2GYtpM7PPZh/DNiRTva/K7VOlAWL7Ipv4DJKZysp0znS369tGr1/sEesSHeya08 K5N3BQXLn6KTOTQ1GM9AGZL8I2YcJ+spHm/Fd0/7XtOEqFBYWXAl5le7jGBKaofufkI5 OzSvHbtp/R/KoG1aqvuWt2aHIBin+CFZizS36WYIrEQOxgiYdmGy4f94pW60l+gC0sl5 ux75gso95f348dzFl7neo53KnmIr5C/uyENxzS7cQAKozMuIrA4mXDfWfEtQrANan8mN fpWxhXrxC4FQZQowBtIcWfx764xAqyz3q/FQ3bbmvl0367t7gwyEWzPrYWyg6b1SZUaw lzjw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=VNwpW6WIdnuG3AYWvj+4AM1ncawHlKx5Ng/ARryIXuE=; b=QRZyLmsNYwF0G4p+6W6dnEdWUCtb/k6TLn+fza2RmDFlDQzYj/z/K75dRWnijFtZ/g sDr6aLPKaeXDBOih0+1sUAOms00rTydOphPq3pIB9oCfriLXcj0JyUlV+QZPxeBf2Q8G ltOZ4u9lMZ0TnxLA1v9fV5AqtENiWeCyX9O/j8IgByjvrAkCkNYOjtsjV4R0oYlx+Nat lUV5Y2lfqNgeXuEM05gOmjf8DHI/pykyrIG2KdsZ99OPhyoMd31LbFPJsbkSRX+8Lu8B reGJaQ//gT2H4sck9L3uos/N26KynTflPu9F8V0lYn7pONGjTveCL7JBU53tKptPxpCA KjVQ== X-Gm-Message-State: AOAM531aO0v6XHuA5x1ffnqeCTP67WiycMa2UISqnIxuG6dnkBncuG/W kUh2p4uOGPMG12dciqFDbTYgIggeppILvAq9xdo2Em9VECUTKcVrKNgpTc7AH4SsdnrRRH+Z7Se p2pMKoLF+bJbcm94amVWkSlZX0F/LAeR5n5I+hUEwQF8QlmJn9wgz90Aa X-Google-Smtp-Source: ABdhPJz7Rwe9d+2oKKXDKvePgtoi9ZiFTYPt74TFugQ+Cu9epD8Yd7UFPRBaiMGC6Io319u3NJRTQDCBQrA= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a5b:4cf:: with SMTP id u15mr9608628ybp.118.1629268282152; Tue, 17 Aug 2021 23:31:22 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:04 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-9-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 08/11] mm: multigenerational lru: eviction From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov X-Rspamd-Queue-Id: DC39CB0025A3 Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=i8Y2GYtp; spf=pass (imf25.hostedemail.com: domain of 3OqkcYQYKCA8D9Ewp3v33v0t.r310x29C-11zAprz.36v@flex--yuzhao.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3OqkcYQYKCA8D9Ewp3v33v0t.r310x29C-11zAprz.36v@flex--yuzhao.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam01 X-Stat-Signature: unw5omaufip3b6z17qs8rzm3rhywckaz X-HE-Tag: 1629268282-161513 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The eviction consumes old generations. Given an lruvec, the eviction scans pages on lrugen->lists indexed by anon and file min_seq[2] (modulo MAX_NR_GENS). It first tries to select a type based on the values of min_seq[2]. If they are equal, it selects the type that has a lower refault rate. The eviction sorts a page according to its updated generation number if the aging has found this page accessed. It also moves a page to the next generation if this page is from an upper tier that has a higher refault rate than the base tier. The eviction increments min_seq[2] of a selected type when it finds lrugen->lists indexed by min_seq[2] of this selected type are empty. With the aging and the eviction in place, implementing page reclaim becomes quite straightforward: 1) To reduce the latency, direct reclaim skips the aging unless both min_seq[2] are equal to max_seq-1. Then it invokes the eviction. 2) To avoid the aging in the direct reclaim path, kswapd invokes the aging if either of min_seq[2] is equal to max_seq-1. Then it invokes the eviction. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- mm/vmscan.c | 440 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 440 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 757ba4f415cc..2f1fffbd2d61 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1311,6 +1311,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, if (!sc->may_unmap && page_mapped(page)) goto keep_locked; + /* lru_gen_scan_around() has updated this page? */ + if (lru_gen_enabled() && !ignore_references && + page_mapped(page) && PageReferenced(page)) + goto keep_locked; + may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); @@ -2447,6 +2452,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) unsigned long file; struct lruvec *target_lruvec; + if (lru_gen_enabled()) + return; + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); /* @@ -4087,6 +4095,426 @@ void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw) unlock_page_memcg(pvmw->page); } +/****************************************************************************** + * the eviction + ******************************************************************************/ + +static bool sort_page(struct page *page, struct lruvec *lruvec, int tier_to_isolate) +{ + bool success; + int gen = page_lru_gen(page); + int type = page_is_file_lru(page); + int zone = page_zonenum(page); + int tier = page_lru_tier(page); + struct lrugen *lrugen = &lruvec->evictable; + + VM_BUG_ON_PAGE(gen < 0, page); + VM_BUG_ON_PAGE(tier_to_isolate < 0, page); + + /* a lazy-free page that has been written into? */ + if (type && PageDirty(page) && PageAnon(page)) { + success = lru_gen_del_page(page, lruvec, false); + VM_BUG_ON_PAGE(!success, page); + SetPageSwapBacked(page); + add_page_to_lru_list_tail(page, lruvec); + return true; + } + + /* page_update_gen() has updated this page? */ + if (gen != lru_gen_from_seq(lrugen->min_seq[type])) { + list_move(&page->lru, &lrugen->lists[gen][type][zone]); + return true; + } + + /* protect this page if its tier has a higher refault rate */ + if (tier_to_isolate < tier) { + int hist = lru_hist_from_seq(gen); + + page_inc_gen(page, lruvec, false); + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], + lrugen->protected[hist][type][tier - 1] + thp_nr_pages(page)); + inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type); + return true; + } + + /* mark this page for reclaim if it's pending writeback */ + if (PageWriteback(page) || (type && PageDirty(page))) { + page_inc_gen(page, lruvec, true); + return true; + } + + return false; +} + +static bool isolate_page(struct page *page, struct lruvec *lruvec, struct scan_control *sc) +{ + bool success; + + if (!sc->may_unmap && page_mapped(page)) + return false; + + if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && + (PageDirty(page) || (PageAnon(page) && !PageSwapCache(page)))) + return false; + + if (!get_page_unless_zero(page)) + return false; + + if (!TestClearPageLRU(page)) { + put_page(page); + return false; + } + + success = lru_gen_del_page(page, lruvec, true); + VM_BUG_ON_PAGE(!success, page); + + return true; +} + +static int scan_pages(struct lruvec *lruvec, struct scan_control *sc, long *nr_to_scan, + int type, int tier, struct list_head *list) +{ + bool success; + int gen, zone; + enum vm_event_item item; + int sorted = 0; + int scanned = 0; + int isolated = 0; + int remaining = MAX_BATCH_SIZE; + struct lrugen *lrugen = &lruvec->evictable; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + VM_BUG_ON(!list_empty(list)); + + if (get_nr_gens(lruvec, type) == MIN_NR_GENS) + return -ENOENT; + + gen = lru_gen_from_seq(lrugen->min_seq[type]); + + for (zone = sc->reclaim_idx; zone >= 0; zone--) { + LIST_HEAD(moved); + int skipped = 0; + struct list_head *head = &lrugen->lists[gen][type][zone]; + + while (!list_empty(head)) { + struct page *page = lru_to_page(head); + int delta = thp_nr_pages(page); + + VM_BUG_ON_PAGE(PageTail(page), page); + VM_BUG_ON_PAGE(PageUnevictable(page), page); + VM_BUG_ON_PAGE(PageActive(page), page); + VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page); + VM_BUG_ON_PAGE(page_zonenum(page) != zone, page); + + prefetchw_prev_lru_page(page, head, flags); + + scanned += delta; + + if (sort_page(page, lruvec, tier)) + sorted += delta; + else if (isolate_page(page, lruvec, sc)) { + list_add(&page->lru, list); + isolated += delta; + } else { + list_move(&page->lru, &moved); + skipped += delta; + } + + if (!--remaining) + break; + + if (max(isolated, skipped) >= SWAP_CLUSTER_MAX) + break; + } + + list_splice(&moved, head); + __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); + + if (!remaining || isolated >= SWAP_CLUSTER_MAX) + break; + } + + success = try_to_inc_min_seq(lruvec, type); + + *nr_to_scan -= scanned; + + item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; + if (!cgroup_reclaim(sc)) { + __count_vm_events(item, isolated); + __count_vm_events(PGREFILL, sorted); + } + __count_memcg_events(memcg, item, isolated); + __count_memcg_events(memcg, PGREFILL, sorted); + __count_vm_events(PGSCAN_ANON + type, isolated); + + if (isolated) + return isolated; + /* + * We may have trouble finding eligible pages due to reclaim_idx, + * may_unmap and may_writepage. The following check makes sure we won't + * be stuck if we aren't making enough progress. + */ + return !remaining || success || *nr_to_scan <= 0 ? 0 : -ENOENT; +} + +static int get_tier_to_isolate(struct lruvec *lruvec, int type) +{ + int tier; + struct controller_pos sp, pv; + + /* + * Ideally we don't want to evict upper tiers that have higher refault + * rates. However, we need to leave a margin for the fluctuations in + * refault rates. So we use a larger gain factor to make sure upper + * tiers are indeed more active. We choose 2 because the lowest upper + * tier would have twice of the refault rate of the base tier, according + * to their numbers of accesses. + */ + read_controller_pos(&sp, lruvec, type, 0, 1); + for (tier = 1; tier < MAX_NR_TIERS; tier++) { + read_controller_pos(&pv, lruvec, type, tier, 2); + if (!positive_ctrl_err(&sp, &pv)) + break; + } + + return tier - 1; +} + +static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_to_isolate) +{ + int type, tier; + struct controller_pos sp, pv; + int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; + + /* + * Compare the refault rates between the base tiers of anon and file to + * determine which type to evict. Also need to compare the refault rates + * of the upper tiers of the selected type with that of the base tier of + * the other type to determine which tier of the selected type to evict. + */ + read_controller_pos(&sp, lruvec, 0, 0, gain[0]); + read_controller_pos(&pv, lruvec, 1, 0, gain[1]); + type = positive_ctrl_err(&sp, &pv); + + read_controller_pos(&sp, lruvec, !type, 0, gain[!type]); + for (tier = 1; tier < MAX_NR_TIERS; tier++) { + read_controller_pos(&pv, lruvec, type, tier, gain[type]); + if (!positive_ctrl_err(&sp, &pv)) + break; + } + + *tier_to_isolate = tier - 1; + + return type; +} + +static int isolate_pages(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + long *nr_to_scan, int *type_to_scan, struct list_head *list) +{ + int i; + int type; + int isolated; + int tier = -1; + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + VM_BUG_ON(!seq_is_valid(lruvec)); + + if (get_hi_wmark(max_seq, min_seq, swappiness) == MIN_NR_GENS) + return 0; + /* + * Try to select a type based on generations and swappiness, and if that + * fails, fall back to get_type_to_scan(). When anon and file are both + * available from the same generation, swappiness 200 is interpreted as + * anon first and swappiness 1 is interpreted as file first. + */ + if (!swappiness) + type = 1; + else if (min_seq[0] > min_seq[1]) + type = 1; + else if (min_seq[0] < min_seq[1]) + type = 0; + else if (swappiness == 1) + type = 1; + else if (swappiness == 200) + type = 0; + else + type = get_type_to_scan(lruvec, swappiness, &tier); + + if (tier == -1) + tier = get_tier_to_isolate(lruvec, type); + + for (i = !swappiness; i < ANON_AND_FILE; i++) { + isolated = scan_pages(lruvec, sc, nr_to_scan, type, tier, list); + if (isolated >= 0) + break; + + type = !type; + tier = get_tier_to_isolate(lruvec, type); + } + + if (isolated < 0) + isolated = *nr_to_scan = 0; + + *type_to_scan = type; + + return isolated; +} + +/* Main function used by the foreground, the background and the user-triggered eviction. */ +static bool evict_pages(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + long *nr_to_scan) +{ + int type; + int isolated; + int reclaimed; + LIST_HEAD(list); + struct page *page; + enum vm_event_item item; + struct reclaim_stat stat; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + + spin_lock_irq(&lruvec->lru_lock); + + isolated = isolate_pages(lruvec, sc, swappiness, nr_to_scan, &type, &list); + VM_BUG_ON(list_empty(&list) == !!isolated); + + if (isolated) + __mod_node_page_state(pgdat, NR_ISOLATED_ANON + type, isolated); + + spin_unlock_irq(&lruvec->lru_lock); + + if (!isolated) + goto done; + + reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); + /* + * We need to prevent rejected pages from being added back to the same + * lists they were isolated from. Otherwise we may risk looping on them + * forever. + */ + list_for_each_entry(page, &list, lru) { + if (!page_evictable(page)) + continue; + + if (!PageReclaim(page) || !(PageDirty(page) || PageWriteback(page))) + SetPageActive(page); + + ClearPageReferenced(page); + ClearPageWorkingset(page); + } + + spin_lock_irq(&lruvec->lru_lock); + + move_pages_to_lru(lruvec, &list); + + __mod_node_page_state(pgdat, NR_ISOLATED_ANON + type, -isolated); + + item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; + if (!cgroup_reclaim(sc)) + __count_vm_events(item, reclaimed); + __count_memcg_events(memcg, item, reclaimed); + __count_vm_events(PGSTEAL_ANON + type, reclaimed); + + spin_unlock_irq(&lruvec->lru_lock); + + mem_cgroup_uncharge_list(&list); + free_unref_page_list(&list); + + sc->nr_reclaimed += reclaimed; +done: + return *nr_to_scan > 0 && sc->nr_reclaimed < sc->nr_to_reclaim; +} + +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness) +{ + int gen, type, zone; + int nr_gens; + long nr_to_scan = 0; + struct lrugen *lrugen = &lruvec->evictable; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + for (type = !swappiness; type < ANON_AND_FILE; type++) { + unsigned long seq; + + for (seq = min_seq[type]; seq <= max_seq; seq++) { + gen = lru_gen_from_seq(seq); + + for (zone = 0; zone <= sc->reclaim_idx; zone++) + nr_to_scan += READ_ONCE(lrugen->sizes[gen][type][zone]); + } + } + + if (nr_to_scan <= 0) + return 0; + + nr_gens = get_hi_wmark(max_seq, min_seq, swappiness); + + if (current_is_kswapd()) { + gen = lru_gen_from_seq(max_seq - nr_gens + 1); + if (time_is_before_eq_jiffies(READ_ONCE(lrugen->timestamps[gen]) + + READ_ONCE(lru_gen_min_ttl))) + sc->file_is_tiny = 0; + + /* leave the work to lru_gen_age_node() */ + if (nr_gens == MIN_NR_GENS) + return 0; + + if (nr_to_scan >= sc->nr_to_reclaim) + sc->force_deactivate = 0; + } + + nr_to_scan = max(nr_to_scan >> sc->priority, (long)!mem_cgroup_online(memcg)); + if (!nr_to_scan || nr_gens > MIN_NR_GENS) + return nr_to_scan; + + /* move onto other memcgs if we haven't tried them all yet */ + if (!mem_cgroup_disabled() && !sc->force_deactivate) { + sc->skipped_deactivate = 1; + return 0; + } + + return try_to_inc_max_seq(lruvec, max_seq, sc, swappiness) ? nr_to_scan : 0; +} + +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ + struct blk_plug plug; + long scanned = 0; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + lru_add_drain(); + + blk_start_plug(&plug); + + while (true) { + long nr_to_scan; + int swappiness = sc->may_swap ? get_swappiness(memcg) : 0; + + nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness) - scanned; + if (nr_to_scan <= 0) + break; + + scanned += nr_to_scan; + + if (!evict_pages(lruvec, sc, swappiness, &nr_to_scan)) + break; + + scanned -= nr_to_scan; + + if (mem_cgroup_below_min(memcg) || + (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) + break; + + cond_resched(); + } + + blk_finish_plug(&plug); +} + /****************************************************************************** * state change ******************************************************************************/ @@ -4358,6 +4786,10 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) { } +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ +} + #endif /* CONFIG_LRU_GEN */ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) @@ -4371,6 +4803,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) struct blk_plug plug; bool scan_adjusted; + if (lru_gen_enabled()) { + lru_gen_shrink_lruvec(lruvec, sc); + return; + } + get_scan_count(lruvec, sc, nr); /* Record the original scan target for proportional adjustments later */ @@ -4837,6 +5274,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) struct lruvec *target_lruvec; unsigned long refaults; + if (lru_gen_enabled()) + return; + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); target_lruvec->refaults[0] = refaults; From patchwork Wed Aug 18 06:31:05 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443395 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B02A6C4338F for ; Wed, 18 Aug 2021 06:31:34 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 514516103A for ; Wed, 18 Aug 2021 06:31:34 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 514516103A Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 1BC966B0082; Wed, 18 Aug 2021 02:31:25 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 148886B0083; Wed, 18 Aug 2021 02:31:25 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EDB7E8D0001; Wed, 18 Aug 2021 02:31:24 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0122.hostedemail.com [216.40.44.122]) by kanga.kvack.org (Postfix) with ESMTP id CF7766B0082 for ; Wed, 18 Aug 2021 02:31:24 -0400 (EDT) Received: from smtpin27.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 855BB1842C4BE for ; Wed, 18 Aug 2021 06:31:24 +0000 (UTC) X-FDA: 78487229688.27.63DECA5 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) by imf03.hostedemail.com (Postfix) with ESMTP id 4315030039A3 for ; Wed, 18 Aug 2021 06:31:24 +0000 (UTC) Received: by mail-yb1-f201.google.com with SMTP id d69-20020a25e6480000b02904f4a117bd74so1786362ybh.17 for ; Tue, 17 Aug 2021 23:31:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=Sa/+kI2QQ5yzW+73rZnyPuODQVSPnlfsCdgMQjH8k3w=; b=p/kWK0lL+i7CF7DanrRtUfMJFmmz8fQT4ClShk8ocdN5YWEXJPS5ufC7DcnocXjud4 mfpRwtLsUt85zDj/Vel9Sy3Jjj3hox0+z3qxcKB3JAPlTaBCLqAbVsujMvLixzDj8xpA 7YbURlMIa/XunHEgChappMMLMe1W0v1tpJNhoqYo6dtXDFTS7fgcidPpm1UdfhuQt+Cm GUlkTg8AKW4VeZqM8zlDD46cHZQg9YvXeKCIo/LYj1mk7JTr8d8wRdeibTuEBDC5GgLg sFPtmD9SiU8NbM6KRor52pD3S3prQeHrpScJtPebeGIPNqCVE2YafoCfIG0lG5wFHL7H fyuw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=Sa/+kI2QQ5yzW+73rZnyPuODQVSPnlfsCdgMQjH8k3w=; b=Gg7WAcEuFJjIqPfsy7dFA0EDBhB5GOqUbL1z47Xgiv6tAvHtQYeEpHw/2v7Qk6EDhM 1iywY6ESesPeqxD6G2GGyUfugp4PX7XyqXOEY5B4MfArZOzGvOU6ekMimKd5D3TGuzTN ePo2+D6fFIfIAfgKxKnJZKujnhNJ0yD2Jhe+WGXh2SHNS56BnVHx2yxD5IdhV2k0Nspv BGrwr2gyRBJ8gx06nEkvT8WeVSYjyUIJDBrJ9jRkVDd0SYA84cpn3f+gv7REeB/VLDLj LVNyAUEq2MiS4JEz08SATFl5J0NPssI6PJfyri9CexnXs+ocn+mpSpMPxSartPFAXaJ2 /g3g== X-Gm-Message-State: AOAM530qMYJEMg8VzkiRKR3/s4rbAsfzh6ygUrzbXFUvWV7MtfR834TG TxDICFg/pD3oJcIYpJwVj/+Pe4uO+2a1Z6zz2rMrP82K1q7L/8xe47r0rR9M2Z6pJGogycICk2k kp6VBJ+DYiHYUCOZswEW6sPpaLibf7LdCP/fggG/QnDkqn4FBNJPTJoAe X-Google-Smtp-Source: ABdhPJz8Y4PWzvx36wd+Xufm5SH2b/iFjdnqIog9P1g4iCjFBZjzAPAITo1V9qboEamgEU0upnvDj06FKHU= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:bc82:: with SMTP id e2mr8957312ybk.307.1629268283542; Tue, 17 Aug 2021 23:31:23 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:05 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-10-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 09/11] mm: multigenerational lru: user interface From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov X-Rspamd-Queue-Id: 4315030039A3 Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b="p/kWK0lL"; spf=pass (imf03.hostedemail.com: domain of 3O6kcYQYKCBAEAFxq4w44w1u.s421y3AD-220Bqs0.47w@flex--yuzhao.bounces.google.com designates 209.85.219.201 as permitted sender) smtp.mailfrom=3O6kcYQYKCBAEAFxq4w44w1u.s421y3AD-220Bqs0.47w@flex--yuzhao.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam01 X-Stat-Signature: 4pwa8jkozqtc6scuyzxpkd8sxheuqpu6 X-HE-Tag: 1629268284-50209 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add /sys/kernel/mm/lru_gen/enabled to enable and disable the multigenerational lru at runtime. Add /sys/kernel/mm/lru_gen/min_ttl_ms to protect the working set of a given number of milliseconds. The OOM killer is invoked if this working set cannot be kept in memory. Add /sys/kernel/debug/lru_gen to monitor the multigenerational lru and invoke the aging and the eviction. This file has the following output: memcg memcg_id memcg_path node node_id min_gen birth_time anon_size file_size ... max_gen birth_time anon_size file_size min_gen is the oldest generation number and max_gen is the youngest generation number. birth_time is in milliseconds. anon_size and file_size are in pages. This file takes the following input: + memcg_id node_id max_gen [swappiness] - memcg_id node_id min_gen [swappiness] [nr_to_reclaim] The first command line invokes the aging, which scans PTEs for accessed pages and then creates the next generation max_gen+1. A swap file and a non-zero swappiness, which overrides vm.swappiness, are required to scan PTEs mapping anon pages. The second command line invokes the eviction, which evicts generations less than or equal to min_gen. min_gen should be less than max_gen-1 as max_gen and max_gen-1 are not fully aged and therefore cannot be evicted. nr_to_reclaim can be used to limit the number of pages to evict. Multiple command lines are supported, as is concatenation with delimiters "," and ";". Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- include/linux/nodemask.h | 1 + mm/vmscan.c | 412 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 413 insertions(+) diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 567c3ddba2c4..90840c459abc 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state) #define first_online_node 0 #define first_memory_node 0 #define next_online_node(nid) (MAX_NUMNODES) +#define next_memory_node(nid) (MAX_NUMNODES) #define nr_node_ids 1U #define nr_online_nodes 1U diff --git a/mm/vmscan.c b/mm/vmscan.c index 2f1fffbd2d61..c6d539a73d00 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -52,6 +52,8 @@ #include #include #include +#include +#include #include #include @@ -4732,6 +4734,410 @@ static int __meminit __maybe_unused mem_notifier(struct notifier_block *self, return NOTIFY_DONE; } +/****************************************************************************** + * sysfs interface + ******************************************************************************/ + +static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); +} + +static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int msecs; + + if (kstrtouint(buf, 10, &msecs)) + return -EINVAL; + + WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); + + return len; +} + +static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( + min_ttl_ms, 0644, show_min_ttl, store_min_ttl +); + +static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%d\n", lru_gen_enabled()); +} + +static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + int enable; + + if (kstrtoint(buf, 10, &enable)) + return -EINVAL; + + lru_gen_set_state(enable, true, false); + + return len; +} + +static struct kobj_attribute lru_gen_enabled_attr = __ATTR( + enabled, 0644, show_enable, store_enable +); + +static struct attribute *lru_gen_attrs[] = { + &lru_gen_min_ttl_attr.attr, + &lru_gen_enabled_attr.attr, + NULL +}; + +static struct attribute_group lru_gen_attr_group = { + .name = "lru_gen", + .attrs = lru_gen_attrs, +}; + +/****************************************************************************** + * debugfs interface + ******************************************************************************/ + +static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos) +{ + struct mem_cgroup *memcg; + loff_t nr_to_skip = *pos; + + m->private = kvmalloc(PATH_MAX, GFP_KERNEL); + if (!m->private) + return ERR_PTR(-ENOMEM); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node_state(nid, N_MEMORY) { + if (!nr_to_skip--) + return mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + } + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + return NULL; +} + +static void lru_gen_seq_stop(struct seq_file *m, void *v) +{ + if (!IS_ERR_OR_NULL(v)) + mem_cgroup_iter_break(NULL, lruvec_memcg(v)); + + kvfree(m->private); + m->private = NULL; +} + +static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + int nid = lruvec_pgdat(v)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(v); + + ++*pos; + + nid = next_memory_node(nid); + if (nid == MAX_NUMNODES) { + memcg = mem_cgroup_iter(NULL, memcg, NULL); + if (!memcg) + return NULL; + + nid = first_memory_node; + } + + return mem_cgroup_lruvec(memcg, NODE_DATA(nid)); +} + +static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, + unsigned long max_seq, unsigned long *min_seq, + unsigned long seq) +{ + int i; + int type, tier; + int hist = lru_hist_from_seq(seq); + struct lrugen *lrugen = &lruvec->evictable; + int nid = lruvec_pgdat(lruvec)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + + for (tier = 0; tier < MAX_NR_TIERS; tier++) { + seq_printf(m, " %10d", tier); + for (type = 0; type < ANON_AND_FILE; type++) { + unsigned long n[3] = {}; + + if (seq == max_seq) { + n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]); + n[1] = READ_ONCE(lrugen->avg_total[type][tier]); + + seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]); + } else if (seq == min_seq[type] || NR_STAT_GENS > 1) { + n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]); + n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]); + + seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]); + } else + seq_puts(m, " 0 0 0 "); + } + seq_putc(m, '\n'); + } + + seq_puts(m, " "); + for (i = 0; i < NR_MM_STATS; i++) { + if (i == 6) + seq_puts(m, "\n "); + + if (seq == max_seq && NR_STAT_GENS == 1) + seq_printf(m, " %10lu%c", READ_ONCE(mm_list->nodes[nid].stats[hist][i]), + toupper(MM_STAT_CODES[i])); + else if (seq != max_seq && NR_STAT_GENS > 1) + seq_printf(m, " %10lu%c", READ_ONCE(mm_list->nodes[nid].stats[hist][i]), + MM_STAT_CODES[i]); + else + seq_puts(m, " 0 "); + } + seq_putc(m, '\n'); +} + +static int lru_gen_seq_show(struct seq_file *m, void *v) +{ + unsigned long seq; + bool full = !debugfs_real_fops(m->file)->write; + struct lruvec *lruvec = v; + struct lrugen *lrugen = &lruvec->evictable; + int nid = lruvec_pgdat(lruvec)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (nid == first_memory_node) { + const char *path = memcg ? m->private : ""; + +#ifdef CONFIG_MEMCG + if (memcg) + cgroup_path(memcg->css.cgroup, m->private, PATH_MAX); +#endif + seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path); + } + + seq_printf(m, " node %5d\n", nid); + + if (!full) + seq = min(min_seq[0], min_seq[1]); + else if (max_seq >= MAX_NR_GENS) + seq = max_seq - MAX_NR_GENS + 1; + else + seq = 0; + + for (; seq <= max_seq; seq++) { + int gen, type, zone; + unsigned int msecs; + + gen = lru_gen_from_seq(seq); + msecs = jiffies_to_msecs(jiffies - READ_ONCE(lrugen->timestamps[gen])); + + seq_printf(m, " %10lu %10u", seq, msecs); + + for (type = 0; type < ANON_AND_FILE; type++) { + long size = 0; + + if (seq < min_seq[type]) { + seq_puts(m, " -0 "); + continue; + } + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += READ_ONCE(lrugen->sizes[gen][type][zone]); + + seq_printf(m, " %10lu ", max(size, 0L)); + } + + seq_putc(m, '\n'); + + if (full) + lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq); + } + + return 0; +} + +static const struct seq_operations lru_gen_seq_ops = { + .start = lru_gen_seq_start, + .stop = lru_gen_seq_stop, + .next = lru_gen_seq_next, + .show = lru_gen_seq_show, +}; + +static int run_aging(struct lruvec *lruvec, unsigned long seq, int swappiness) +{ + struct scan_control sc = {}; + DEFINE_MAX_SEQ(lruvec); + + if (seq == max_seq) + try_to_inc_max_seq(lruvec, max_seq, &sc, swappiness); + + return seq > max_seq ? -EINVAL : 0; +} + +static int run_eviction(struct lruvec *lruvec, unsigned long seq, int swappiness, + unsigned long nr_to_reclaim) +{ + unsigned int flags; + struct blk_plug plug; + int err = -EINTR; + long nr_to_scan = LONG_MAX; + struct scan_control sc = { + .nr_to_reclaim = nr_to_reclaim, + .may_writepage = 1, + .may_unmap = 1, + .may_swap = 1, + .reclaim_idx = MAX_NR_ZONES - 1, + .gfp_mask = GFP_KERNEL, + }; + DEFINE_MAX_SEQ(lruvec); + + if (seq >= max_seq - 1) + return -EINVAL; + + flags = memalloc_noreclaim_save(); + + blk_start_plug(&plug); + + while (!signal_pending(current)) { + DEFINE_MIN_SEQ(lruvec); + + if (seq < min(min_seq[!swappiness], min_seq[swappiness < 200]) || + !evict_pages(lruvec, &sc, swappiness, &nr_to_scan)) { + err = 0; + break; + } + + cond_resched(); + } + + blk_finish_plug(&plug); + + memalloc_noreclaim_restore(flags); + + return err; +} + +static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, + int swappiness, unsigned long nr_to_reclaim) +{ + struct lruvec *lruvec; + int err = -EINVAL; + struct mem_cgroup *memcg = NULL; + + if (!mem_cgroup_disabled()) { + rcu_read_lock(); + memcg = mem_cgroup_from_id(memcg_id); +#ifdef CONFIG_MEMCG + if (memcg && !css_tryget(&memcg->css)) + memcg = NULL; +#endif + rcu_read_unlock(); + + if (!memcg) + goto done; + } + if (memcg_id != mem_cgroup_id(memcg)) + goto done; + + if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY)) + goto done; + + lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + + if (swappiness == -1) + swappiness = get_swappiness(memcg); + else if (swappiness > 200U) + goto done; + + switch (cmd) { + case '+': + err = run_aging(lruvec, seq, swappiness); + break; + case '-': + err = run_eviction(lruvec, seq, swappiness, nr_to_reclaim); + break; + } +done: + mem_cgroup_put(memcg); + + return err; +} + +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, + size_t len, loff_t *pos) +{ + void *buf; + char *cur, *next; + int err = 0; + + buf = kvmalloc(len + 1, GFP_USER); + if (!buf) + return -ENOMEM; + + if (copy_from_user(buf, src, len)) { + kvfree(buf); + return -EFAULT; + } + + next = buf; + next[len] = '\0'; + + while ((cur = strsep(&next, ",;\n"))) { + int n; + int end; + char cmd; + unsigned int memcg_id; + unsigned int nid; + unsigned long seq; + unsigned int swappiness = -1; + unsigned long nr_to_reclaim = -1; + + cur = skip_spaces(cur); + if (!*cur) + continue; + + n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid, + &seq, &end, &swappiness, &end, &nr_to_reclaim, &end); + if (n < 4 || cur[end]) { + err = -EINVAL; + break; + } + + err = run_cmd(cmd, memcg_id, nid, seq, swappiness, nr_to_reclaim); + if (err) + break; + } + + kvfree(buf); + + return err ? : len; +} + +static int lru_gen_seq_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &lru_gen_seq_ops); +} + +static const struct file_operations lru_gen_rw_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .write = lru_gen_seq_write, + .llseek = seq_lseek, + .release = seq_release, +}; + +static const struct file_operations lru_gen_ro_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + /****************************************************************************** * initialization ******************************************************************************/ @@ -4772,6 +5178,12 @@ static int __init init_lru_gen(void) if (hotplug_memory_notifier(mem_notifier, 0)) pr_err("lru_gen: failed to subscribe hotplug notifications\n"); + if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) + pr_err("lru_gen: failed to create sysfs group\n"); + + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); + debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops); + return 0; }; /* From patchwork Wed Aug 18 06:31:06 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443397 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FCB3C4320E for ; Wed, 18 Aug 2021 06:31:37 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id EA34560720 for ; Wed, 18 Aug 2021 06:31:36 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org EA34560720 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 9FABE6B0083; Wed, 18 Aug 2021 02:31:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9A9E36B0085; Wed, 18 Aug 2021 02:31:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 824228D0001; Wed, 18 Aug 2021 02:31:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0096.hostedemail.com [216.40.44.96]) by kanga.kvack.org (Postfix) with ESMTP id 61AC46B0083 for ; Wed, 18 Aug 2021 02:31:26 -0400 (EDT) Received: from smtpin18.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 0C6338249980 for ; Wed, 18 Aug 2021 06:31:26 +0000 (UTC) X-FDA: 78487229772.18.1012126 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf11.hostedemail.com (Postfix) with ESMTP id C16AFF0058AD for ; Wed, 18 Aug 2021 06:31:25 +0000 (UTC) Received: by mail-yb1-f202.google.com with SMTP id j9-20020a2581490000b02905897d81c63fso1816744ybm.8 for ; Tue, 17 Aug 2021 23:31:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=ObPkY95BgccMVOXBvY8W42OB1HqUP7YDieqWubLPiME=; b=Jtstdp0V+Xh88MSFciGZx/6Crwh6uax0pLuw3AXs+D9oj1KbSd4HzqBIam2qrHZk5g s6jTCXgbdB9PNkW+XseCkH6f/SPnpj+XnPohW32F/qWDIdj7SKz1f0lnCEZ8LmwdB716 Rvc1f56BY2VSkf0TJZ2KH2yJfDtj+1gU2XMmm4u5Bt6jYwhCgsspJjhwVZKFfJUTrCz/ XZVzMwPQYhql8Y8pTREUZGN5i0txpFG5vvusnI8qF9fCezySVG3pknbNJwrstN6Q+nbI 0vRgf5epo1QPD4Pm/Aj6L2XrjBdYdKaX+yhcs/uRqo5Y444dTY1fVQPIgzKndV2ESXHr EBPA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=ObPkY95BgccMVOXBvY8W42OB1HqUP7YDieqWubLPiME=; b=Cfb3bIYH/X5ffmWyrrvJ2r6l88JBTlwNYhVYwZ1kPPhLPvLtOHCo5g1OyHk4RuJDN6 hihtvhEx/uo6wAcvAdslydHefgcyhQwm3+XmVsTDvb++XMubb0RLNzFmLGFA7GntEgBh zas0bdN4o3+Xp8Ki1j2+plHhpjLcGFLEdKL6NQkj4bBL3vIAHKOiFFBuDpbcfjMja6ei p4+GFLpRAa35tNqgvbMu4TGFF9zSAexU/cjDG2RsDXxqNh7/OwPa0yFv7DQvNxrDhcvj gmYB0AQTUKQLcnCqyKndANtpbLv8CW52/RPOHz6dE4jRbP8+fpdiHtpB56QZK/o2r/w5 8fbA== X-Gm-Message-State: AOAM530uSIuKV+NjYSJ8uzL68ywRxE5GTZdZPQXM2wJuA/CHQhqQz2Kx 6MVKq2J6YSR/BSYgSG8cZiBDUmMp/AnKKypt4zWQhD+HM9KlaiEkqlXgimpcBbrRhjd5kaRQT3T 32mvdtUDTygsupZBOsJzuL08fYBnijlBa6pvVwjEKfj2QTa0k6HX1Qqty X-Google-Smtp-Source: ABdhPJxfsOO100ggYeO0sY++gqlc3MsdnvEUTquZH6o+BnI8n97BCwdldFsnuDGXBQoFBIYNUCH/1sFG8Qo= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:42d8:: with SMTP id p207mr9419451yba.270.1629268285077; Tue, 17 Aug 2021 23:31:25 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:06 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-11-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 10/11] mm: multigenerational lru: Kconfig From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=Jtstdp0V; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf11.hostedemail.com: domain of 3PakcYQYKCBIGCHzs6y66y3w.u64305CF-442Dsu2.69y@flex--yuzhao.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3PakcYQYKCBIGCHzs6y66y3w.u64305CF-442Dsu2.69y@flex--yuzhao.bounces.google.com X-Stat-Signature: wbnonwezckp6pkry7sy33qudztsfo9q4 X-Rspamd-Queue-Id: C16AFF0058AD X-Rspamd-Server: rspam05 X-HE-Tag: 1629268285-115850 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add configuration options for the multigenerational lru. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- mm/Kconfig | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index 40a9bfcd5062..4cd257cfdf84 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -889,4 +889,63 @@ config IO_MAPPING config SECRETMEM def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED +# the multigenerational lru { +config LRU_GEN + bool "Multigenerational LRU" + depends on MMU + # the following options may leave not enough spare bits in page->flags + depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP) + help + A high performance LRU implementation to heavily overcommit workloads + that are not IO bound. See Documentation/vm/multigen_lru.rst for + details. + + Warning: do not enable this option unless you plan to use it because + it introduces a small per-process and per-memcg and per-node memory + overhead. + +config LRU_GEN_ENABLED + bool "Turn on by default" + depends on LRU_GEN + help + The default value of /sys/kernel/mm/lru_gen/enabled is 0. This option + changes it to 1. + + Warning: the default value is the fast path. See + Documentation/static-keys.txt for details. + +config LRU_GEN_STATS + bool "Full stats for debugging" + depends on LRU_GEN + help + This option keeps full stats for each generation, which can be read + from /sys/kernel/debug/lru_gen_full. + + Warning: do not enable this option unless you plan to use it because + it introduces an additional small per-process and per-memcg and + per-node memory overhead. + +config NR_LRU_GENS + int "Max number of generations" + depends on LRU_GEN + range 4 31 + default 7 + help + This will use order_base_2(N+1) spare bits from page flags. + + Warning: do not use numbers larger than necessary because each + generation introduces a small per-node and per-memcg memory overhead. + +config TIERS_PER_GEN + int "Number of tiers per generation" + depends on LRU_GEN + range 2 5 + default 4 + help + This will use N-2 spare bits from page flags. + + Larger values generally offer better protection to active pages under + heavy buffered I/O workloads. +# } + endmenu From patchwork Wed Aug 18 06:31:07 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12443399 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 85412C4338F for ; Wed, 18 Aug 2021 06:31:39 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 37A5B60F11 for ; Wed, 18 Aug 2021 06:31:39 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 37A5B60F11 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 25A286B0085; Wed, 18 Aug 2021 02:31:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1E4B48D0001; Wed, 18 Aug 2021 02:31:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F29206B0088; Wed, 18 Aug 2021 02:31:27 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0215.hostedemail.com [216.40.44.215]) by kanga.kvack.org (Postfix) with ESMTP id D31CC6B0085 for ; Wed, 18 Aug 2021 02:31:27 -0400 (EDT) Received: from smtpin11.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 65DCA22892 for ; Wed, 18 Aug 2021 06:31:27 +0000 (UTC) X-FDA: 78487229814.11.5C300F7 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf05.hostedemail.com (Postfix) with ESMTP id 22BC45048BB2 for ; Wed, 18 Aug 2021 06:31:27 +0000 (UTC) Received: by mail-yb1-f202.google.com with SMTP id n20-20020a2540140000b0290593b8e64cd5so1837294yba.3 for ; Tue, 17 Aug 2021 23:31:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=EuYCgt8+lkHBd5kOQmKG+gfowm77wjwHGDYaQfqgXkM=; b=jf1KXu7fmEUdMsKv8qqPS1uXlv+56ikFPzF13yc3+KTvUt6asr/dfm42U7m8z2oaxW A1lZnhfuZsNz6idvEFIx7MwSYBYmByXzQK4ED92Tl/aOYre4fO0pStwZP5hfQyZoLhpA k9RuTVA9AcmArHPO54uF/Ki4EABqeALUSuj9BtbZVk/q1wrblw+DakX+xFVyH0ZPllpN QRr9BLXmVbCFxjf9LCMyU2/3AuDTtwNHaELFoGy21fgiwitZOhlfWs7IBfT24zSAzv5z gcDc1UyinDFGNwilfZtrE/BXLAPyfDw73xKZU00G6uLKrvxEGHYLjgWMad1/KSypqjGn s8Lg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=EuYCgt8+lkHBd5kOQmKG+gfowm77wjwHGDYaQfqgXkM=; b=bpiAyfCtnbVkWWkC4NcCa/hvNnmnf6xRF1m0fFRGnwecJGS0w262LZqzVbxoeQBvSm MWqCNMWz/r+IAMkMEnQriPzrzmN5bnLpGek7cCxpZ5uS+GUOPfa8fvupyhsNoqY4vaUE CcWqxTi3vSEKb+HmcenIywGxM+NysnOi20GQPMwwzeOzpzL344XNS8cGV+twx2ABb929 KedIbHHb4lNSlhnFA1CeIfWDBCTFTKFMl+gOM5r6Nr7wnUZSyDlJdG7QjNAWvGL+IV6M 09U73b0C1/UxR3+TG0Aq2tol5eGmRa6VWIKC5/1loFYTZVMCsBmBO4sR/Yhp2G+Aa7wp Rvyg== X-Gm-Message-State: AOAM533+43noaAkuihpXxsnRGizyoA+MqzrgGJmzbTF/yoZ5rDVRSEmI 1XV/a6O4TzpPo4fCtQk7U2V6pdMwTmWsijUigwaqACx8vsoKbQSYJqYccH9apUFjPaz4mGEGngr 0ADDt2XZ8kuTST9dnWeM/OyHpR43CeWBegLqpSeX4M+VQWUTTmbgwdNMZ X-Google-Smtp-Source: ABdhPJyHuC5mbVB/nEzN5pzEvd35CiT+uZVTZdGa9/B0v3WwM0r7rwsMuOQ4SlTfuOuU4UXhfMr9DYHhRjw= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:41f0:f89:87cd:8bd0]) (user=yuzhao job=sendgmr) by 2002:a25:f310:: with SMTP id c16mr8599656ybs.464.1629268286452; Tue, 17 Aug 2021 23:31:26 -0700 (PDT) Date: Wed, 18 Aug 2021 00:31:07 -0600 In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com> Message-Id: <20210818063107.2696454-12-yuzhao@google.com> Mime-Version: 1.0 References: <20210818063107.2696454-1-yuzhao@google.com> X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog Subject: [PATCH v4 11/11] mm: multigenerational lru: documentation From: Yu Zhao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Hillf Danton , page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=jf1KXu7f; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf05.hostedemail.com: domain of 3PqkcYQYKCBMHDI0t7z77z4x.v75416DG-553Etv3.7Az@flex--yuzhao.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3PqkcYQYKCBMHDI0t7z77z4x.v75416DG-553Etv3.7Az@flex--yuzhao.bounces.google.com X-Stat-Signature: c4nq3g7ct4oiq961sdqddouod8zp6x84 X-Rspamd-Queue-Id: 22BC45048BB2 X-Rspamd-Server: rspam05 X-HE-Tag: 1629268287-160661 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add Documentation/vm/multigen_lru.rst. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- Documentation/vm/index.rst | 1 + Documentation/vm/multigen_lru.rst | 134 ++++++++++++++++++++++++++++++ 2 files changed, 135 insertions(+) create mode 100644 Documentation/vm/multigen_lru.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index eff5fbd492d0..c353b3f55924 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -17,6 +17,7 @@ various features of the Linux memory management swap_numa zswap + multigen_lru Kernel developers MM documentation ================================== diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..adedff5319d9 --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,134 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Multigenerational LRU +===================== + +Quick Start +=========== +Build Configurations +-------------------- +:Required: Set ``CONFIG_LRU_GEN=y``. + +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by + default. + +Runtime Configurations +---------------------- +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the + feature was not turned on by default. + +:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to + protect the working set of ``N`` milliseconds. The OOM killer is + invoked if this working set cannot be kept in memory. + +:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature + is turned on. This file has the following output: + +:: + + memcg memcg_id memcg_path + node node_id + min_gen birth_time anon_size file_size + ... + max_gen birth_time anon_size file_size + +``min_gen`` is the oldest generation number and ``max_gen`` is the +youngest generation number. ``birth_time`` is in milliseconds. +``anon_size`` and ``file_size`` are in pages. + +Phones/Laptops/Workstations +--------------------------- +No additional configurations required. + +Servers/Data Centers +-------------------- +:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a + larger number. + +:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger + number. + +:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``. + +:Working set estimation: Write ``+ memcg_id node_id max_gen + [swappiness]`` to ``/sys/kernel/debug/lru_gen`` to invoke the aging, + which scans PTEs for accessed pages and then creates the next + generation ``max_gen+1``. A swap file and a non-zero ``swappiness``, + which overrides ``vm.swappiness``, are required to scan PTEs mapping + anon pages. + +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness] + [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the + eviction, which evicts generations less than or equal to ``min_gen``. + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and + ``max_gen-1`` are not fully aged and therefore cannot be evicted. + ``nr_to_reclaim`` can be used to limit the number of pages to evict. + Multiple command lines are supported, so does concatenation with + delimiters ``,`` and ``;``. + +Framework +========= +For each ``lruvec``, evictable pages are divided into multiple +generations. The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[2]`` separately for anon and file types as clean +file pages can be evicted regardless of swap and writeback +constraints. These three variables are monotonically increasing. +Generation numbers are truncated into +``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into +``page->flags``. The sliding window technique is used to prevent +truncated generation numbers from overlapping. Each truncated +generation number is an index to an array of per-type and per-zone +lists ``lrugen->lists``. + +Each generation is then divided into multiple tiers. Tiers represent +levels of usage from file descriptors only. Pages accessed ``N`` times +via file descriptors belong to tier ``order_base_2(N)``. Each +generation contains at most ``CONFIG_TIERS_PER_GEN`` tiers, and they +require additional ``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``. +In contrast to moving across generations which requires list +operations, moving across tiers only involves operations on +``page->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors refault rates of all tiers +and decides when to protect pages from which tiers. + +The framework comprises two conceptually independent components: the +aging and the eviction, which can be invoked separately from user +space for the purpose of working set estimation and proactive reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, the aging +traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()`` +to scan PTEs for accessed pages (a ``mm_struct`` list is maintained +for each ``memcg``). Upon finding one, the aging updates its +generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``). +After each round of traversal, the aging increments ``max_seq``. The +aging is due when both ``min_seq[2]`` have caught up with +``max_seq-1``. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, the +eviction scans pages on the per-zone lists indexed by anon and file +``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to +select a type based on the values of ``min_seq[2]``. If they are +equal, it selects the type that has a lower refault rate. The eviction +sorts a page according to its updated generation number if the aging +has found this page accessed. It also moves a page to the next +generation if this page is from an upper tier that has a higher +refault rate than the base tier. The eviction increments +``min_seq[2]`` of a selected type when it finds all the per-zone lists +indexed by ``min_seq[2]`` of this selected type are empty. + +To-do List +========== +KVM Optimization +---------------- +Support shadow page table walk. + +NUMA Optimization +----------------- +Optimize page table walk for NUMA.