From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
        aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED,
        DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
        INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,
        USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no
        version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
        by smtp.lore.kernel.org (Postfix) with ESMTP id 99A73C4360C
        for <linux-kernel@archiver.kernel.org>; Sat, 13 Mar 2021 07:59:07 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
        by mail.kernel.org (Postfix) with ESMTP id 7C9EA64F1E
        for <linux-kernel@archiver.kernel.org>; Sat, 13 Mar 2021 07:59:07 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233629AbhCMH6s (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Sat, 13 Mar 2021 02:58:48 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59030 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233011AbhCMH6Q (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Sat, 13 Mar 2021 02:58:16 -0500
Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D3F77C061574
        for <linux-kernel@vger.kernel.org>; Fri, 12 Mar 2021 23:58:15 -0800 (PST)
Received: by mail-yb1-xb49.google.com with SMTP id y7so31766185ybh.20
        for <linux-kernel@vger.kernel.org>; Fri, 12 Mar 2021 23:58:15 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20161025;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
        :cc;
        bh=b33M625vmeFm8iJFKYIy9IbS+yyXzJHrz2YlprWAE88=;
        b=qyBWCu6iSCz/+GOTBSyjEGx0UNh3wx8I4EpB+DGhW3FtbTsYmoVsgkJK7K9lMib92D
        8UESs064HgmPaCcFC9wummpEDT04EZB57UgnWkSzwsmT8q8yKbsLNsdnaqxDho13rxSL
        l1lhvY8XggaGyQS76caURzCZzmuuIb31yoMyJa36cSNEQIIGzS/Qm0HS9FQ4Sslqjhio
        7G+7M9RsfMDtCuFijNWCkO0VasJ5hLLwIPnUW2My7qRxlwAQoGToYUEA5ipkn9Ckz+I1
        ZZZL32LyYugyxj8DFHGhkOK2vtm0J8rqvkbb7eJOL7RwHttQzqGHotvqWMTx+tw95ZXr
        O2hQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
        :references:subject:from:to:cc;
        bh=b33M625vmeFm8iJFKYIy9IbS+yyXzJHrz2YlprWAE88=;
        b=c/gz9vPAEJyp/EJn3y5EFyXbzSo7i/3uPaPFgvAuDmTSD+ba6y1WjRuclOCOCm5zrN
        rzU2v5yfzmo6p+fXYM+C5uFH8SqC+cMK+bYnZUuNl3OwvgwL3kofqlcrnKaQQMqRRSey
        lrW47VbiLF0IySg6AM605BkxEjKxVIkdnWAS+bqXtQVym74gxHHOHwX5tGuCn/7Bs/c4
        cekamb+2vIeO+6/P0YD1ZdLO9LS1OwgxSor5cocBSXXW7J7bBLCGKAyU++NrNnahjJs1
        8ckqrHiqFC1mEWFcs+VBQ9/NCeVfcAfMGRtsMsljI5kll/myniTnSrETf49UyjrcTk87
        zwYQ==
X-Gm-Message-State: AOAM531AI2qRYUeaR85JEsd20wl4qCSV4g5Qpav0tPw9JXRAxcdYk9S4
        AlH8Rls2BT9Ub2LGipv5Jfv5X2rI3ho=
X-Google-Smtp-Source: ABdhPJwyYS+6AjG1nAXtEYBKUiLdJ5mAmkUQWOq9ngRJz7So3XTjiRhhO4QmZvWvkKbzXQ7oleb6ep4AFYs=
X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:f931:d3e4:faa0:4f74])
        (user=yuzhao job=sendgmr) by 2002:a25:8003:: with SMTP id m3mr24313264ybk.452.1615622295044;
        Fri, 12 Mar 2021 23:58:15 -0800 (PST)
Date: Sat, 13 Mar 2021 00:57:47 -0700
In-Reply-To: <20210313075747.3781593-1-yuzhao@google.com>
Message-Id: <20210313075747.3781593-15-yuzhao@google.com>
Mime-Version: 1.0
References: <20210313075747.3781593-1-yuzhao@google.com>
X-Mailer: git-send-email 2.31.0.rc2.261.g7f71774620-goog
Subject: [PATCH v1 14/14] mm: multigenerational lru: documentation
From: Yu Zhao <yuzhao@google.com>
To: linux-mm@kvack.org
Cc: Alex Shi <alex.shi@linux.alibaba.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Hillf Danton <hdanton@sina.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Joonsoo Kim <iamjoonsoo.kim@lge.com>,
        Matthew Wilcox <willy@infradead.org>,
        Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@suse.com>,
        Roman Gushchin <guro@fb.com>, Vlastimil Babka <vbabka@suse.cz>,
        Wei Yang <richard.weiyang@linux.alibaba.com>,
        Yang Shi <shy828301@gmail.com>,
        Ying Huang <ying.huang@intel.com>,
        linux-kernel@vger.kernel.org, page-reclaim@google.com,
        Yu Zhao <yuzhao@google.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Archived-At: <https://lore.kernel.org/lkml/20210313075747.3781593-15-yuzhao@google.com/>
List-Archive: <https://lore.kernel.org/lkml/>
List-Post: <mailto:linux-kernel@vger.kernel.org>

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
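Not for the commit log: a minimal userspace sketch, in Python, of how
the two debugfs command formats listed under "Recipes" in
multigen_lru.rst could be assembled. The helper names (aging_cmd,
eviction_cmd, submit) are hypothetical and not part of this series.
The rule that nr_to_reclaim can only follow swappiness is an assumption
drawn from both fields being positional in the documented syntax.

```python
from typing import Iterable, Optional

# Debugfs file named in the documentation; writing to it needs root and
# a kernel built with CONFIG_LRU_GEN=y.
LRU_GEN_DEBUGFS = "/sys/kernel/debug/lru_gen"


def aging_cmd(memcg_id: int, node_id: int, max_gen: int,
              swappiness: Optional[int] = None) -> str:
    """Format '+ memcg_id node_id gen [swappiness]'; gen must be max_gen."""
    fields = ["+", memcg_id, node_id, max_gen]
    if swappiness is not None:
        fields.append(swappiness)
    return " ".join(str(f) for f in fields)


def eviction_cmd(memcg_id: int, node_id: int, gen: int, max_gen: int,
                 swappiness: Optional[int] = None,
                 nr_to_reclaim: Optional[int] = None) -> str:
    """Format '- memcg_id node_id gen [swappiness] [nr_to_reclaim]'."""
    # max_gen and max_gen-1 are active generations and cannot be evicted.
    if gen >= max_gen - 1:
        raise ValueError("gen must be less than max_gen-1")
    # Assumption: the optional fields are positional, so nr_to_reclaim
    # can only be given together with swappiness.
    if nr_to_reclaim is not None and swappiness is None:
        raise ValueError("nr_to_reclaim requires swappiness")
    fields = ["-", memcg_id, node_id, gen]
    fields += [f for f in (swappiness, nr_to_reclaim) if f is not None]
    return " ".join(str(f) for f in fields)


def submit(cmds: Iterable[str], path: str = LRU_GEN_DEBUGFS) -> None:
    """Write several commands at once, joined with the ';' delimiter."""
    with open(path, "w") as f:
        f.write(";".join(cmds))
```

For example, aging_cmd(1, 0, 7, 50) gives "+ 1 0 7 50" and
eviction_cmd(1, 0, 3, 7, 50, 1000) gives "- 1 0 3 50 1000"; the memcg
and node IDs here are made up and would normally be read from the
debugfs file first.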
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 210 ++++++++++++++++++++++++++++++
 2 files changed, 211 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst
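
Not for the commit log: a toy model, in Python, of the truncation of
generation numbers described under "Workflow" in multigen_lru.rst. It
only illustrates the modulo indexing and why the window between the
oldest and youngest generation has to stay narrower than the number of
lists. NR_LRU_GENS = 4 and both function names are illustrative
assumptions, and the ``page->flags`` bit packing is not modeled.

```python
from typing import List

NR_LRU_GENS = 4  # stands in for CONFIG_NR_LRU_GENS; value is an assumption


def gen_index(seq: int, nr_gens: int = NR_LRU_GENS) -> int:
    """Truncate a generation sequence number to a list index."""
    return seq % nr_gens


def window_indices(min_seq: int, max_seq: int,
                   nr_gens: int = NR_LRU_GENS) -> List[int]:
    """List indices used by generations min_seq..max_seq, oldest first."""
    # The window may hold at most nr_gens generations; otherwise two
    # live generations would share one index.
    if not 0 <= max_seq - min_seq < nr_gens:
        raise ValueError("window must hold between 1 and nr_gens generations")
    return [gen_index(seq, nr_gens) for seq in range(min_seq, max_seq + 1)]
```

With four lists, generations 6 through 9 map to indices [2, 3, 0, 1]:
the indices wrap around but never collide while the window holds.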

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..fea927da2572
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,210 @@
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Options
+-------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support
+  a maximum of ``X`` generations.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+  default.
+
+Runtime Options
+---------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+  feature was not turned on by default.
+
+:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N``
+  to spread pages out across ``N+1`` generations. ``N`` must be less
+  than ``X``. Larger values make the background aging more aggressive.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature.
+  This file has the following output:
+
+::
+
+  memcg  memcg_id  memcg_path
+    node  node_id
+      min_gen  birth_time  anon_size  file_size
+      ...
+      max_gen  birth_time  anon_size  file_size
+
+Given a memcg and a node, ``min_gen`` is the oldest generation
+(number) and ``max_gen`` is the youngest. Birth time is in
+milliseconds. Anon and file sizes are in pages.
+
+Recipes
+-------
+:Android on ARMv8.1+: ``X=4``, ``N=0``
+
+:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
+  ``ARM64_HW_AFDBM``
+
+:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
+
+:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
+  to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
+  generation ``max_gen`` and create the next generation ``max_gen+1``.
+  ``gen`` must be equal to ``max_gen`` in order to avoid races. A swap
+  file and a non-zero swappiness value are required to scan anon pages.
+  If swapping is not desired, set ``vm.swappiness`` to ``0`` and
+  override it with a non-zero ``swappiness`` in the command.
+
+:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
+  [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
+  generations less than or equal to ``gen``. ``gen`` must be less than
+  ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active generations
+  and therefore protected from eviction. ``nr_to_reclaim`` can be
+  used to limit the number of pages to be evicted. Multiple command
+  lines are supported, as is concatenation with the delimiters ``,``
+  and ``;``.
+
+Workflow
+========
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in ``max_seq``
+for both anon and file types as they are aged on an equal footing. The
+oldest generation numbers are stored in ``min_seq[2]`` separately for
+anon and file types as clean file pages can be evicted regardless of
+swap and write-back constraints. Generation numbers are truncated into
+``ilog2(CONFIG_NR_LRU_GENS)+1`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index into an array of per-type and per-zone
+lists. Evictable pages are added to the per-zone lists indexed by
+``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``),
+depending on whether they are being faulted in or read ahead. The
+workflow comprises two conceptually independent functions: the aging
+and the eviction.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+scans page tables for referenced pages of this ``lruvec``. Upon
+finding one, the aging updates its generation number to ``max_seq``.
+After each round of scanning, the aging increments ``max_seq``. The
+aging maintains either a system-wide ``mm_struct`` list or per-memcg
+``mm_struct`` lists, and it only scans page tables of processes that
+have been scheduled since the last scan. Since scans are differential
+with respect to referenced pages, the cost is roughly proportional to
+their number.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans the pages on the per-zone lists indexed by either of
+``min_seq[2]``. It selects a type according to the values of
+``min_seq[2]`` and swappiness. During a scan, the eviction either
+sorts or isolates a page, depending on whether the aging has updated
+its generation number. When it finds that all the per-zone lists are
+empty, the eviction increments ``min_seq[2]`` indexed by the selected
+type. The eviction triggers the aging when both ``min_seq[2]`` values
+reach ``max_seq-1``, assuming both anon and file types are reclaimable.
+
+Rationale
+=========
+Characteristics of cloud workloads
+----------------------------------
+With cloud storage gone mainstream, the role of local storage has
+diminished. For most of the systems running cloud workloads, anon
+pages account for the majority of memory consumption and page cache
+contains mostly executable pages. Notably, the unmapped portion of
+page cache is negligible.
+
+As a result, swapping is necessary to achieve substantial memory
+overcommit. And the ``rmap`` is the hottest part of the reclaim path
+because its usage is proportional to the number of scanned pages,
+which on average is many times the number of reclaimed pages.
+
+With ``zram``, a typical ``kswapd`` profile on v5.11 looks like:
+
+::
+
+  31.03%  page_vma_mapped_walk
+  25.59%  lzo1x_1_do_compress
+   4.63%  do_raw_spin_lock
+   3.89%  vma_interval_tree_iter_next
+   3.33%  vma_interval_tree_subtree_search
+
+And with real swap, it looks like:
+
+::
+
+  45.16%  page_vma_mapped_walk
+   7.61%  do_raw_spin_lock
+   5.69%  vma_interval_tree_iter_next
+   4.91%  vma_interval_tree_subtree_search
+   3.71%  page_referenced_one
+
+Limitations of the Current Implementation
+-----------------------------------------
+Notion of the Active/Inactive
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+For servers equipped with hundreds of gigabytes of memory, the
+granularity of the active/inactive lists is too coarse to be useful
+for job scheduling. And false active/inactive rates are relatively high.
+
+For phones and laptops, the eviction is biased toward file pages
+because the selection has to resort to heuristics as direct
+comparisons between anon and file types are infeasible.
+
+For systems with multiple nodes and/or memcgs, it is impossible to
+compare ``lruvec``\s based on the notion of the active/inactive.
+
+Incremental Scans via the ``rmap``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each incremental scan picks up where the last scan left off and
+stops after it has found a handful of unreferenced pages. For most of
+the systems running cloud workloads, incremental scans lose their
+advantage under sustained memory pressure due to high ratios of the
+number of scanned pages to the number of reclaimed pages. On top of
+that, the ``rmap`` has poor memory locality due to its complex data
+structures. The combined effects typically result in high CPU usage
+in the reclaim path.
+
+Benefits of the Multigenerational LRU
+-------------------------------------
+Notion of Generation Numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notion of generation numbers introduces a quantitative approach to
+memory overcommit. A large number of pages can be spread out across a
+configurable number of generations, which keeps false active/inactive
+rates relatively low. Each generation includes all pages that have
+been referenced since the last generation.
+
+Given an ``lruvec``, scans and the selections between anon and file
+types are all based on generation numbers, which are simple and yet
+effective. For different ``lruvec``\s, comparisons are still possible
+based on birth times of generations.
+
+Differential Scans via Page Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each differential scan discovers all pages that have been referenced
+since the last scan. Specifically, it walks the ``mm_struct`` list
+associated with an ``lruvec`` to scan page tables of processes that
+have been scheduled since the last scan. The cost of each differential
+scan is roughly proportional to the number of referenced pages it
+discovers. Unless address spaces are extremely sparse, page tables
+usually have better memory locality than the ``rmap``. The end result
+is generally a significant reduction in CPU usage for most of the
+systems running cloud workloads.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table walk.
+
+NUMA Optimization
+-----------------
+Add per-node RSS for ``should_skip_mm()``.
+
+Refault Tracking Optimization
+-----------------------------
+Use generation numbers rather than LRU positions in
+``workingset_eviction()`` and ``workingset_refault()``.
-- 
2.31.0.rc2.261.g7f71774620-goog
