From: Wei Wang <wei.w.wang@intel.com>
Date: Tue, 9 Jan 2018 19:10:58 +0800
Message-Id: <1515496262-7533-2-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1515496262-7533-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v21 1/5] xbitmap: Introduce xbitmap

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster. It supports up to
'unsigned long' worth of bits.

Signed-off-by: Matthew Wilcox
Signed-off-by: Wei Wang
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Michael S. Tsirkin
Cc: Tetsuo Handa
---
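A note for orientation: a minimal usage sketch of the API this patch
introduces. Illustrative only; my_xb, mark_pfn() and walk_bits() are
hypothetical names, not part of the patch:

	static DEFINE_XB(my_xb);

	/* Record one bit; returns 0 on success or -ENOMEM. */
	static int mark_pfn(unsigned long pfn)
	{
		int err;

		if (xb_preload(GFP_KERNEL) < 0)	/* preallocates tree nodes */
			return -ENOMEM;
		err = xb_set_bit(&my_xb, pfn);
		xb_preload_end();		/* re-enables preemption */
		return err;
	}

	/* Visit every set bit, lowest first. */
	static void walk_bits(void)
	{
		unsigned long bit = 0;

		while (xb_find_set(&my_xb, ULONG_MAX, &bit)) {
			/* use 'bit' here ... */
			bit++;	/* step past the bit just found */
		}
	}

The self-tests at the bottom of lib/xbitmap.c and the caller in patch 2
follow this same pattern.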
 include/linux/xbitmap.h                  |  48 ++++
 lib/Makefile                             |   2 +-
 lib/radix-tree.c                         |  38 ++-
 lib/xbitmap.c                            | 444 ++++++++++++++++++++++++++++++
 tools/include/linux/bitmap.h             |  34 +++
 tools/include/linux/kernel.h             |   2 +
 tools/testing/radix-tree/Makefile        |  17 +-
 tools/testing/radix-tree/linux/kernel.h  |   2 -
 tools/testing/radix-tree/linux/xbitmap.h |   1 +
 tools/testing/radix-tree/main.c          |   4 +
 tools/testing/radix-tree/test.h          |   1 +
 11 files changed, 583 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c
 create mode 100644 tools/testing/radix-tree/linux/xbitmap.h

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..c008309
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation
+ * Author: Matthew Wilcox <mawilcox@microsoft.com>
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ *
+ * Locking is to be provided by the user. No xb_ function is safe to
+ * call concurrently with any other xb_ function.
+ */
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long min, unsigned long max);
+void xb_fill(struct xb *xb, unsigned long min, unsigned long max);
+bool xb_find_set(const struct xb *xb, unsigned long max, unsigned long *bit);
+bool xb_find_zero(const struct xb *xb, unsigned long max, unsigned long *bit);
+
+static inline bool xb_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+int __must_check xb_preload(gfp_t);
+
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
diff --git a/lib/Makefile b/lib/Makefile
index d11c48e..08a8183 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -19,7 +19,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c8d5556..d2bd8fe 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -37,7 +37,7 @@
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
 #include <linux/string.h>
-
+#include <linux/xbitmap.h>
 
 /* Number of nodes in fully populated tree of given height */
 static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
@@ -77,6 +77,11 @@ static struct kmem_cache *radix_tree_node_cachep;
 						RADIX_TREE_MAP_SHIFT))
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
 /*
  * Per-cpu pool of preloaded nodes
  */
@@ -1781,7 +1786,7 @@ void __rcu **radix_tree_next_chunk(const struct radix_tree_root *root,
 			child = rcu_dereference_raw(node->slots[offset]);
 	}
 
-	if (!child)
+	if (!child && !is_idr(root))
 		goto restart;
 	if (child == RADIX_TREE_RETRY)
 		break;
@@ -2135,6 +2140,35 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
 }
 EXPORT_SYMBOL(ida_pre_get);
 
+/**
+ * xb_preload - preload for xb_set_bit()
+ * @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). On success,
+ * return zero, with preemption disabled. On error, return -ENOMEM with
+ * preemption not disabled.
+ */
+int xb_preload(gfp_t gfp)
+{
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return -ENOMEM;
+		/*
+		 * The per-CPU variable is updated with preemption enabled.
+		 * If the calling task is unlucky to be scheduled to another
+		 * CPU which has no ida_bitmap allocation, it will be detected
+		 * when setting a bit (i.e. xb_set_bit()).
+		 */
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+
+	return __radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+}
+EXPORT_SYMBOL(xb_preload);
+
 void __rcu **idr_get_free_cmn(struct radix_tree_root *root,
 			      struct radix_tree_iter *iter, gfp_t gfp,
 			      unsigned long max)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..62b2211
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,444 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * XBitmap implementation
+ * Copyright (c) 2017 Microsoft Corporation
+ * Author: Matthew Wilcox <mawilcox@microsoft.com>
+ */
+
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ * xb_set_bit() - Set a bit in the XBitmap.
+ * @xb: The XBitmap.
+ * @bit: Index of the bit to set.
+ *
+ * This function is used to set a bit in the xbitmap.
+ *
+ * Return: 0 on success. -ENOMEM if memory could not be allocated.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_iter iter;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+
+	bit %= IDA_BITMAP_BITS;
+	radix_tree_iter_init(&iter, index);
+	slot = idr_get_free_cmn(root, &iter, GFP_NOWAIT | __GFP_NOWARN, index);
+	if (IS_ERR(slot)) {
+		if (slot == ERR_PTR(-ENOSPC))
+			return 0;	/* Already set */
+		return -ENOMEM;
+	}
+	bitmap = rcu_dereference_raw(*slot);
+	if (!bitmap) {
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -ENOMEM;
+		memset(bitmap, 0, sizeof(*bitmap));
+		radix_tree_iter_replace(root, &iter, slot, bitmap);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	if (bitmap_full(bitmap->bitmap, IDA_BITMAP_BITS))
+		radix_tree_iter_tag_clear(root, &iter, IDR_FREE);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit() - Clear a bit in the XBitmap.
+ * @xb: The XBitmap.
+ * @bit: Index of the bit to clear.
+ *
+ * This function is used to clear a bit in the xbitmap.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_iter iter;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+
+	bit %= IDA_BITMAP_BITS;
+	slot = radix_tree_iter_lookup(root, &iter, index);
+	if (!slot)
+		return;
+	bitmap = radix_tree_deref_slot(slot);
+	if (!bitmap)
+		return;
+
+	radix_tree_iter_tag_set(root, &iter, IDR_FREE);
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		radix_tree_iter_delete(root, &iter, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_zero() - Clear a range of bits in the XBitmap.
+ * @xb: The XBitmap.
+ * @min: The first bit to clear.
+ * @max: The last bit to clear.
+ *
+ * This function is used to clear a range of bits in the xbitmap.
+ */
+void xb_zero(struct xb *xb, unsigned long min, unsigned long max)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_iter iter;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long index = min / IDA_BITMAP_BITS;
+	unsigned long first = min % IDA_BITMAP_BITS;
+	unsigned long maxindex = max / IDA_BITMAP_BITS;
+
+	radix_tree_for_each_slot(slot, root, &iter, index) {
+		unsigned long nbits = IDA_BITMAP_BITS;
+
+		if (index > maxindex)
+			break;
+		bitmap = radix_tree_deref_slot(slot);
+		if (!bitmap)
+			continue;
+		radix_tree_iter_tag_set(root, &iter, IDR_FREE);
+
+		if (!first && iter.index < maxindex)
+			goto delete;
+		if (iter.index == maxindex)
+			nbits = max % IDA_BITMAP_BITS + 1;
+		bitmap_clear(bitmap->bitmap, first, nbits - first);
+		first = 0;
+		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS))
+			goto delete;
+		continue;
+delete:
+		kfree(bitmap);
+		radix_tree_iter_delete(root, &iter, slot);
+	}
+}
+EXPORT_SYMBOL(xb_zero);
+
+/**
+ * xb_test_bit() - Test a bit in the xbitmap.
+ * @xb: The XBitmap.
+ * @bit: Index of the bit to test.
+ *
+ * This function is used to test a bit in the xbitmap.
+ *
+ * Return: %true if the bit is set.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct ida_bitmap *bitmap = radix_tree_lookup(&xb->xbrt, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+/**
+ * xb_find_set() - Find the next set bit in a range of bits.
+ * @xb: The XBitmap.
+ * @max: The maximum position to search.
+ * @bit: The first bit to examine, and on exit, the found bit.
+ *
+ * On entry, @bit points to the index of the first bit to search. On exit,
+ * if this function returns %true, @bit will be updated to the index of the
+ * first found bit. It will not be updated if this function returns %false.
+ *
+ * Return: %true if a set bit was found.
+ */
+bool xb_find_set(const struct xb *xb, unsigned long max, unsigned long *bit)
+{
+	struct radix_tree_iter iter;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long index = *bit / IDA_BITMAP_BITS;
+	unsigned int first = *bit % IDA_BITMAP_BITS;
+	unsigned long maxindex = max / IDA_BITMAP_BITS;
+
+	radix_tree_for_each_slot(slot, &xb->xbrt, &iter, index) {
+		if (iter.index > maxindex)
+			break;
+		bitmap = radix_tree_deref_slot(slot);
+		if (bitmap) {
+			unsigned int nbits = IDA_BITMAP_BITS;
+
+			if (iter.index == maxindex)
+				nbits = max % IDA_BITMAP_BITS + 1;
+			first = find_next_bit(bitmap->bitmap, nbits, first);
+			if (first != nbits) {
+				*bit = first + iter.index * IDA_BITMAP_BITS;
+				return true;
+			}
+		}
+		first = 0;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(xb_find_set);
+
+/**
+ * xb_find_zero() - Find the next zero bit in a range of bits
+ * @xb: The XBitmap.
+ * @max: The maximum index to search.
+ * @bit: Pointer to an index.
+ *
+ * On entry, @bit points to the index of the first bit to search. On exit,
+ * if this function returns %true, @bit will be updated to the index of the
+ * first found bit. It will not be updated if this function returns %false.
+ *
+ * Return: %true if a clear bit was found.
+ */
+bool xb_find_zero(const struct xb *xb, unsigned long max, unsigned long *bit)
+{
+	void __rcu **slot;
+	struct radix_tree_iter iter;
+	struct ida_bitmap *bitmap;
+	unsigned long index = *bit / IDA_BITMAP_BITS;
+	unsigned long first = *bit % IDA_BITMAP_BITS;
+	unsigned long maxindex = max / IDA_BITMAP_BITS;
+
+	radix_tree_for_each_tagged(slot, &xb->xbrt, &iter, index, IDR_FREE) {
+		unsigned int nbits = IDA_BITMAP_BITS;
+
+		if (iter.index > maxindex)
+			return false;
+		bitmap = radix_tree_deref_slot(slot);
+		if (!bitmap)
+			break;
+		if (iter.index == maxindex)
+			nbits = max % IDA_BITMAP_BITS + 1;
+		first = find_next_zero_bit(bitmap->bitmap, nbits, first);
+		if (first != nbits)
+			break;
+		first = 0;
+	}
+
+	*bit = first + iter.index * IDA_BITMAP_BITS;
+	return true;
+}
+EXPORT_SYMBOL(xb_find_zero);
+
+#ifndef __KERNEL__
+
+static DEFINE_XB(xb1);
+
+static void xbitmap_check_bit(unsigned long bit)
+{
+	unsigned long nbit = 0;
+
+	xb_preload(GFP_KERNEL);
+	assert(!xb_test_bit(&xb1, bit));
+	assert(xb_set_bit(&xb1, bit) == 0);
+	assert(xb_test_bit(&xb1, bit));
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == true);
+	assert(nbit == bit);
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == true);
+	assert(nbit == bit);
+	nbit++;
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == false);
+	assert(nbit == bit + 1);
+	xb_clear_bit(&xb1, bit);
+	assert(xb_empty(&xb1));
+	xb_clear_bit(&xb1, bit);
+	assert(xb_empty(&xb1));
+	nbit = 0;
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == false);
+	assert(nbit == 0);
+	xb_preload_end();
+}
+
+/*
+ * In the following tests, preload is called once when all the bits to set
+ * locate in the same ida bitmap. Otherwise, it is recommended to call
+ * preload for each xb_set_bit.
+ */
+static void xbitmap_check_bit_range(void)
+{
+	unsigned long nbit = 0;
+
+	/* Regular test1: node = NULL */
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&xb1, 700);
+	xb_preload_end();
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == true);
+	assert(nbit == 700);
+	nbit++;
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == false);
+	assert(nbit == 701);
+	xb_zero(&xb1, 0, 1023);
+
+	/*
+	 * Regular test2
+	 * set bit 2000, 2001, 2040
+	 * Next 1 in [0, 2048]		--> 2000
+	 * Next 1 in [2000, 2002]	--> 2000
+	 * Next 1 in [2002, 2040]	--> 2040
+	 * Next 1 in [2002, 2039]	--> none
+	 * Next 0 in [2000, 2048]	--> 2002
+	 * Next 0 in [2048, 2060]	--> 2048
+	 */
+	xb_preload(GFP_KERNEL);
+	assert(!xb_set_bit(&xb1, 2000));
+	assert(!xb_set_bit(&xb1, 2001));
+	assert(!xb_set_bit(&xb1, 2040));
+	nbit = 0;
+	assert(xb_find_set(&xb1, 2048, &nbit) == true);
+	assert(nbit == 2000);
+	assert(xb_find_set(&xb1, 2002, &nbit) == true);
+	assert(nbit == 2000);
+	nbit = 2002;
+	assert(xb_find_set(&xb1, 2040, &nbit) == true);
+	assert(nbit == 2040);
+	nbit = 2002;
+	assert(xb_find_set(&xb1, 2039, &nbit) == false);
+	assert(nbit == 2002);
+	nbit = 2000;
+	assert(xb_find_zero(&xb1, 2048, &nbit) == true);
+	assert(nbit == 2002);
+	nbit = 2048;
+	assert(xb_find_zero(&xb1, 2060, &nbit) == true);
+	assert(nbit == 2048);
+	xb_zero(&xb1, 0, 2048);
+	nbit = 0;
+	assert(xb_find_set(&xb1, 2048, &nbit) == false);
+	assert(nbit == 0);
+	xb_preload_end();
+
+	/*
+	 * Overflow tests:
+	 * Set bit 1 and ULONG_MAX - 4
+	 * Next 1 in [0, ULONG_MAX]			--> 1
+	 * Next 1 in [1, ULONG_MAX]			--> 1
+	 * Next 1 in [2, ULONG_MAX]			--> ULONG_MAX - 4
+	 * Next 1 in [ULONG_MAX - 3, 2]			--> none
+	 * Next 0 in [ULONG_MAX - 4, ULONG_MAX]		--> ULONG_MAX - 3
+	 * Zero [ULONG_MAX - 4, ULONG_MAX]
+	 * Next 1 in [ULONG_MAX - 10, ULONG_MAX]	--> none
+	 * Next 1 in [ULONG_MAX - 1, 2]			--> none
+	 * Zero [0, 1]
+	 * Next 1 in [0, 2]				--> none
+	 */
+	xb_preload(GFP_KERNEL);
+	assert(!xb_set_bit(&xb1, 1));
+	xb_preload_end();
+	xb_preload(GFP_KERNEL);
+	assert(!xb_set_bit(&xb1, ULONG_MAX - 4));
+	nbit = 0;
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == true);
+	assert(nbit == 1);
+	nbit = 1;
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == true);
+	assert(nbit == 1);
+	nbit = 2;
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == true);
+	assert(nbit == ULONG_MAX - 4);
+	nbit++;
+	assert(xb_find_set(&xb1, 2, &nbit) == false);
+	assert(nbit == ULONG_MAX - 3);
+	nbit--;
+	assert(xb_find_zero(&xb1, ULONG_MAX, &nbit) == true);
+	assert(nbit == ULONG_MAX - 3);
+	xb_zero(&xb1, ULONG_MAX - 4, ULONG_MAX);
+	nbit = ULONG_MAX - 10;
+	assert(xb_find_set(&xb1, ULONG_MAX, &nbit) == false);
+	assert(nbit == ULONG_MAX - 10);
+	nbit = ULONG_MAX - 1;
+	assert(xb_find_set(&xb1, 2, &nbit) == false);
+	xb_zero(&xb1, 0, 1);
+	nbit = 0;
+	assert(xb_find_set(&xb1, 2, &nbit) == false);
+	assert(nbit == 0);
+	xb_preload_end();
+	assert(xb_empty(&xb1));
+}
+
+static void xbitmap_check_zero_bits(void)
+{
+	assert(xb_empty(&xb1));
+
+	/* Zero an empty xbitmap should work though no real work to do */
+	xb_zero(&xb1, 0, ULONG_MAX);
+	assert(xb_empty(&xb1));
+
+	xb_preload(GFP_KERNEL);
+	assert(xb_set_bit(&xb1, 0) == 0);
+	xb_preload_end();
+
+	/* Overflow test */
+	xb_zero(&xb1, ULONG_MAX - 10, ULONG_MAX);
+	assert(xb_test_bit(&xb1, 0));
+
+	xb_preload(GFP_KERNEL);
+	assert(xb_set_bit(&xb1, ULONG_MAX) == 0);
+	xb_preload_end();
+
+	xb_zero(&xb1, 0, ULONG_MAX);
+	assert(xb_empty(&xb1));
+}
+
+/* Check that setting an already-full bitmap works */
+static void xbitmap_check_set(unsigned long base)
+{
+	unsigned long i;
+
+	assert(xb_empty(&xb1));
+
+	for (i = 0; i < 64 * 1024; i++) {
+		xb_preload(GFP_KERNEL);
+		assert(xb_set_bit(&xb1, base + i) == 0);
+		xb_preload_end();
+	}
+	for (i = 0; i < 64 * 1024; i++)
+		assert(xb_set_bit(&xb1, base + i) == 0);
+
+	for (i = 0; i < 64 * 1024; i++)
+		xb_clear_bit(&xb1, base + i);
+
+	assert(xb_empty(&xb1));
+}
+
+static void xbitmap_checks(void)
+{
+	xb_init(&xb1);
+	xbitmap_check_bit(0);
+	xbitmap_check_bit(30);
+	xbitmap_check_bit(31);
+	xbitmap_check_bit(62);
+	xbitmap_check_bit(63);
+	xbitmap_check_bit(64);
+	xbitmap_check_bit(700);
+	xbitmap_check_bit(1023);
+	xbitmap_check_bit(1024);
+	xbitmap_check_bit(1025);
+	xbitmap_check_bit((1UL << 63) | (1UL << 24));
+	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+	xbitmap_check_bit_range();
+	xbitmap_check_zero_bits();
+	xbitmap_check_set(0);
+	xbitmap_check_set(1024);
+	xbitmap_check_set(1024 * 64);
+}
+
+int __weak main(void)
+{
+	radix_tree_init();
+	xbitmap_checks();
+}
+#endif
diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index ca16027..8d0bc1b 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -37,6 +37,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
 	}
 }
 
+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+				  int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
+
+static inline __always_inline void bitmap_clear(unsigned long *map,
+						unsigned int start,
+						unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 {
 	unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 0ad8844..3c992ae 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -13,6 +13,8 @@
 #define UINT_MAX	(~0U)
 #endif
 
+#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 
 #define PERF_ALIGN(x, a)	__PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index fa7ee36..788e526 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -1,12 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 
 CFLAGS += -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address
-LDFLAGS += -fsanitize=address
-LDLIBS+= -lpthread -lurcu
-TARGETS = main idr-test multiorder
+LDFLAGS += -fsanitize=address $(LDLIBS)
+LDLIBS := -lpthread -lurcu
+TARGETS = main idr-test multiorder xbitmap
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
-	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+	 xbitmap.o
 
 ifndef SHIFT
 SHIFT=3
@@ -25,8 +26,10 @@ idr-test: idr-test.o $(CORE_OFILES)
 
 multiorder: multiorder.o $(CORE_OFILES)
 
+xbitmap: xbitmap.o $(CORE_OFILES)
+
 clean:
-	$(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h
+	$(RM) $(TARGETS) *.o radix-tree.c idr.c xbitmap.c generated/map-shift.h
 
 vpath %.c ../../lib
 
@@ -34,6 +37,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
 	../../include/linux/*.h \
 	../../include/asm/*.h \
 	../../../include/linux/radix-tree.h \
+	../../../include/linux/xbitmap.h \
 	../../../include/linux/idr.h
 
 radix-tree.c: ../../../lib/radix-tree.c
@@ -42,6 +46,9 @@ radix-tree.c: ../../../lib/radix-tree.c
 idr.c: ../../../lib/idr.c
 	sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
 
+xbitmap.c: ../../../lib/xbitmap.c
+	sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
+
 .PHONY: mapshift
 
 mapshift:
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index c3bc3f3..426f32f 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -17,6 +17,4 @@
 #define pr_debug printk
 #define pr_cont printk
 
-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
-
 #endif /* _KERNEL_H */
diff --git a/tools/testing/radix-tree/linux/xbitmap.h b/tools/testing/radix-tree/linux/xbitmap.h
new file mode 100644
index 0000000..61de214
--- /dev/null
+++ b/tools/testing/radix-tree/linux/xbitmap.h
@@ -0,0 +1 @@
+#include "../../../../include/linux/xbitmap.h"
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index 257f3f8..d112363 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -326,6 +326,10 @@ static void single_thread_tests(bool long_run)
 	rcu_barrier();
 	printv(2, "after idr_checks: %d allocated, preempt %d\n",
 		nr_allocated, preempt_count);
+	xbitmap_checks();
+	rcu_barrier();
+	printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+		nr_allocated, preempt_count);
 	big_gang_check(long_run);
 	rcu_barrier();
 	printv(2, "after big_gang_check: %d allocated, preempt %d\n",
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index d9c031d..8175d6b 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -38,6 +38,7 @@ void benchmark(void);
 void idr_checks(void);
 void ida_checks(void);
 void ida_thread_tests(void);
+void xbitmap_checks(void);
 
 struct item *
 item_tag_set(struct radix_tree_root *root, unsigned long index, int tag);
-- 
2.7.4
From: Wei Wang <wei.w.wang@intel.com>
Date: Tue, 9 Jan 2018 19:10:59 +0800
Message-Id: <1515496262-7533-3-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1515496262-7533-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v21 2/5] virtio-balloon: VIRTIO_BALLOON_F_SG

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
host.

The previous virtio-balloon implementation is inefficient because
balloon pages are transferred to the host one PFN array at a time. Here
is the percentage breakdown of the time spent on each step of the
balloon inflating process (inflating 7GB of an 8GB idle guest):

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The
profiling above shows that the bottlenecks are steps 2) and 4).

This patch optimizes step 2) by transferring pages to the host in sgs,
where an sg describes a chunk of physically contiguous guest pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page. With this
new feature, the above ballooning process takes ~460ms, an improvement
of ~89%.

TODO:
- optimize stage 1) by allocating/freeing a chunk of pages instead of a
  single page each time.
- sort the internal balloon page queue.
- enable OOM to free inflated pages maintained in the local temporary
  list.

Signed-off-by: Wei Wang
Signed-off-by: Liang Li
Suggested-by: Michael S. Tsirkin
Cc: Tetsuo Handa
---
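A note for reviewers: the heart of the optimization is coalescing the
ballooned pfns recorded in the xbitmap into maximal runs of consecutive
pfns, one sg per run. A minimal user-space sketch of the same idea
(hypothetical; a sorted pfn array stands in for the xbitmap that
tell_host_sgs() below walks with xb_find_set()/xb_find_zero()):

	#include <stddef.h>

	/* Visit each maximal run of consecutive pfns; one run == one sg. */
	static void for_each_pfn_run(const unsigned long *pfns, size_t n,
				     void (*emit)(unsigned long start_pfn,
						  unsigned long npages))
	{
		size_t i = 0;

		while (i < n) {
			size_t j = i + 1;

			while (j < n && pfns[j] == pfns[j - 1] + 1)
				j++;
			emit(pfns[i], j - i);	/* run covers pfns[i..j-1] */
			i = j;
		}
	}

In the patch itself each run is additionally split so that a single sg
never describes more than round_down(UINT_MAX, PAGE_SIZE) bytes, since
the sg length field is 32 bits wide.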
 drivers/virtio/virtio_balloon.c     | 233 +++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 216 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a1fb52c..10876ea 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
+{
+	struct scatterlist sg;
+	unsigned int unused;
+	int err;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	err = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the sg will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
+}
+
+static void batch_balloon_page_sg(struct virtio_balloon *vb,
+				  struct virtqueue *vq,
+				  unsigned long pfn,
+				  uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long pfn_start, pfn_end;
+	uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
+	uint64_t len;
+
+	pfn_start = page_xb_start;
+	while (pfn_start < page_xb_end) {
+		if (!xb_find_set(&vb->page_xb, page_xb_end, &pfn_start))
+			break;
+		pfn_end = pfn_start + 1;
+		if (!xb_find_zero(&vb->page_xb, page_xb_end, &pfn_end))
+			pfn_end = page_xb_end + 1;
+		len = (pfn_end - pfn_start) << PAGE_SHIFT;
+		while (len > max_len) {
+			batch_balloon_page_sg(vb, vq, pfn_start, max_len);
+			pfn_start += max_len >> PAGE_SHIFT;
+			len -= max_len;
+		}
+		batch_balloon_page_sg(vb, vq, pfn_start, (uint32_t)len);
+		pfn_start = pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_zero(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
+static inline int xb_set_page(struct virtio_balloon *vb,
+			      struct page *page,
+			      unsigned long *pfn_min,
+			      unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int ret;
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+
+	do {
+		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
+			return -ENOMEM;
+
+		ret = xb_set_bit(&vb->page_xb, pfn);
+		xb_preload_end();
+	} while (unlikely(ret == -EAGAIN));
+
+	return ret;
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	unsigned num_allocated_pages;
 	unsigned num_pfns;
 	struct page *page;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (num_pfns = 0; num_pfns < num;
 	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
@@ -173,8 +291,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	while ((page = balloon_page_pop(&pages))) {
 		balloon_page_enqueue(&vb->vb_dev_info, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				__free_page(page);
+				continue;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -184,8 +309,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -211,9 +340,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -223,7 +355,14 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				balloon_page_enqueue(&vb->vb_dev_info, page);
+				break;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -234,13 +373,55 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
 
+/*
+ * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
+ * for xbitmap, which is not suitable for the oom case. This function does not
+ * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
+ * pages when VIRTIO_BALLOON_F_SG is negotiated.
+ */
+static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
+{
+	unsigned int n;
+	struct page *page;
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	struct virtqueue *vq = vb->deflate_vq;
+	LIST_HEAD(pages);
+
+	mutex_lock(&vb->balloon_lock);
+	for (n = 0; n < oom_pages; n++) {
+		page = balloon_page_dequeue(vb_dev_info);
+		if (!page)
+			break;
+
+		list_add(&page->lru, &pages);
+		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+		batch_balloon_page_sg(vb, vb->deflate_vq, page_to_pfn(page),
+				      PAGE_SIZE);
+		release_pages_balloon(vb, &pages);
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+	mutex_unlock(&vb->balloon_lock);
+
+	return n;
+}
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
@@ -380,7 +561,10 @@ static int virtballoon_oom_notify(struct notifier_block *self,
 		return NOTIFY_OK;
 
 	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
+		num_freed_pages = leak_balloon_sg_oom(vb);
+	else
+		num_freed_pages = leak_balloon(vb, oom_pages);
 	update_balloon_size(vb);
 	*freed += num_freed_pages;
 
@@ -477,6 +661,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -498,16 +683,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->inflate_vq, page_to_pfn(newpage), PAGE_SIZE);
+		kick_and_wait(vb->inflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->deflate_vq, page_to_pfn(page), PAGE_SIZE);
+		kick_and_wait(vb->deflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -566,6 +759,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -682,6 +878,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

From: Wei Wang <wei.w.wang@intel.com>
Date: Tue, 9 Jan 2018 19:11:00 +0800
Message-Id: <1515496262-7533-4-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1515496262-7533-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v21 3/5] mm: support reporting free page blocks

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.

Signed-off-by: Wei Wang
Signed-off-by: Liang Li
Cc: Michal Hocko
Cc: Michael S. Tsirkin
Acked-by: Michal Hocko
---
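A note for reviewers: a minimal sketch of a caller, to make the callback
contract concrete (hypothetical; count_free and total are illustrative
names). The callback is invoked under zone->lock, so it must not sleep
or allocate:

	/* Tally free pages; returning true keeps the walk going. */
	static bool count_free(void *opaque, unsigned long pfn,
			       unsigned long num)
	{
		unsigned long *total = opaque;

		*total += num;
		return true;
	}

	/* The walk itself may sleep, so call it from process context: */
	unsigned long total = 0;

	walk_free_mem_block(&total, 0, count_free);

Patch 4 plugs virtio_balloon_send_free_pages() into exactly this
interface.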
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea818ff..b3077dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1938,6 +1938,12 @@ extern void free_area_init_node(int nid, unsigned long *zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque,
+				int min_order,
+				bool (*report_pfn_range)(void *opaque,
+							 unsigned long pfn,
+							 unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688..705de22 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4899,6 +4899,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+				struct zone *zone,
+				int order,
+				enum migratetype mt,
+				bool (*report_pfn_range)(void *,
+							 unsigned long,
+							 unsigned long))
+{
+	struct page *page;
+	struct list_head *list;
+	unsigned long pfn, flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list = &zone->free_area[order].free_list[mt];
+	list_for_each_entry(page, list, lru) {
+		pfn = page_to_pfn(page);
+		ret = report_pfn_range(opaque, pfn, 1 << order);
+		if (!ret)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+			 int min_order,
+			 bool (*report_pfn_range)(void *opaque,
+						  unsigned long pfn,
+						  unsigned long num))
+{
+	struct zone *zone;
+	int order;
+	enum migratetype mt;
+	bool ret;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				ret = walk_free_page_list(opaque, zone,
+							  order, mt,
+							  report_pfn_range);
+				if (!ret)
+					return;
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4
From: Wei Wang <wei.w.wang@intel.com>
Date: Tue, 9 Jan 2018 19:11:01 +0800
Message-Id: <1515496262-7533-5-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1515496262-7533-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v21 4/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.

Host requests the guest to report free pages by sending a new cmd id to
the guest via the free_page_report_cmd_id configuration register. When
the guest starts to report, the first element added to the free page vq
is the cmd id given by host. When the guest finishes the reporting of all
the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added to the
vq to tell host that the reporting is done. Host may also request the
guest to stop the reporting in advance by sending the stop cmd id to the
guest via the configuration register.

Signed-off-by: Wei Wang
Signed-off-by: Liang Li
Cc: Michael S. Tsirkin
Cc: Michal Hocko
---
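A note for reviewers: the handshake described above, laid out as a
sequence (a sketch only; N stands for any cmd id the host picks, with 0
reserved as VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID):

	host:  writes cmd id N (N != STOP_ID) to free_page_report_cmd_id
	guest: config change handler queues report_free_page_work
	guest: outbuf: N                       (start marker)
	guest: inbuf:  sg, sg, ...             (free page hints, batched)
	guest: outbuf: STOP_ID                 (reporting finished)

The host can write STOP_ID to free_page_report_cmd_id at any time to
make the guest abort the walk early; the guest checks the flag in
virtio_balloon_send_free_pages() on every callback invocation.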
 drivers/virtio/virtio_balloon.c     | 202 ++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   4 +
 2 files changed, 174 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 10876ea..f7e8830 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The free page reporting work item submitted to the balloon wq */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +70,13 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Start to report free pages */
+	bool report_free_page;
+	/* Stores the cmd id given by host to start the free page reporting */
+	uint32_t start_cmd_id;
+	/* Stores STOP_ID as a sign to tell host that the reporting is done */
+	uint32_t stop_cmd_id;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -189,6 +201,28 @@ static void batch_balloon_page_sg(struct virtio_balloon *vb,
 	kick_and_wait(vq, vb->acked);
 }
 
+static void batch_free_page_sg(struct virtqueue *vq,
+			       unsigned long pfn,
+			       uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		virtqueue_kick(vq);
+}
+
+static void send_cmd_id(struct virtio_balloon *vb, void *addr)
+{
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, addr, sizeof(uint32_t));
+	err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb, GFP_KERNEL);
+	BUG_ON(err);
+	virtqueue_kick(vb->free_page_vq);
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -497,17 +531,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-	struct virtio_balloon *vb = vdev->priv;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vb->stop_update_lock, flags);
-	if (!vb->stop_update)
-		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	s64 target;
@@ -524,6 +547,36 @@ static inline s64 towards_target(struct virtio_balloon *vb)
 	return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+	unsigned long flags;
+	__u32 cmd_id;
+	s64 diff = towards_target(vb);
+
+	if (diff) {
+		spin_lock_irqsave(&vb->stop_update_lock, flags);
+		if (!vb->stop_update)
+			queue_work(system_freezable_wq,
+				   &vb->update_balloon_size_work);
+		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+	}
+
+	virtio_cread(vb->vdev, struct virtio_balloon_config,
+		     free_page_report_cmd_id, &cmd_id);
+	if (cmd_id == VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) {
+		WRITE_ONCE(vb->report_free_page, false);
+	} else if (cmd_id != vb->start_cmd_id) {
+		/*
+		 * Host requests to start the reporting by sending a new cmd
+		 * id.
+		 */
+		WRITE_ONCE(vb->report_free_page, true);
+		vb->start_cmd_id = cmd_id;
+		queue_work(vb->balloon_wq, &vb->report_free_page_work);
+	}
+}
+
 static void update_balloon_size(struct virtio_balloon *vb)
 {
 	u32 actual = vb->num_pages;
@@ -601,40 +654,116 @@ static void update_balloon_size_func(struct work_struct *work)
 
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		vb->free_page_vq = vqs[i];
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
+}
+
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (!READ_ONCE(vb->report_free_page))
+		return false;
+
+	batch_free_page_sg(vb->free_page_vq, pfn, len);
+
+	return true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	/* Start by sending the obtained cmd id to the host with an outbuf */
+	send_cmd_id(vb, &vb->start_cmd_id);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	/*
+	 * End by sending the stop id to the host with an outbuf. Use the
	 * non-batching mode here to trigger a kick after adding the stop id.
+	 */
+	send_cmd_id(vb, &vb->stop_cmd_id);
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -762,6 +891,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->stop_cmd_id = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -826,6 +962,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -879,6 +1016,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..58f1274 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,15 +35,19 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* VQ to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+#define VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID	0
 struct virtio_balloon_config {
 	/* Number of pages host wants Guest to give up. */
 	__u32 num_pages;
 	/* Number of pages we've actually got in balloon. */
 	__u32 actual;
+	/* Free page report command id, readonly by guest */
+	__u32 free_page_report_cmd_id;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4
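For completeness, here is a rough sketch of the host-side consumer that the
free_page_vq protocol above assumes (illustrative only: pop_free_page_vq()
and mark_range_clean() are hypothetical stand-ins, not QEMU APIs). A 4-byte
element carries a cmd id; larger elements carry free-page ranges; the stop
id terminates a round of reporting:

#include <stdint.h>
#include <stdio.h>

#define STOP_ID 0 /* mirrors VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID */

struct vq_elem {
	uint32_t len;	/* sizeof(uint32_t) for a cmd id, else range length */
	uint64_t pfn;	/* first pfn of a reported free range */
	uint32_t val;	/* the cmd id, when this element carries one */
};

/* Simulated vq contents: cmd id 16, two free ranges, then the stop id. */
static struct vq_elem ring[] = {
	{ 4, 0, 16 },
	{ 1 << 20, 0x1000, 0 },
	{ 1 << 21, 0x8000, 0 },
	{ 4, 0, STOP_ID },
};
static unsigned int head;

static struct vq_elem pop_free_page_vq(void) { return ring[head++]; }

static void mark_range_clean(uint64_t pfn, uint32_t len)
{
	/* e.g. clear these pages from the migration dirty bitmap */
	printf("skip pfn %#llx, %u bytes\n", (unsigned long long)pfn,
	       (unsigned)len);
}

static void drain_free_page_hints(uint32_t expected_cmd_id)
{
	for (;;) {
		struct vq_elem e = pop_free_page_vq();

		if (e.len == sizeof(uint32_t)) {	/* cmd id element */
			if (e.val == STOP_ID)
				break;			/* guest is done */
			if (e.val != expected_cmd_id)
				return;			/* stale round */
		} else {
			mark_range_clean(e.pfn, e.len);
		}
	}
}

int main(void)
{
	drain_free_page_hints(16);
	return 0;
}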
From nobody Mon May 13 00:41:14 2024
Delivered-To: importer@patchew.org
Received-SPF: pass (zoho.com: domain of gnu.org designates 208.118.235.17 as permitted sender) client-ip=208.118.235.17; envelope-from=qemu-devel-bounces+importer=patchew.org@nongnu.org; helo=lists.gnu.org;
Authentication-Results: mx.zohomail.com; spf=pass (zoho.com: domain of gnu.org designates 208.118.235.17 as permitted sender) smtp.mailfrom=qemu-devel-bounces+importer=patchew.org@nongnu.org
Return-Path: 
Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) by mx.zohomail.com with SMTPS id 1515497521881602.5000977111164; Tue, 9 Jan 2018 03:32:01 -0800 (PST)
Received: from localhost ([::1]:47023 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eYs8P-00064m-4k for importer@patchew.org; Tue, 09 Jan 2018 06:32:01 -0500
Received: from eggs.gnu.org ([2001:4830:134:3::10]:35212) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eYs6A-0004ih-2B for qemu-devel@nongnu.org; Tue, 09 Jan 2018 06:29:45 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eYs66-0003Bf-SD for qemu-devel@nongnu.org; Tue, 09 Jan 2018 06:29:42 -0500
Received: from mga09.intel.com ([134.134.136.24]:54587) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1eYs66-0002zJ-Ja for qemu-devel@nongnu.org; Tue, 09 Jan 2018 06:29:38 -0500
Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 09 Jan 2018 03:29:38 -0800
Received: from devel-ww.sh.intel.com ([10.239.48.110]) by orsmga008.jf.intel.com with ESMTP; 09 Jan 2018 03:29:34 -0800
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.46,335,1511856000"; d="scan'208";a="9317582"
From: Wei Wang 
To: virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org,
 qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
 kvm@vger.kernel.org, linux-mm@kvack.org, mst@redhat.com, mhocko@kernel.org,
 akpm@linux-foundation.org, mawilcox@microsoft.com
Date: Tue, 9 Jan 2018 19:11:02 +0800
Message-Id: <1515496262-7533-6-git-send-email-wei.w.wang@intel.com>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1515496262-7533-1-git-send-email-wei.w.wang@intel.com>
References: <1515496262-7533-1-git-send-email-wei.w.wang@intel.com>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized.
X-Received-From: 134.134.136.24
Subject: [Qemu-devel] [PATCH v21 5/5] virtio-balloon: don't report free pages when page poisoning is enabled
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: 
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
Cc: aarcange@redhat.com, yang.zhang.wz@gmail.com, quan.xu0@gmail.com,
 david@redhat.com, penguin-kernel@I-love.SAKURA.ne.jp,
 liliang.opensource@gmail.com, willy@infradead.org, amit.shah@redhat.com,
 wei.w.wang@intel.com, cornelia.huck@de.ibm.com, pbonzini@redhat.com,
 nilal@redhat.com, mgorman@techsingularity.net
Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org
Sender: "Qemu-devel" 
X-ZohoMail: RSF_0 Z_629925259 SPT_0
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The guest free pages should not be discarded by the live migration thread
when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
skipping the transfer of such poisoned free pages will trigger false
positives when new pages are allocated and checked on the destination.

This patch adds a config field, poison_val. The guest writes to this
config field to tell the host the poisoning value in use. The value will
be 0 in the following cases: 1) PAGE_POISONING_NO_SANITY is enabled;
2) page poisoning is disabled; or 3) PAGE_POISONING_ZERO is enabled.

Signed-off-by: Wei Wang 
Suggested-by: Michael S. Tsirkin 
Cc: Michal Hocko 
---
 drivers/virtio/virtio_balloon.c     | 8 ++++++++
 include/uapi/linux/virtio_balloon.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f7e8830..7429894 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -859,6 +859,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
+	__u32 poison_val;
 	int err;
 
 	if (!vdev->config->get) {
@@ -896,6 +897,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
 		INIT_WORK(&vb->report_free_page_work, report_free_page);
 		vb->stop_cmd_id = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+		if (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
+		    !page_poisoning_enabled())
+			poison_val = 0;
+		else
+			poison_val = PAGE_POISON;
+		virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+			      poison_val, &poison_val);
 	}
 
 	vb->nb.notifier_call = virtballoon_oom_notify;
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 58f1274..f270e9e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -48,6 +48,8 @@ struct virtio_balloon_config {
 	__u32 actual;
 	/* Free page report command id, readonly by guest */
 	__u32 free_page_report_cmd_id;
+	/* Stores PAGE_POISON if page poisoning with sanity check is in use */
+	__u32 poison_val;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4
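As a closing illustration of the poison_val contract above: a minimal
sketch, assuming a hypothetical host-side policy (none of these names are
QEMU APIs). A non-zero poison_val means the guest re-checks the poison
pattern when pages are re-allocated, so a skipped (not transferred) free
page would be flagged as corruption on the destination unless its pattern
is preserved:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

static bool can_skip_free_page(uint32_t poison_val)
{
	/*
	 * poison_val == 0 covers exactly the safe cases from the commit
	 * message: poisoning disabled, PAGE_POISONING_ZERO (a skipped
	 * page reads back as zero on the destination anyway), or
	 * PAGE_POISONING_NO_SANITY (nothing is checked on re-allocation).
	 */
	return poison_val == 0;
}

int main(void)
{
	printf("poison_val 0x00: skip hint ok = %d\n", can_skip_free_page(0x00));
	printf("poison_val 0xaa: skip hint ok = %d\n", can_skip_free_page(0xaa));
	return 0;
}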