From nobody Sun May 5 04:38:25 2024
From: Wei Wang
To: virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org,
 qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
 kvm@vger.kernel.org, linux-mm@kvack.org, mst@redhat.com,
 mhocko@kernel.org, akpm@linux-foundation.org, mawilcox@microsoft.com
Cc: aarcange@redhat.com, yang.zhang.wz@gmail.com, david@redhat.com,
 liliang.opensource@gmail.com, willy@infradead.org, amit.shah@redhat.com,
 wei.w.wang@intel.com, quan.xu@aliyun.com, cornelia.huck@de.ibm.com,
 pbonzini@redhat.com, mgorman@techsingularity.net
Date: Sat, 30 Sep 2017 12:05:50 +0800
Message-Id: <1506744354-20979-2-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1506744354-20979-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap

From: Matthew Wilcox

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster. It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
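As a usage sketch (illustrative only, not part of the patch): a caller
wraps xb_set_bit() in a preload pair, because the radix tree node and the
ida bitmap may need to be preallocated, and xb_preload() returns with
preemption disabled:

    /* Illustrative sketch of the API below; record_pfn() is a made-up name. */
    static int record_pfn(struct xb *xb, unsigned long pfn)
    {
        int err;

        xb_preload(GFP_KERNEL);     /* returns with preemption disabled */
        err = xb_set_bit(xb, pfn);  /* 0, or -EAGAIN if preloading failed */
        xb_preload_end();           /* re-enables preemption */

        return err;
    }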
More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits;
2) when searching a bit, if the bit is not found in the slot, move on to
   the next slot directly;
3) add Tags to help searching.

Signed-off-by: Wei Wang
Cc: Matthew Wilcox
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Michael S. Tsirkin

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly rather than
      search the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first;
3) more efficient bit searching, like 2.A.
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  66 ++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  42 +++++++-
 lib/xbitmap.c              | 264 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 373 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..1cffeb3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
        const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
        void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+       struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
        struct radix_tree_node *,
        radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..f634bd9
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,66 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+    struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {                                               \
+    .xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),        \
+}
+#define DEFINE_XB(name)  struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+    INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+                                   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+                                    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+    return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+    preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n

 lib-y := ctype.o string.o vsprintf.o cmdline.o \
     rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-    idr.o int_sqrt.o extable.o \
+    idr.o xbitmap.o int_sqrt.o extable.o \
     sha1.o chacha20.o irq_regs.o argv_split.o \
     flex_proportions.o ratelimit.o show_mem.o \
     is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..1e15e30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE (IDA_MAX_PATH * 2 - 1)

 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index are used to find the
+ * corresponding item (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS   (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH     (DIV_ROUND_UP(XB_INDEX_BITS, \
+                                      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE (XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
                    offset, 0, 0);
            if (!child)
                return -ENOMEM;
+           if (is_idr(root))
+               all_tag_set(child, IDR_FREE);
            rcu_assign_pointer(*slot, node_to_entry(child));
            if (node)
                node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
    delete_node(root, node, update_node, private);
 }

-static bool __radix_tree_delete(struct radix_tree_root *root,
-               struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+           struct radix_tree_node *node, void __rcu **slot)
 {
    void *old = rcu_dereference_raw(*slot);
    int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,29 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }

 /**
+ * xb_preload - preload for xb_set_bit()
+ * @gfp: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+    if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+        preempt_disable();
+
+    if (!this_cpu_read(ida_bitmap)) {
+        struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+        if (!bitmap)
+            return;
+        bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+        kfree(bitmap);
+    }
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..4ab9ac2
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,264 @@
+#include <linux/export.h>
+#include <linux/xbitmap.h>
+
+/**
+ * xb_set_bit - set a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+    int err;
+    unsigned long index = bit / IDA_BITMAP_BITS;
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bitmap;
+    unsigned long ebit;
+
+    bit %= IDA_BITMAP_BITS;
+    ebit = bit + 2;
+
+    err = __radix_tree_create(root, index, 0, &node, &slot);
+    if (err)
+        return err;
+    bitmap = rcu_dereference_raw(*slot);
+    if (radix_tree_exception(bitmap)) {
+        unsigned long tmp = (unsigned long)bitmap;
+
+        if (ebit < BITS_PER_LONG) {
+            tmp |= 1UL << ebit;
+            rcu_assign_pointer(*slot, (void *)tmp);
+            return 0;
+        }
+        bitmap = this_cpu_xchg(ida_bitmap, NULL);
+        if (!bitmap)
+            return -EAGAIN;
+        memset(bitmap, 0, sizeof(*bitmap));
+        bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+        rcu_assign_pointer(*slot, bitmap);
+    }
+
+    if (!bitmap) {
+        if (ebit < BITS_PER_LONG) {
+            bitmap = (void *)((1UL << ebit) |
+                              RADIX_TREE_EXCEPTIONAL_ENTRY);
+            __radix_tree_replace(root, node, slot, bitmap, NULL,
+                                 NULL);
+            return 0;
+        }
+        bitmap = this_cpu_xchg(ida_bitmap, NULL);
+        if (!bitmap)
+            return -EAGAIN;
+        memset(bitmap, 0, sizeof(*bitmap));
+        __radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+    }
+
+    __set_bit(bit, bitmap->bitmap);
+    return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+    unsigned long index = bit / IDA_BITMAP_BITS;
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bitmap;
+    unsigned long ebit;
+
+    bit %= IDA_BITMAP_BITS;
+    ebit = bit + 2;
+
+    bitmap = __radix_tree_lookup(root, index, &node, &slot);
+    if (radix_tree_exception(bitmap)) {
+        unsigned long tmp = (unsigned long)bitmap;
+
+        if (ebit >= BITS_PER_LONG)
+            return;
+        tmp &= ~(1UL << ebit);
+        if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+            __radix_tree_delete(root, node, slot);
+        else
+            rcu_assign_pointer(*slot, (void *)tmp);
+        return;
+    }
+
+    if (!bitmap)
+        return;
+
+    __clear_bit(bit, bitmap->bitmap);
+    if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+        kfree(bitmap);
+        __radix_tree_delete(root, node, slot);
+    }
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @xb: the xbitmap tree used to record the bits
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * This function is used to clear a range of bits in the xbitmap. If all the
+ * bits of a bitmap become 0, that bitmap will be freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bitmap;
+    unsigned int nbits;
+
+    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+        unsigned long index = start / IDA_BITMAP_BITS;
+        unsigned long bit = start % IDA_BITMAP_BITS;
+
+        bitmap = __radix_tree_lookup(root, index, &node, &slot);
+        if (radix_tree_exception(bitmap)) {
+            unsigned long ebit = bit + 2;
+            unsigned long tmp = (unsigned long)bitmap;
+
+            if (ebit >= BITS_PER_LONG)
+                continue;
+            nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+            bitmap_clear(&tmp, ebit, nbits);
+            if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+                __radix_tree_delete(root, node, slot);
+            else
+                rcu_assign_pointer(*slot, (void *)tmp);
+        } else if (bitmap) {
+            nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+            if (nbits != IDA_BITMAP_BITS)
+                bitmap_clear(bitmap->bitmap, bit, nbits);
+
+            if (nbits == IDA_BITMAP_BITS ||
+                bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+                kfree(bitmap);
+                __radix_tree_delete(root, node, slot);
+            }
+        }
+    }
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+    unsigned long index = bit / IDA_BITMAP_BITS;
+    const struct radix_tree_root *root = &xb->xbrt;
+    struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+    bit %= IDA_BITMAP_BITS;
+
+    if (!bitmap)
+        return false;
+    if (radix_tree_exception(bitmap)) {
+        bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+        if (bit >= BITS_PER_LONG)
+            return false;
+        return (unsigned long)bitmap & (1UL << bit);
+    }
+
+    return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+                                      unsigned long end, bool set)
+{
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bmap;
+    unsigned long ret = end + 1;
+
+    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+        unsigned long index = start / IDA_BITMAP_BITS;
+        unsigned long bit = start % IDA_BITMAP_BITS;
+
+        bmap = __radix_tree_lookup(root, index, &node, &slot);
+        if (radix_tree_exception(bmap)) {
+            unsigned long tmp = (unsigned long)bmap;
+            unsigned long ebit = bit + 2;
+
+            if (ebit >= BITS_PER_LONG)
+                continue;
+            if (set)
+                ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+            else
+                ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+                                         ebit);
+            if (ret < BITS_PER_LONG)
+                return ret - 2 + IDA_BITMAP_BITS * index;
+        } else if (bmap) {
+            if (set)
+                ret = find_next_bit(bmap->bitmap,
+                                    IDA_BITMAP_BITS, bit);
+            else
+                ret = find_next_zero_bit(bmap->bitmap,
+                                         IDA_BITMAP_BITS, bit);
+            if (ret < IDA_BITMAP_BITS)
+                return ret + index * IDA_BITMAP_BITS;
+        } else if (!bmap && !set) {
+            return start;
+        }
+    }
+
+    return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+                                   unsigned long end)
+{
+    return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+                                    unsigned long end)
+{
+    return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
--
2.7.4


From nobody Sun May 5 04:38:25 2024
From: Wei Wang
Date: Sat, 30 Sep 2017 12:05:51 +0800
Message-Id: <1506744354-20979-3-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1506744354-20979-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v16 2/5] radix tree test suite: add tests for
 xbitmap

From: Matthew Wilcox

Add the following tests for xbitmap:
1) single bit test: single bit set/clear/find;
2) bit range test: set/clear a range of bits and find a 0 or 1 bit in the
   range.
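(Usage note, not part of the patch: since main.c below gains a call to
xbitmap_checks() in single_thread_tests(), the new tests should run under
the suite's usual userspace harness, e.g. something like
`make -C tools/testing/radix-tree && ./main`; the Makefile change below
also adds a standalone `xbitmap` link target.)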
Signed-off-by: Wei Wang
Cc: Matthew Wilcox
Cc: Andrew Morton
Cc: Michael S. Tsirkin
---
 tools/include/linux/bitmap.h            |  34 ++++
 tools/include/linux/kernel.h            |   2 +
 tools/testing/radix-tree/Makefile       |   7 +-
 tools/testing/radix-tree/linux/kernel.h |   2 -
 tools/testing/radix-tree/main.c         |   5 +
 tools/testing/radix-tree/test.h         |   1 +
 tools/testing/radix-tree/xbitmap.c      | 269 ++++++++++++++++++++++++++++++++
 7 files changed, 317 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/radix-tree/xbitmap.c

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index e8b9f51..890dab2 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -36,6 +36,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
    }
 }

+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+                                  int len)
+{
+    unsigned long *p = map + BIT_WORD(start);
+    const unsigned int size = start + len;
+    int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+    unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+    while (len - bits_to_clear >= 0) {
+        *p &= ~mask_to_clear;
+        len -= bits_to_clear;
+        bits_to_clear = BITS_PER_LONG;
+        mask_to_clear = ~0UL;
+        p++;
+    }
+    if (len) {
+        mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+        *p &= ~mask_to_clear;
+    }
+}
+
+static __always_inline void bitmap_clear(unsigned long *map,
+                                         unsigned int start,
+                                         unsigned int nbits)
+{
+    if (__builtin_constant_p(nbits) && nbits == 1)
+        __clear_bit(start, map);
+    else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+             __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+        memset((char *)map + start / 8, 0, nbits / 8);
+    else
+        __bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 {
    unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 77d2e94..21e90ee 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -12,6 +12,8 @@
 #define UINT_MAX (~0U)
 #endif

+#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))

 #define PERF_ALIGN(x, a) __PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 6a9480c..fc7cb422 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -5,7 +5,8 @@ LDLIBS+= -lpthread -lurcu
 TARGETS = main idr-test multiorder
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
-    tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+    tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+    xbitmap.o

 ifndef SHIFT
 SHIFT=3
@@ -24,6 +25,9 @@ idr-test: idr-test.o $(CORE_OFILES)

 multiorder: multiorder.o $(CORE_OFILES)

+xbitmap: xbitmap.o $(CORE_OFILES)
+	$(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
+
 clean:
	$(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h

@@ -33,6 +37,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
    ../../include/linux/*.h \
    ../../include/asm/*.h \
    ../../../include/linux/radix-tree.h \
+   ../../../include/linux/xbitmap.h \
    ../../../include/linux/idr.h

 radix-tree.c: ../../../lib/radix-tree.c
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index b21a77f..c1e6088 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -16,6 +16,4 @@
 #define pr_debug printk
 #define pr_cont printk

-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
-
 #endif /* _KERNEL_H */
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index bc9a784..6f4774e 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -337,6 +337,11 @@ static void single_thread_tests(bool long_run)
    rcu_barrier();
    printv(2, "after copy_tag_check: %d allocated, preempt %d\n",
        nr_allocated, preempt_count);
+
+   xbitmap_checks();
+   rcu_barrier();
+   printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+       nr_allocated, preempt_count);
 }

 int main(int argc, char **argv)
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index 0f8220c..f8dcdaa 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -36,6 +36,7 @@ void iteration_test(unsigned order, unsigned duration);
 void benchmark(void);
 void idr_checks(void);
 void ida_checks(void);
+void xbitmap_checks(void);
 void ida_thread_tests(void);

 struct item *
diff --git a/tools/testing/radix-tree/xbitmap.c b/tools/testing/radix-tree/xbitmap.c
new file mode 100644
index 0000000..2787cb2
--- /dev/null
+++ b/tools/testing/radix-tree/xbitmap.c
@@ -0,0 +1,269 @@
+#include <linux/bitmap.h>
+#include <linux/radix-tree.h>
+#include <linux/idr.h>
+#include "../../../include/linux/xbitmap.h"
+
+static DEFINE_XB(xb1);
+
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+    int err;
+    unsigned long index = bit / IDA_BITMAP_BITS;
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bitmap;
+    unsigned long ebit;
+
+    bit %= IDA_BITMAP_BITS;
+    ebit = bit + 2;
+
+    err = __radix_tree_create(root, index, 0, &node, &slot);
+    if (err)
+        return err;
+    bitmap = rcu_dereference_raw(*slot);
+    if (radix_tree_exception(bitmap)) {
+        unsigned long tmp = (unsigned long)bitmap;
+
+        if (ebit < BITS_PER_LONG) {
+            tmp |= 1UL << ebit;
+            rcu_assign_pointer(*slot, (void *)tmp);
+            return 0;
+        }
+        bitmap = this_cpu_xchg(ida_bitmap, NULL);
+        if (!bitmap)
+            return -EAGAIN;
+        memset(bitmap, 0, sizeof(*bitmap));
+        bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+        rcu_assign_pointer(*slot, bitmap);
+    }
+
+    if (!bitmap) {
+        if (ebit < BITS_PER_LONG) {
+            bitmap = (void *)((1UL << ebit) |
+                              RADIX_TREE_EXCEPTIONAL_ENTRY);
+            __radix_tree_replace(root, node, slot, bitmap, NULL,
+                                 NULL);
+            return 0;
+        }
+        bitmap = this_cpu_xchg(ida_bitmap, NULL);
+        if (!bitmap)
+            return -EAGAIN;
+        memset(bitmap, 0, sizeof(*bitmap));
+        __radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+    }
+
+    __set_bit(bit, bitmap->bitmap);
+    return 0;
+}
+
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+    unsigned long index = bit / IDA_BITMAP_BITS;
+    const struct radix_tree_root *root = &xb->xbrt;
+    struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+    bit %= IDA_BITMAP_BITS;
+
+    if (!bitmap)
+        return false;
+    if (radix_tree_exception(bitmap)) {
+        bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+        if (bit >= BITS_PER_LONG)
+            return false;
+        return (unsigned long)bitmap & (1UL << bit);
+    }
+
+    return test_bit(bit, bitmap->bitmap);
+}
+
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+    unsigned long index = bit / IDA_BITMAP_BITS;
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bitmap;
+    unsigned long ebit;
+
+    bit %= IDA_BITMAP_BITS;
+    ebit = bit + 2;
+
+    bitmap = __radix_tree_lookup(root, index, &node, &slot);
+    if (radix_tree_exception(bitmap)) {
+        unsigned long tmp = (unsigned long)bitmap;
+
+        if (ebit >= BITS_PER_LONG)
+            return;
+        tmp &= ~(1UL << ebit);
+        if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+            __radix_tree_delete(root, node, slot);
+        else
+            rcu_assign_pointer(*slot, (void *)tmp);
+        return;
+    }
+
+    if (!bitmap)
+        return;
+
+    __clear_bit(bit, bitmap->bitmap);
+    if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+        kfree(bitmap);
+        __radix_tree_delete(root, node, slot);
+    }
+}
+
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bitmap;
+    unsigned int nbits;
+
+    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+        unsigned long index = start / IDA_BITMAP_BITS;
+        unsigned long bit = start % IDA_BITMAP_BITS;
+
+        bitmap = __radix_tree_lookup(root, index, &node, &slot);
+        if (radix_tree_exception(bitmap)) {
+            unsigned long ebit = bit + 2;
+            unsigned long tmp = (unsigned long)bitmap;
+
+            if (ebit >= BITS_PER_LONG)
+                continue;
+            nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+            bitmap_clear(&tmp, ebit, nbits);
+            if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+                __radix_tree_delete(root, node, slot);
+            else
+                rcu_assign_pointer(*slot, (void *)tmp);
+        } else if (bitmap) {
+            nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+            if (nbits != IDA_BITMAP_BITS)
+                bitmap_clear(bitmap->bitmap, bit, nbits);
+
+            if (nbits == IDA_BITMAP_BITS ||
+                bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+                kfree(bitmap);
+                __radix_tree_delete(root, node, slot);
+            }
+        }
+    }
+}
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+                                      unsigned long end, bool set)
+{
+    struct radix_tree_root *root = &xb->xbrt;
+    struct radix_tree_node *node;
+    void **slot;
+    struct ida_bitmap *bmap;
+    unsigned long ret = end + 1;
+
+    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+        unsigned long index = start / IDA_BITMAP_BITS;
+        unsigned long bit = start % IDA_BITMAP_BITS;
+
+        bmap = __radix_tree_lookup(root, index, &node, &slot);
+        if (radix_tree_exception(bmap)) {
+            unsigned long tmp = (unsigned long)bmap;
+            unsigned long ebit = bit + 2;
+
+            if (ebit >= BITS_PER_LONG)
+                continue;
+            if (set)
+                ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+            else
+                ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+                                         ebit);
+            if (ret < BITS_PER_LONG)
+                return ret - 2 + IDA_BITMAP_BITS * index;
+        } else if (bmap) {
+            if (set)
+                ret = find_next_bit(bmap->bitmap,
+                                    IDA_BITMAP_BITS, bit);
+            else
+                ret = find_next_zero_bit(bmap->bitmap,
+                                         IDA_BITMAP_BITS, bit);
+            if (ret < IDA_BITMAP_BITS)
+                return ret + index * IDA_BITMAP_BITS;
+        } else if (!bmap && !set) {
+            return start;
+        }
+    }
+
+    return ret;
+}
+
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+                                   unsigned long end)
+{
+    return xb_find_next_bit(xb, start, end, 1);
+}
+
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+                                    unsigned long end)
+{
+    return xb_find_next_bit(xb, start, end, 0);
+}
+
+static void xbitmap_check_bit(unsigned long bit)
+{
+    xb_preload(GFP_KERNEL);
+
+    assert(!xb_test_bit(&xb1, bit));
+    assert(!xb_set_bit(&xb1, bit));
+    assert(xb_test_bit(&xb1, bit));
+    xb_clear_bit(&xb1, bit);
+    assert(xb_is_empty(&xb1));
+
+    xb_preload_end();
+}
+
+static void xbitmap_check_bit_range(void)
+{
+    xb_preload(GFP_KERNEL);
+
+    /* Set a range of bits */
+    assert(!xb_set_bit(&xb1, 1060));
+    assert(!xb_set_bit(&xb1, 1061));
+    assert(!xb_set_bit(&xb1, 1064));
+    assert(!xb_set_bit(&xb1, 1065));
+    assert(!xb_set_bit(&xb1, 8180));
+    assert(!xb_set_bit(&xb1, 8181));
+    assert(!xb_set_bit(&xb1, 8190));
+    assert(!xb_set_bit(&xb1, 8191));
+
+    /* Test a range of bits */
+    assert(xb_find_next_set_bit(&xb1, 0, 10000) == 1060);
+    assert(xb_find_next_zero_bit(&xb1, 1061, 10000) == 1062);
+    assert(xb_find_next_set_bit(&xb1, 1062, 10000) == 1064);
+    assert(xb_find_next_zero_bit(&xb1, 1065, 10000) == 1066);
+    assert(xb_find_next_set_bit(&xb1, 1066, 10000) == 8180);
+    assert(xb_find_next_zero_bit(&xb1, 8180, 10000) == 8182);
+    xb_clear_bit_range(&xb1, 0, 1000000);
+    assert(xb_find_next_set_bit(&xb1, 0, 10000) == 10001);
+
+    assert(xb_find_next_zero_bit(&xb1, 20000, 30000) == 20000);
+
+    xb_preload_end();
+}
+
+void xbitmap_checks(void)
+{
+    xb_init(&xb1);
+
+    xbitmap_check_bit(0);
+    xbitmap_check_bit(30);
+    xbitmap_check_bit(31);
+    xbitmap_check_bit(1023);
+    xbitmap_check_bit(1024);
+    xbitmap_check_bit(1025);
+    xbitmap_check_bit((1UL << 63) | (1UL << 24));
+    xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+    xbitmap_check_bit_range();
+}
--
2.7.4


From nobody Sun May 5 04:38:25 2024
From: Wei Wang
Date: Sat, 30 Sep 2017 12:05:52 +0800
Message-Id: <1506744354-20979-4-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1506744354-20979-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
host.

The implementation of the previous virtio-balloon is not very efficient,
because the balloon pages are transferred to the host one by one. Here is
the breakdown of the time in percentage spent on each step of the balloon
inflating process (inflating 7GB of an 8GB idle guest):

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are stage 2) and stage 4).

This patch optimizes step 2) by transferring pages to the host in sgs.
An sg describes a chunk of guest physically contiguous pages. With this
mechanism, step 4) can also be optimized by doing address translation and
madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~492ms, an
improvement of ~88%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of
a single page each time.
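To make the chunking concrete, here is an illustrative sketch (the
patch's actual implementation is tell_host_sgs() below; report_chunk() is
a made-up helper) of how maximal runs of set bits in the page xbitmap
become (address, length) sgs:

    /*
     * Illustrative sketch only -- not patch code. Walks pfns
     * [pfn_min, pfn_max] recorded in the xbitmap and reports each
     * maximal run of set bits as one chunk; report_chunk() is
     * hypothetical.
     */
    static void for_each_pfn_run(struct xb *xb, unsigned long pfn_min,
                                 unsigned long pfn_max)
    {
        unsigned long first = pfn_min, next_zero;

        while (first <= pfn_max) {
            first = xb_find_next_set_bit(xb, first, pfn_max);
            if (first == pfn_max + 1)
                break;      /* no more set bits in the range */
            next_zero = xb_find_next_zero_bit(xb, first + 1, pfn_max);
            /* pfns [first, next_zero) are physically contiguous */
            report_chunk(pfn_to_kaddr(first),
                         (next_zero - first) << PAGE_SHIFT);
            first = next_zero + 1;
        }
    }

The patch additionally splits any chunk larger than
round_down(UINT_MAX, PAGE_SIZE), since an sg length is a 32-bit value.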
Signed-off-by: Wei Wang
Signed-off-by: Liang Li
Suggested-by: Michael S. Tsirkin
---
 drivers/virtio/virtio_balloon.c     | 188 ++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 172 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..6952e19 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>

 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
    /* Synchronize access/update to this struct virtio_balloon elements */
    struct mutex balloon_lock;

+   /* The xbitmap used to record balloon pages */
+   struct xb page_xb;
+
    /* The array of pfns we tell the Host about. */
    unsigned int num_pfns;
    __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
                  page_to_balloon_pfn(page) + i);
 }

+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+    unsigned int len;
+
+    virtqueue_kick(vq);
+    wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+    struct scatterlist sg;
+    unsigned int len;
+
+    sg_init_one(&sg, addr, size);
+
+    /* Detach all the used buffers from the vq */
+    while (virtqueue_get_buf(vq, &len))
+        ;
+
+    return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static int send_balloon_page_sg(struct virtio_balloon *vb,
+                                struct virtqueue *vq,
+                                void *addr,
+                                uint32_t size,
+                                bool batch)
+{
+    int err;
+
+    err = add_one_sg(vq, addr, size);
+
+    /* If batching is requested, we batch till the vq is full */
+    if (!batch || !vq->num_free)
+        kick_and_wait(vq, vb->acked);
+
+    return err;
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+                          struct virtqueue *vq,
+                          unsigned long page_xb_start,
+                          unsigned long page_xb_end)
+{
+    unsigned long sg_pfn_start, sg_pfn_end;
+    void *sg_addr;
+    uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+    int err = 0;
+
+    sg_pfn_start = page_xb_start;
+    while (sg_pfn_start < page_xb_end) {
+        sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
+                                            page_xb_end);
+        if (sg_pfn_start == page_xb_end + 1)
+            break;
+        sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+                                           sg_pfn_start + 1,
+                                           page_xb_end);
+        sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+        sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+        while (sg_len > sg_max_len) {
+            err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
+                                       true);
+            if (unlikely(err < 0))
+                goto err_out;
+            sg_addr += sg_max_len;
+            sg_len -= sg_max_len;
+        }
+        err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
+        if (unlikely(err < 0))
+            goto err_out;
+        sg_pfn_start = sg_pfn_end + 1;
+    }
+
+    /*
+     * The last few sgs may not reach the batch size, but need a kick to
+     * notify the device to handle them.
+     */
+    if (vq->num_free != virtqueue_get_vring_size(vq))
+        kick_and_wait(vq, vb->acked);
+
+    xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+    return;
+
+err_out:
+    dev_warn(&vb->vdev->dev, "%s failure: %d\n", __func__, err);
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+                               struct page *page,
+                               unsigned long *pfn_min,
+                               unsigned long *pfn_max)
+{
+    unsigned long pfn = page_to_pfn(page);
+
+    *pfn_min = min(pfn, *pfn_min);
+    *pfn_max = max(pfn, *pfn_max);
+    xb_preload(GFP_KERNEL);
+    xb_set_bit(&vb->page_xb, pfn);
+    xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
    struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
    unsigned num_allocated_pages;
+   bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+   unsigned long pfn_max = 0, pfn_min = ULONG_MAX;

    /* We can only do one array worth at a time. */
-   num = min(num, ARRAY_SIZE(vb->pfns));
+   if (!use_sg)
+       num = min(num, ARRAY_SIZE(vb->pfns));

    mutex_lock(&vb->balloon_lock);
    for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +282,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
            msleep(200);
            break;
        }
-       set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+       if (use_sg)
+           xb_set_page(vb, page, &pfn_min, &pfn_max);
+       else
+           set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
        vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
        if (!virtio_has_feature(vb->vdev,
                    VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +296,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)

    num_allocated_pages = vb->num_pfns;
    /* Did we get any? */
-   if (vb->num_pfns != 0)
-       tell_host(vb, vb->inflate_vq);
+   if (vb->num_pfns) {
+       if (use_sg)
+           tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+       else
+           tell_host(vb, vb->inflate_vq);
+   }
    mutex_unlock(&vb->balloon_lock);

    return num_allocated_pages;
@@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
    struct page *page;
    struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
    LIST_HEAD(pages);
+   bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+   unsigned long pfn_max = 0, pfn_min = ULONG_MAX;

-   /* We can only do one array worth at a time. */
-   num = min(num, ARRAY_SIZE(vb->pfns));
+   /* Traditionally, we can only do one array worth at a time. */
+   if (!use_sg)
+       num = min(num, ARRAY_SIZE(vb->pfns));

    mutex_lock(&vb->balloon_lock);
    /* We can't release more pages than taken */
@@ -210,7 +342,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
        page = balloon_page_dequeue(vb_dev_info);
        if (!page)
            break;
-       set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+       if (use_sg)
+           xb_set_page(vb, page, &pfn_min, &pfn_max);
+       else
+           set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
        list_add(&page->lru, &pages);
        vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
    }
@@ -221,8 +357,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
     * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
     * is true, we *have* to do it in this order
     */
-   if (vb->num_pfns != 0)
-       tell_host(vb, vb->deflate_vq);
+   if (vb->num_pfns) {
+       if (use_sg)
+           tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+       else
+           tell_host(vb, vb->deflate_vq);
+   }
    release_pages_balloon(vb, &pages);
    mutex_unlock(&vb->balloon_lock);
    return num_freed_pages;
@@ -441,6 +581,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }

 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  * a compaction thread. (called under page lock)
@@ -464,6 +605,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
    struct virtio_balloon *vb = container_of(vb_dev_info,
            struct virtio_balloon, vb_dev_info);
+   bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
    unsigned long flags;

    /*
@@ -485,16 +627,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
    vb_dev_info->isolated_pages--;
    __count_vm_event(BALLOON_MIGRATE);
    spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-   vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-   set_page_pfns(vb, vb->pfns, newpage);
-   tell_host(vb, vb->inflate_vq);
-
+   if (use_sg) {
+       send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+                            PAGE_SIZE, false);
+   } else {
+       vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+       set_page_pfns(vb, vb->pfns, newpage);
+       tell_host(vb, vb->inflate_vq);
+   }
    /* balloon's page migration 2nd step -- deflate "page" */
    balloon_page_delete(page);
-   vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-   set_page_pfns(vb, vb->pfns, page);
-   tell_host(vb, vb->deflate_vq);
-
+   if (use_sg) {
+       send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+                            PAGE_SIZE, false);
+   } else {
+       vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+       set_page_pfns(vb, vb->pfns, page);
+       tell_host(vb, vb->deflate_vq);
+   }
    mutex_unlock(&vb->balloon_lock);

    put_page(page); /* balloon reference */
@@ -553,6 +703,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
    if (err)
        goto out_free_vb;

+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+       xb_init(&vb->page_xb);
+
    vb->nb.notifier_call = virtballoon_oom_notify;
    vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
    err = register_oom_notifier(&vb->nb);
@@ -669,6 +822,7 @@ static unsigned int features[] = {
    VIRTIO_BALLOON_F_MUST_TELL_HOST,
    VIRTIO_BALLOON_F_STATS_VQ,
    VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+   VIRTIO_BALLOON_F_SG,
 };

 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */

 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
--
2.7.4


From nobody Sun May 5 04:38:25 2024
From: Wei Wang
Date: Sat, 30 Sep 2017 12:05:53 +0800
Message-Id: <1506744354-20979-5-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1506744354-20979-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v16 4/5] mm: support reporting free page blocks

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.
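To make the callback contract concrete, the sketch below shows a
hypothetical report_pfn_range() implementation matching the signature
added in this patch (struct free_page_ctx and count_free_range() are
made-up names, not part of the patch); per the kernel-doc below, the
callback must not sleep or allocate, so it only counts:

    /*
     * Hypothetical caller sketch. The callback only counts free pages;
     * it must not sleep or allocate, per the rules documented below.
     */
    struct free_page_ctx {
        unsigned long free_pages;
        unsigned long max_pages;
    };

    static bool count_free_range(void *opaque, unsigned long pfn,
                                 unsigned long num)
    {
        struct free_page_ctx *ctx = opaque;

        ctx->free_pages += num;
        /* Returning false stops the walk early. */
        return ctx->free_pages < ctx->max_pages;
    }

    /* Usage: walk_free_mem_block(&ctx, 0, count_free_range); */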
One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.

Signed-off-by: Wei Wang
Signed-off-by: Liang Li
Cc: Michal Hocko
Cc: Michael S. Tsirkin
Acked-by: Michal Hocko
---
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..d9652c2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long *zones_size,
        unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);

+extern void walk_free_mem_block(void *opaque,
+                               int min_order,
+                               bool (*report_pfn_range)(void *opaque,
+                                                        unsigned long pfn,
+                                                        unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..c6bb874 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
    show_swap_cache_info();
 }

+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+                                struct zone *zone,
+                                int order,
+                                enum migratetype mt,
+                                bool (*report_pfn_range)(void *,
+                                                         unsigned long,
+                                                         unsigned long))
+{
+    struct page *page;
+    struct list_head *list;
+    unsigned long pfn, flags;
+    bool ret = true;
+
+    spin_lock_irqsave(&zone->lock, flags);
+    list = &zone->free_area[order].free_list[mt];
+    list_for_each_entry(page, list, lru) {
+        pfn = page_to_pfn(page);
+        ret = report_pfn_range(opaque, pfn, 1 << order);
+        if (!ret)
+            break;
+    }
+    spin_unlock_irqrestore(&zone->lock, flags);
+
+    return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations, like
+ * ballooning. This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+                         int min_order,
+                         bool (*report_pfn_range)(void *opaque,
+                                                  unsigned long pfn,
+                                                  unsigned long num))
+{
+    struct zone *zone;
+    int order;
+    enum migratetype mt;
+    bool ret;
+
+    for_each_populated_zone(zone) {
+        for (order = MAX_ORDER - 1; order >= min_order; order--) {
+            for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+                ret = walk_free_page_list(opaque, zone,
+                                          order, mt,
+                                          report_pfn_range);
+                if (!ret)
+                    return;
+            }
+        }
+    }
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
    zoneref->zone = zone;
--
2.7.4


From nobody Sun May 5 04:38:25 2024
From: Wei Wang
Date: Sat, 30 Sep 2017 12:05:54 +0800
Message-Id: <1506744354-20979-6-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1506744354-20979-1-git-send-email-wei.w.wang@intel.com>
From: Wei Wang
Date: Sat, 30 Sep 2017 12:05:54 +0800
Message-Id: <1506744354-20979-6-git-send-email-wei.w.wang@intel.com>
In-Reply-To: <1506744354-20979-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

Add a new vq, ctrl_vq, to handle commands between the host and guest.
With this feature, the control plane and the data plane are separated:
the control-related commands of each feature are sent via the ctrl_vq,
while each feature may have its own vq used as a data plane.

Free page report is the first new feature controlled via ctrl_vq, and
a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
Currently, this feature has two cmds:

VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
to start the free page reporting work.

VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
sends it to the host to indicate that the reporting work is done; the
host sends it to the guest to actively request that the reporting work
stop.

The free_page_vq is used to transmit the guest's free page blocks to
the host.

Signed-off-by: Wei Wang
Signed-off-by: Liang Li
Cc: Michael S. Tsirkin
Cc: Michal Hocko
---
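As a rough illustration of the command format (not part of this patch),
each ctrlq message is just a two-word class/cmd pair. The dispatch helper
and the short constant names below are invented for this sketch, and the
endianness handling the driver does with virtio32_to_cpu() is elided:

#include <stdbool.h>
#include <stdint.h>

/* Mirrors struct virtio_balloon_ctrlq_cmd from the uapi header below
 * (which uses __virtio32 fields; endian conversion elided here). */
struct ctrlq_cmd {
        uint32_t class;         /* e.g. VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
        uint32_t cmd;           /* command within the class */
};

#define CLASS_FREE_PAGE         0
#define FREE_PAGE_F_STOP        0
#define FREE_PAGE_F_START       1

/* Hypothetical guest-side dispatch: START clears the stop flag so the
 * report work runs; STOP raises the flag so the walk ends early. */
static bool dispatch_ctrlq_cmd(const struct ctrlq_cmd *c, bool *stop_flag)
{
        if (c->class != CLASS_FREE_PAGE)
                return false;   /* unknown class: caller warns and drops */

        if (c->cmd == FREE_PAGE_F_START)
                *stop_flag = false;
        else if (c->cmd == FREE_PAGE_F_STOP)
                *stop_flag = true;

        return true;
}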
 drivers/virtio/virtio_balloon.c     | 249 ++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  15 +++
 2 files changed, 244 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6952e19..70dc4ae 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
+			 *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The work items submitted to the balloon wq are listed here */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +71,9 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -93,6 +102,11 @@ struct virtio_balloon {
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
+
+	/* Host to guest ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
+	/* Guest to host ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
 	return err;
 }
 
+static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	int ret = 0;
+
+	/*
+	 * Since this is an optimization feature, losing a couple of free
+	 * pages to report isn't important. We simply return without adding
+	 * the page if the vq is full.
+	 */
+	if (vq->num_free) {
+		ret = add_one_sg(vq, addr, size);
+		if (!ret)
+			virtqueue_kick(vq);
+	}
+
+	return ret;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
 	queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return false;
+
+	/* If the vq is broken, stop reporting the free pages. */
+	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
+		return false;
+
+	return true;
+}
+
+static void ctrlq_add_cmd(struct virtqueue *vq,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
+	if (inbuf)
+		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+	else
+		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+}
+
+static void ctrlq_send_cmd(struct virtio_balloon *vb,
+			   struct virtio_balloon_ctrlq_cmd *cmd,
+			   bool inbuf)
+{
+	struct virtqueue *vq = vb->ctrl_vq;
+
+	ctrlq_add_cmd(vq, cmd, inbuf);
+	if (!inbuf) {
+		/*
+		 * All the input cmd buffers are replenished here.
+		 * This is necessary because the input cmd buffers are lost
+		 * after live migration. The device needs to rewind all of
+		 * them from the ctrl_vq.
+		 */
+		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
+	}
+	virtqueue_kick(vq);
+}
 
+static void report_free_page_end(struct virtio_balloon *vb)
+{
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+
+	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
+	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
+	vb->report_free_page_stop = true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_end(vb);
+}
+
+static void ctrlq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_ctrlq_cmd *msg;
+	unsigned int class, cmd, len;
+
+	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
+	if (unlikely(!msg))
+		return;
+
+	/* The outbuf is sent by the host for recycling, so just return. */
+	if (msg == &vb->free_page_cmd_out)
+		return;
+
+	class = virtio32_to_cpu(vb->vdev, msg->class);
+	cmd = virtio32_to_cpu(vb->vdev, msg->cmd);
+
+	switch (class) {
+	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
+		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
+			vb->report_free_page_stop = true;
+		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
+			vb->report_free_page_stop = false;
+			queue_work(vb->balloon_wq, &vb->report_free_page_work);
+		}
+		vb->free_page_cmd_in.class =
+				VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
+			 __func__);
+	}
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	/* If ctrlq is enabled, the free page vq will also be created */
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
+		nvqs += 2;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		callbacks[i] = ctrlq_handle;
+		names[i++] = "ctrlq";
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->ctrl_vq = vqs[i++];
+		vb->free_page_vq = vqs[i];
+		/* Prime the ctrlq with an inbuf for the host to send a cmd */
+		vb->free_page_cmd_in.class =
+				VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -823,6 +1031,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CTRL_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..dbf0616 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG	3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,18 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+enum {
+	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
+	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
+};
+
+struct virtio_balloon_ctrlq_cmd {
+	__virtio32 class;
+	__virtio32 cmd;
+};
+
+/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
+#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
+#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4
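To recap how the pieces above fit together: the guest-to-host STOP
notification reduces to filling the two-word command and queueing it as
an outbuf on ctrl_vq. Below is a simplified sketch of what
report_free_page_end() and ctrlq_send_cmd() do in the patch; the inbuf
replenishing and error handling are omitted, and the cpu_to_virtio32()
conversions are added here as the outgoing path would normally want
them (the patch assigns the raw values):

static void send_stop_cmd(struct virtio_balloon *vb)
{
        struct scatterlist sg;

        vb->free_page_cmd_out.class =
                cpu_to_virtio32(vb->vdev,
                                VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE);
        vb->free_page_cmd_out.cmd =
                cpu_to_virtio32(vb->vdev, VIRTIO_BALLOON_FREE_PAGE_F_STOP);

        sg_init_one(&sg, &vb->free_page_cmd_out,
                    sizeof(vb->free_page_cmd_out));
        /* Kick only if the buffer was queued successfully. */
        if (!virtqueue_add_outbuf(vb->ctrl_vq, &sg, 1,
                                  &vb->free_page_cmd_out, GFP_KERNEL))
                virtqueue_kick(vb->ctrl_vq);
}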