From: Xu Zhao
To: maz@kernel.org, oliver.upton@linux.dev, james.morse@arm.com
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Xu Zhao
Subject: [RFC v2] KVM: arm/arm64: optimize vSGI injection performance
Date: Fri, 25 Aug 2023 09:58:11 +0800
Message-Id: <20230825015811.5292-1-zhaoxu.35@bytedance.com>

In a VM with more
than 16 vCPUs (i.e. with multiple aff0 groups), if the target vCPU of a
vSGI is beyond the 16th vCPU, KVM has to iterate from vCPU0 until it finds
the target vCPU. However, the affinity routing information provided by the
ICC_SGI* register allows KVM to bypass the other aff0 groups and iterate
only over the aff0 group that contains the target vCPU. This reduces the
maximum number of iterations from the total number of vCPUs to at most 16,
or even 8.

This patch optimizes vSGI injection performance for targets beyond the
16th vCPU in VMs with more than 16 vCPUs.

Test data follows.

Test environment:
  Host kernel:  v6.5
  Guest kernel: v5.4.143
  Benchmark:    ipi_benchmark,
                https://patchwork.kernel.org/project/linux-arm-kernel/patch/20171211141600.24401-1-ynorov@caviumnetworks.com
  Run times:    each case runs 5 * 100000 times

4 cores with vcpu pinning:

|    |                 |            ipi benchmark              |      vgic_v3_dispatch_sgi        |
| No |                 | original     | with patch   | improved | original | with patch | improved |
| 0  | vcpu0 -> vcpu1  | 222994694 ns | 208198673 ns | +6.6%    | 1505 ns  | 1215 ns    | +19.3%   |
| 1  | vcpu0 -> vcpu3  | 216790218 ns | 198613251 ns | +8.4%    | 1266 ns  | 1174 ns    | +7.3%    |

32 cores with vcpu pinning:

|    |                 |            ipi benchmark              |      vgic_v3_dispatch_sgi        |
| No |                 | original     | with patch   | improved | original | with patch | improved |
| 2  | vcpu0 -> vcpu1  | 205954986 ns | 208735352 ns | -1.3%    | 1655 ns  | 1258 ns    | +24.0%   |
| 3  | vcpu0 -> vcpu15 | 327822710 ns | 268791736 ns | +18.0%   | 2053 ns  | 1591 ns    | +22.5%   |
| 4  | vcpu0 -> vcpu16 | 319203289 ns | 265857795 ns | +16.7%   | 2080 ns  | 1612 ns    | +22.5%   |
| 5  | vcpu0 -> vcpu31 | 399790803 ns | 316207724 ns | +20.9%   | 2426 ns  | 1511 ns    | +37.7%   |

The results show that VMs with fewer than 16 vCPUs perform the same as
before. For the VM with 32 cores the improvement is clearly visible: when
injecting an SGI into the first vCPU of the first aff0 group, performance
is unchanged (the number of iterations is still 1), but injecting into the
last vCPU of that group already improves. When injecting a vSGI into the
first or last vCPU of the second aff0 group, the improvement is significant
because, compared with the original algorithm, the first aff0 group is
skipped entirely.

The improvement can also be observed with the micro-bench in kvm-unit-tests
after a small modification: add initialization for 32 cores, then change
the IPI target CPU in ipi_exec().

The more vCPUs a VM has, the greater the improvement when injecting a vSGI
into a vCPU in the last aff0 group.

Signed-off-by: Xu Zhao
---
 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 152 ++++++++++++++---------------
 include/linux/kvm_host.h           |   5 +
 2 files changed, 78 insertions(+), 79 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
index 188d2187eede..af8f2d6b18c3 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
@@ -1013,44 +1013,64 @@ int vgic_v3_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr)
 
 	return 0;
 }
+
 /*
- * Compare a given affinity (level 1-3 and a level 0 mask, from the SGI
- * generation register ICC_SGI1R_EL1) with a given VCPU.
- * If the VCPU's MPIDR matches, return the level0 affinity, otherwise
- * return -1.
+ * Get affinity routing index from ICC_SGI_* register
+ * format:
+ *    aff3       aff2       aff1       aff0
+ * |- 8 bits -|- 8 bits -|- 8 bits -|- 4 bits -|
  */
-static int match_mpidr(u64 sgi_aff, u16 sgi_cpu_mask, struct kvm_vcpu *vcpu)
+static unsigned long sgi_to_affinity(unsigned long reg)
 {
-	unsigned long affinity;
-	int level0;
+	u64 aff;
 
-	/*
-	 * Split the current VCPU's MPIDR into affinity level 0 and the
-	 * rest as this is what we have to compare against.
-	 */
-	affinity = kvm_vcpu_get_mpidr_aff(vcpu);
-	level0 = MPIDR_AFFINITY_LEVEL(affinity, 0);
-	affinity &= ~MPIDR_LEVEL_MASK;
+	/* aff3 - aff1 */
+	aff = (((reg) & ICC_SGI1R_AFFINITY_3_MASK) >> ICC_SGI1R_AFFINITY_3_SHIFT) << 16 |
+	      (((reg) & ICC_SGI1R_AFFINITY_2_MASK) >> ICC_SGI1R_AFFINITY_2_SHIFT) << 8 |
+	      (((reg) & ICC_SGI1R_AFFINITY_1_MASK) >> ICC_SGI1R_AFFINITY_1_SHIFT);
 
-	/* bail out if the upper three levels don't match */
-	if (sgi_aff != affinity)
-		return -1;
+	/* aff0, the length of targetlist in sgi register is 16, which is 4bit */
+	aff <<= 4;
 
-	/* Is this VCPU's bit set in the mask ? */
-	if (!(sgi_cpu_mask & BIT(level0)))
-		return -1;
-
-	return level0;
+	return aff;
 }
 
 /*
- * The ICC_SGI* registers encode the affinity differently from the MPIDR,
- * so provide a wrapper to use the existing defines to isolate a certain
- * affinity level.
+ * inject a vsgi to vcpu
  */
-#define SGI_AFFINITY_LEVEL(reg, level) \
-	((((reg) & ICC_SGI1R_AFFINITY_## level ##_MASK) \
-	>> ICC_SGI1R_AFFINITY_## level ##_SHIFT) << MPIDR_LEVEL_SHIFT(level))
+static inline void vgic_v3_inject_sgi(struct kvm_vcpu *vcpu, int sgi, bool allow_group1)
+{
+	struct vgic_irq *irq;
+	unsigned long flags;
+
+	irq = vgic_get_irq(vcpu->kvm, vcpu, sgi);
+
+	raw_spin_lock_irqsave(&irq->irq_lock, flags);
+
+	/*
+	 * An access targeting Group0 SGIs can only generate
+	 * those, while an access targeting Group1 SGIs can
+	 * generate interrupts of either group.
+	 */
+	if (!irq->group || allow_group1) {
+		if (!irq->hw) {
+			irq->pending_latch = true;
+			vgic_queue_irq_unlock(vcpu->kvm, irq, flags);
+		} else {
+			/* HW SGI? Ask the GIC to inject it */
+			int err;
+			err = irq_set_irqchip_state(irq->host_irq,
+						    IRQCHIP_STATE_PENDING,
+						    true);
+			WARN_RATELIMIT(err, "IRQ %d", irq->host_irq);
+			raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
+		}
+	} else {
+		raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
+	}
+
+	vgic_put_irq(vcpu->kvm, irq);
+}
 
 /**
  * vgic_v3_dispatch_sgi - handle SGI requests from VCPUs
@@ -1071,74 +1091,48 @@ void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg, bool allow_group1)
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_vcpu *c_vcpu;
 	u16 target_cpus;
-	u64 mpidr;
 	int sgi;
 	int vcpu_id = vcpu->vcpu_id;
 	bool broadcast;
-	unsigned long c, flags;
+	unsigned long c, aff_index;
 
 	sgi = (reg & ICC_SGI1R_SGI_ID_MASK) >> ICC_SGI1R_SGI_ID_SHIFT;
 	broadcast = reg & BIT_ULL(ICC_SGI1R_IRQ_ROUTING_MODE_BIT);
 	target_cpus = (reg & ICC_SGI1R_TARGET_LIST_MASK) >> ICC_SGI1R_TARGET_LIST_SHIFT;
-	mpidr = SGI_AFFINITY_LEVEL(reg, 3);
-	mpidr |= SGI_AFFINITY_LEVEL(reg, 2);
-	mpidr |= SGI_AFFINITY_LEVEL(reg, 1);
 
 	/*
-	 * We iterate over all VCPUs to find the MPIDRs matching the request.
-	 * If we have handled one CPU, we clear its bit to detect early
-	 * if we are already finished. This avoids iterating through all
-	 * VCPUs when most of the times we just signal a single VCPU.
+	 * Writing IRM bit is not a frequent behavior, so separate SGI injection into two parts.
+	 * If it is not broadcast, compute the affinity routing index first,
+	 * then iterate targetlist to find the target VCPU.
+	 * Or, inject sgi to all VCPUs but the calling one.
 	 */
-	kvm_for_each_vcpu(c, c_vcpu, kvm) {
-		struct vgic_irq *irq;
-
-		/* Exit early if we have dealt with all requested CPUs */
-		if (!broadcast && target_cpus == 0)
-			break;
+	if (likely(!broadcast)) {
+		/* compute affinity routing index */
+		aff_index = sgi_to_affinity(reg);
 
-		/* Don't signal the calling VCPU */
-		if (broadcast && c == vcpu_id)
-			continue;
-
-		if (!broadcast) {
-			int level0;
+		/* exit if meet a wrong affinity value */
+		if (aff_index >= atomic_read(&kvm->online_vcpus))
+			return;
 
-			level0 = match_mpidr(mpidr, target_cpus, c_vcpu);
-			if (level0 == -1)
+		/* Iterate target list */
+		kvm_for_each_target_list(c, target_cpus) {
+			if (!(target_cpus & (1 << c)))
 				continue;
 
-			/* remove this matching VCPU from the mask */
-			target_cpus &= ~BIT(level0);
-		}
+			c_vcpu = kvm_get_vcpu_by_id(kvm, aff_index+c);
+			if (!c_vcpu)
+				break;
 
-		irq = vgic_get_irq(vcpu->kvm, c_vcpu, sgi);
-
-		raw_spin_lock_irqsave(&irq->irq_lock, flags);
-
-		/*
-		 * An access targeting Group0 SGIs can only generate
-		 * those, while an access targeting Group1 SGIs can
-		 * generate interrupts of either group.
-		 */
-		if (!irq->group || allow_group1) {
-			if (!irq->hw) {
-				irq->pending_latch = true;
-				vgic_queue_irq_unlock(vcpu->kvm, irq, flags);
-			} else {
-				/* HW SGI? Ask the GIC to inject it */
-				int err;
-				err = irq_set_irqchip_state(irq->host_irq,
-							    IRQCHIP_STATE_PENDING,
-							    true);
-				WARN_RATELIMIT(err, "IRQ %d", irq->host_irq);
-				raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
-			}
-		} else {
-			raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
+			vgic_v3_inject_sgi(c_vcpu, sgi, allow_group1);
 		}
+	} else {
+		kvm_for_each_vcpu(c, c_vcpu, kvm) {
+			/* don't signal the calling vcpu */
+			if (c_vcpu->vcpu_id == vcpu_id)
+				continue;
 
-		vgic_put_irq(vcpu->kvm, irq);
+			vgic_v3_inject_sgi(c_vcpu, sgi, allow_group1);
+		}
 	}
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..9b4afea7a1ee 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -910,6 +910,11 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
 			  (atomic_read(&kvm->online_vcpus) - 1))
 
+#define kvm_for_each_target_list(idx, target_cpus) \
+	for (idx = target_cpus & 0xff ? 0 : (ICC_SGI1R_AFFINITY_1_SHIFT>>1); \
+	     (1 << idx) <= target_cpus; \
+	     idx++)
+
 static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
 {
 	struct kvm_vcpu *vcpu = NULL;
-- 
2.20.1
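
P.S. for reviewers less familiar with the ICC_SGI1R layout: the core of the
change is the mapping from the register's affinity fields to a linear vcpu
index, where aff3..aff1 are packed into a group number and shifted left by 4
because each aff0 group carries a 16-bit target list. The standalone
userspace sketch below mirrors that computation; it is not part of the patch,
and the SGI1R_* macros, sgi_to_affinity_index() and the values in main() are
local to this example (the shifts follow the architectural ICC_SGI1R_EL1
field offsets), not kernel symbols.

	/*
	 * Illustration only: userspace mirror of the patch's
	 * sgi_to_affinity() mapping, assuming aff1 at bits [23:16],
	 * aff2 at [39:32], aff3 at [55:48], target list at [15:0].
	 */
	#include <stdint.h>
	#include <stdio.h>

	#define SGI1R_AFF1_SHIFT 16
	#define SGI1R_AFF2_SHIFT 32
	#define SGI1R_AFF3_SHIFT 48
	#define SGI1R_AFF_MASK   0xffULL

	/* Pack aff3..aff1 into a group number, then scale by the group size. */
	static uint64_t sgi_to_affinity_index(uint64_t reg)
	{
		uint64_t aff;

		aff = ((reg >> SGI1R_AFF3_SHIFT) & SGI1R_AFF_MASK) << 16 |
		      ((reg >> SGI1R_AFF2_SHIFT) & SGI1R_AFF_MASK) << 8 |
		      ((reg >> SGI1R_AFF1_SHIFT) & SGI1R_AFF_MASK);

		return aff << 4;	/* each aff0 group holds 16 vcpu_ids */
	}

	int main(void)
	{
		/* aff1 = 1, target list bit 3: second aff0 group, 4th CPU in it */
		uint64_t reg = (1ULL << SGI1R_AFF1_SHIFT) | (1ULL << 3);
		uint64_t base = sgi_to_affinity_index(reg);

		/* prints "base 16, vcpu_id 19" */
		printf("base %llu, vcpu_id %llu\n",
		       (unsigned long long)base,
		       (unsigned long long)(base + 3));
		return 0;
	}

With this index in hand, the dispatch path only has to walk the (at most 16)
bits of the target list instead of every vCPU in the VM, which is where the
measured improvement for vCPUs beyond the first aff0 group comes from.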