From nobody Mon Feb  9 04:00:20 2026
Delivered-To: importer@patchew.org
Authentication-Results: mx.zohomail.com;
	dkim=fail  header.i=@quicinc.com;
	spf=pass (zohomail.com: domain of gnu.org designates 209.51.188.17 as
 permitted sender)
  smtp.mailfrom=qemu-devel-bounces+importer=patchew.org@nongnu.org;
	dmarc=fail(p=none dis=none)  header.from=quicinc.com
Return-Path: <qemu-devel-bounces+importer=patchew.org@nongnu.org>
Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) by
 mx.zohomail.com
	with SMTPS id 1625528508789275.8028091354789;
 Mon, 5 Jul 2021 16:41:48 -0700 (PDT)
Received: from localhost ([::1]:45834 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <qemu-devel-bounces+importer=patchew.org@nongnu.org>)
	id 1m0YDj-0000X2-3Q
	for importer@patchew.org; Mon, 05 Jul 2021 19:41:47 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:35544)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <tsimpson@qualcomm.com>)
 id 1m0Y77-0003kF-R7
 for qemu-devel@nongnu.org; Mon, 05 Jul 2021 19:34:57 -0400
Received: from alexa-out-sd-01.qualcomm.com ([199.106.114.38]:37352)
 by eggs.gnu.org with esmtps (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.90_1) (envelope-from <tsimpson@qualcomm.com>)
 id 1m0Y74-0004a6-3A
 for qemu-devel@nongnu.org; Mon, 05 Jul 2021 19:34:57 -0400
Received: from unknown (HELO ironmsg04-sd.qualcomm.com) ([10.53.140.144])
 by alexa-out-sd-01.qualcomm.com with ESMTP; 05 Jul 2021 16:34:39 -0700
Received: from vu-tsimpson-aus.qualcomm.com (HELO
 vu-tsimpson1-aus.qualcomm.com) ([10.222.150.1])
 by ironmsg04-sd.qualcomm.com with ESMTP; 05 Jul 2021 16:34:38 -0700
Received: by vu-tsimpson1-aus.qualcomm.com (Postfix, from userid 47164)
 id 11D6522A5; Mon,  5 Jul 2021 18:34:38 -0500 (CDT)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple;
 d=quicinc.com; i=@quicinc.com; q=dns/txt; s=qcdkim;
 t=1625528094; x=1657064094;
 h=from:to:cc:subject:date:message-id:in-reply-to:
 references:mime-version:content-transfer-encoding;
 bh=bLHVoIGHXBHI5Wdfp8l0N/9csUZ0BOyDvX8HNY2ytLI=;
 b=giIvgvG2aNS0HEsjvnJrIbJ0VNKhUk1qGDFwEL2eXokrzwo0jN1vmZVX
 KixOsUdicfp00fTXivQvkdeuS0jx1vNU8q0Mk28ZpgcW+RXYjrViy4D+O
 p8VM/w9C4JUefkrMLQt/xGIstA+CJe1iPlu7UdyYkeOAxj+6TeyXlgqk0 E=;
X-QCInternal: smtphost
From: Taylor Simpson <tsimpson@quicinc.com>
To: qemu-devel@nongnu.org
Subject: [PATCH 19/20] Hexagon HVX (tests/tcg/hexagon) scatter_gather test
Date: Mon,  5 Jul 2021 18:34:33 -0500
Message-Id: <1625528074-19440-20-git-send-email-tsimpson@quicinc.com>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1625528074-19440-1-git-send-email-tsimpson@quicinc.com>
References: <1625528074-19440-1-git-send-email-tsimpson@quicinc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Received-SPF: pass (zohomail.com: domain of gnu.org designates 209.51.188.17
 as permitted sender) client-ip=209.51.188.17;
 envelope-from=qemu-devel-bounces+importer=patchew.org@nongnu.org;
 helo=lists.gnu.org;
Received-SPF: pass client-ip=199.106.114.38;
 envelope-from=tsimpson@qualcomm.com; helo=alexa-out-sd-01.qualcomm.com
X-Spam_score_int: -40
X-Spam_score: -4.1
X-Spam_bar: ----
X-Spam_report: (-4.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HEADER_FROM_DIFFERENT_DOMAINS=0.249,
 RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: ale@rev.ng, peter.maydell@linaro.org, bcain@quicinc.com,
 richard.henderson@linaro.org, tsimpson@quicinc.com, philmd@redhat.com
Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+importer=patchew.org@nongnu.org>
X-ZohoMail-DKIM: fail (Header signature does not verify)
X-ZM-MESSAGEID: 1625528510273100001

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 tests/tcg/hexagon/scatter_gather.c | 1011 ++++++++++++++++++++++++++++++++=
++++
 tests/tcg/hexagon/Makefile.target  |    2 +
 2 files changed, 1013 insertions(+)
 create mode 100644 tests/tcg/hexagon/scatter_gather.c

diff --git a/tests/tcg/hexagon/scatter_gather.c b/tests/tcg/hexagon/scatter=
_gather.c
new file mode 100644
index 0000000..b93eb18
--- /dev/null
+++ b/tests/tcg/hexagon/scatter_gather.c
@@ -0,0 +1,1011 @@
+/*
+ *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Res=
erved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * This example tests the HVX scatter/gather instructions
+ *
+ * See section 5.13 of the V68 HVX Programmer's Reference
+ *
+ * There are 3 main classes operations
+ *     _16                 16-bit elements and 16-bit offsets
+ *     _32                 32-bit elements and 32-bit offsets
+ *     _16_32              16-bit elements and 32-bit offsets
+ *
+ * There are also masked and accumulate versions
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+typedef long HVX_Vector       __attribute__((__vector_size__(128)))
+                              __attribute__((aligned(128)));
+typedef long HVX_VectorPair   __attribute__((__vector_size__(256)))
+                              __attribute__((aligned(128)));
+typedef long HVX_VectorPred   __attribute__((__vector_size__(128)))
+                              __attribute__((aligned(128)));
+
+#define VSCATTER_16(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermh_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_MASKED(MASK, BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhq_128B(MASK, (int)BASE, RGN, OFF, VALS)
+#define VSCATTER_32(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermw_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_32_MASKED(MASK, BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermwq_128B(MASK, (int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_32(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhw_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_32_MASKED(MASK, BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhwq_128B(MASK, (int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_ACC(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermh_add_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_32_ACC(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermw_add_128B((int)BASE, RGN, OFF, VALS)
+#define VSCATTER_16_32_ACC(BASE, RGN, OFF, VALS) \
+    __builtin_HEXAGON_V6_vscattermhw_add_128B((int)BASE, RGN, OFF, VALS)
+
+#define VGATHER_16(DSTADDR, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermh_128B(DSTADDR, (int)BASE, RGN, OFF)
+#define VGATHER_16_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermhq_128B(DSTADDR, MASK, (int)BASE, RGN, OF=
F)
+#define VGATHER_32(DSTADDR, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermw_128B(DSTADDR, (int)BASE, RGN, OFF)
+#define VGATHER_32_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermwq_128B(DSTADDR, MASK, (int)BASE, RGN, OF=
F)
+#define VGATHER_16_32(DSTADDR, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermhw_128B(DSTADDR, (int)BASE, RGN, OFF)
+#define VGATHER_16_32_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
+    __builtin_HEXAGON_V6_vgathermhwq_128B(DSTADDR, MASK, (int)BASE, RGN, O=
FF)
+
+#define VSHUFF_H(V) \
+    __builtin_HEXAGON_V6_vshuffh_128B(V)
+#define VSPLAT_H(X) \
+    __builtin_HEXAGON_V6_lvsplath_128B(X)
+#define VAND_VAL(PRED, VAL) \
+    __builtin_HEXAGON_V6_vandvrt_128B(PRED, VAL)
+#define VDEAL_H(V) \
+    __builtin_HEXAGON_V6_vdealh_128B(V)
+
+int err;
+
+/* define the number of rows/cols in a square matrix */
+#define MATRIX_SIZE 64
+
+/* define the size of the scatter buffer */
+#define SCATTER_BUFFER_SIZE (MATRIX_SIZE * MATRIX_SIZE)
+
+/* fake vtcm - put buffers together and force alignment */
+static struct {
+    unsigned short vscatter16[SCATTER_BUFFER_SIZE];
+    unsigned short vgather16[MATRIX_SIZE];
+    unsigned int   vscatter32[SCATTER_BUFFER_SIZE];
+    unsigned int   vgather32[MATRIX_SIZE];
+    unsigned short vscatter16_32[SCATTER_BUFFER_SIZE];
+    unsigned short vgather16_32[MATRIX_SIZE];
+} vtcm __attribute__((aligned(0x10000)));
+
+/* declare the arrays of reference values */
+unsigned short vscatter16_ref[SCATTER_BUFFER_SIZE];
+unsigned short vgather16_ref[MATRIX_SIZE];
+unsigned int   vscatter32_ref[SCATTER_BUFFER_SIZE];
+unsigned int   vgather32_ref[MATRIX_SIZE];
+unsigned short vscatter16_32_ref[SCATTER_BUFFER_SIZE];
+unsigned short vgather16_32_ref[MATRIX_SIZE];
+
+/* declare the arrays of offsets */
+unsigned short half_offsets[MATRIX_SIZE];
+unsigned int   word_offsets[MATRIX_SIZE];
+
+/* declare the arrays of values */
+unsigned short half_values[MATRIX_SIZE];
+unsigned short half_values_acc[MATRIX_SIZE];
+unsigned short half_values_masked[MATRIX_SIZE];
+unsigned int   word_values[MATRIX_SIZE];
+unsigned int   word_values_acc[MATRIX_SIZE];
+unsigned int   word_values_masked[MATRIX_SIZE];
+
+/* declare the arrays of predicates */
+unsigned short half_predicates[MATRIX_SIZE];
+unsigned int   word_predicates[MATRIX_SIZE];
+
+/* make this big enough for all the intrinsics */
+const size_t region_len =3D sizeof(vtcm);
+
+/* optionally add sync instructions */
+#define SYNC_VECTOR 1
+
+static void sync_scatter(void *addr)
+{
+#if SYNC_VECTOR
+    /*
+     * Do the scatter release followed by a dummy load to complete the
+     * synchronization.  Normally the dummy load would be deferred as
+     * long as possible to minimize stalls.
+     */
+    asm volatile("vmem(%0 + #0):scatter_release\n" : : "r"(addr));
+    /* use volatile to force the load */
+    volatile HVX_Vector vDummy =3D *(HVX_Vector *)addr; vDummy =3D vDummy;
+#endif
+}
+
+static void sync_gather(void *addr)
+{
+#if SYNC_VECTOR
+    /* use volatile to force the load */
+    volatile HVX_Vector vDummy =3D *(HVX_Vector *)addr; vDummy =3D vDummy;
+#endif
+}
+
+/* optionally print the results */
+#define PRINT_DATA 0
+
+#define FILL_CHAR       '.'
+
+/* fill vtcm scratch with ee */
+void prefill_vtcm_scratch(void)
+{
+    memset(&vtcm, FILL_CHAR, sizeof(vtcm));
+}
+
+/* create byte offsets to be a diagonal of the matrix with 16 bit elements=
 */
+void create_offsets_values_preds_16(void)
+{
+    unsigned short half_element =3D 0;
+    unsigned short half_element_masked =3D 0;
+    char letter =3D 'A';
+    char letter_masked =3D '@';
+
+    for (int i =3D 0; i < MATRIX_SIZE; i++) {
+        half_offsets[i] =3D i * (2 * MATRIX_SIZE + 2);
+
+        half_element =3D 0;
+        half_element_masked =3D 0;
+        for (int j =3D 0; j < 2; j++) {
+            half_element |=3D letter << j * 8;
+            half_element_masked |=3D letter_masked << j * 8;
+        }
+
+        half_values[i] =3D half_element;
+        half_values_acc[i] =3D ((i % 10) << 8) + (i % 10);
+        half_values_masked[i] =3D half_element_masked;
+
+        letter++;
+        /* reset to 'A' */
+        if (letter =3D=3D 'M') {
+            letter =3D 'A';
+        }
+
+        half_predicates[i] =3D (i % 3 =3D=3D 0 || i % 5 =3D=3D 0) ? ~0 : 0;
+    }
+}
+
+/* create byte offsets to be a diagonal of the matrix with 32 bit elements=
 */
+void create_offsets_values_preds_32(void)
+{
+    unsigned int word_element =3D 0;
+    unsigned int word_element_masked =3D 0;
+    char letter =3D 'A';
+    char letter_masked =3D '&';
+
+    for (int i =3D 0; i < MATRIX_SIZE; i++) {
+        word_offsets[i] =3D i * (4 * MATRIX_SIZE + 4);
+
+        word_element =3D 0;
+        word_element_masked =3D 0;
+        for (int j =3D 0; j < 4; j++) {
+            word_element |=3D letter << j * 8;
+            word_element_masked |=3D letter_masked << j * 8;
+        }
+
+        word_values[i] =3D word_element;
+        word_values_acc[i] =3D ((i % 10) << 8) + (i % 10);
+        word_values_masked[i] =3D word_element_masked;
+
+        letter++;
+        /* reset to 'A' */
+        if (letter =3D=3D 'M') {
+            letter =3D 'A';
+        }
+
+        word_predicates[i] =3D (i % 4 =3D=3D 0 || i % 7 =3D=3D 0) ? ~0 : 0;
+    }
+}
+
+/*
+ * create byte offsets to be a diagonal of the matrix with 16 bit elements
+ * and 32 bit offsets
+ */
+void create_offsets_values_preds_16_32(void)
+{
+    unsigned short half_element =3D 0;
+    unsigned short half_element_masked =3D 0;
+    char letter =3D 'D';
+    char letter_masked =3D '$';
+
+    for (int i =3D 0; i < MATRIX_SIZE; i++) {
+        word_offsets[i] =3D i * (2 * MATRIX_SIZE + 2);
+
+        half_element =3D 0;
+        half_element_masked =3D 0;
+        for (int j =3D 0; j < 2; j++) {
+            half_element |=3D letter << j * 8;
+            half_element_masked |=3D letter_masked << j * 8;
+        }
+
+        half_values[i] =3D half_element;
+        half_values_acc[i] =3D ((i % 10) << 8) + (i % 10);
+        half_values_masked[i] =3D half_element_masked;
+
+        letter++;
+        /* reset to 'A' */
+        if (letter =3D=3D 'P') {
+            letter =3D 'D';
+        }
+
+        half_predicates[i] =3D (i % 2 =3D=3D 0 || i % 13 =3D=3D 0) ? ~0 : =
0;
+    }
+}
+
+/* scatter the 16 bit elements using intrinsics */
+void vector_scatter_16(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsets =3D *(HVX_Vector *)half_offsets;
+    HVX_Vector values =3D *(HVX_Vector *)half_values;
+
+    VSCATTER_16(&vtcm.vscatter16, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16);
+}
+
+/* scatter-accumulate the 16 bit elements using intrinsics */
+void vector_scatter_16_acc(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsets =3D *(HVX_Vector *)half_offsets;
+    HVX_Vector values =3D *(HVX_Vector *)half_values_acc;
+
+    VSCATTER_16_ACC(&vtcm.vscatter16, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16);
+}
+
+/* scatter the 16 bit elements using intrinsics */
+void vector_scatter_16_masked(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsets =3D *(HVX_Vector *)half_offsets;
+    HVX_Vector values =3D *(HVX_Vector *)half_values_masked;
+    HVX_Vector pred_reg =3D *(HVX_Vector *)half_predicates;
+    HVX_VectorPred preds =3D VAND_VAL(pred_reg, ~0);
+
+    VSCATTER_16_MASKED(preds, &vtcm.vscatter16, region_len, offsets, value=
s);
+
+    sync_scatter(vtcm.vscatter16);
+}
+
+/* scatter the 32 bit elements using intrinsics */
+void vector_scatter_32(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsetslo =3D *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi =3D *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector valueslo =3D *(HVX_Vector *)word_values;
+    HVX_Vector valueshi =3D *(HVX_Vector *)&word_values[MATRIX_SIZE / 2];
+
+    VSCATTER_32(&vtcm.vscatter32, region_len, offsetslo, valueslo);
+    VSCATTER_32(&vtcm.vscatter32, region_len, offsetshi, valueshi);
+
+    sync_scatter(vtcm.vscatter32);
+}
+
+/* scatter-acc the 32 bit elements using intrinsics */
+void vector_scatter_32_acc(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsetslo =3D *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi =3D *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector valueslo =3D *(HVX_Vector *)word_values_acc;
+    HVX_Vector valueshi =3D *(HVX_Vector *)&word_values_acc[MATRIX_SIZE / =
2];
+
+    VSCATTER_32_ACC(&vtcm.vscatter32, region_len, offsetslo, valueslo);
+    VSCATTER_32_ACC(&vtcm.vscatter32, region_len, offsetshi, valueshi);
+
+    sync_scatter(vtcm.vscatter32);
+}
+
+/* scatter the 32 bit elements using intrinsics */
+void vector_scatter_32_masked(void)
+{
+    /* copy the offsets and values to vectors */
+    HVX_Vector offsetslo =3D *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi =3D *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector valueslo =3D *(HVX_Vector *)word_values_masked;
+    HVX_Vector valueshi =3D *(HVX_Vector *)&word_values_masked[MATRIX_SIZE=
 / 2];
+    HVX_Vector pred_reglo =3D *(HVX_Vector *)word_predicates;
+    HVX_Vector pred_reghi =3D *(HVX_Vector *)&word_predicates[MATRIX_SIZE =
/ 2];
+    HVX_VectorPred predslo =3D VAND_VAL(pred_reglo, ~0);
+    HVX_VectorPred predshi =3D VAND_VAL(pred_reghi, ~0);
+
+    VSCATTER_32_MASKED(predslo, &vtcm.vscatter32, region_len, offsetslo,
+                       valueslo);
+    VSCATTER_32_MASKED(predshi, &vtcm.vscatter32, region_len, offsetshi,
+                       valueshi);
+
+    sync_scatter(vtcm.vscatter16);
+}
+
+/* scatter the 16 bit elements with 32 bit offsets using intrinsics */
+void vector_scatter_16_32(void)
+{
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+
+    /* get the word offsets in a vector pair */
+    offsets =3D *(HVX_VectorPair *)word_offsets;
+
+    /* these values need to be shuffled for the scatter */
+    values =3D *(HVX_Vector *)half_values;
+    values =3D VSHUFF_H(values);
+
+    VSCATTER_16_32(&vtcm.vscatter16_32, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16_32);
+}
+
+/* scatter-acc the 16 bit elements with 32 bit offsets using intrinsics */
+void vector_scatter_16_32_acc(void)
+{
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+
+    /* get the word offsets in a vector pair */
+    offsets =3D *(HVX_VectorPair *)word_offsets;
+
+    /* these values need to be shuffled for the scatter */
+    values =3D *(HVX_Vector *)half_values_acc;
+    values =3D VSHUFF_H(values);
+
+    VSCATTER_16_32_ACC(&vtcm.vscatter16_32, region_len, offsets, values);
+
+    sync_scatter(vtcm.vscatter16_32);
+}
+
+/* masked scatter the 16 bit elements with 32 bit offsets using intrinsics=
 */
+void vector_scatter_16_32_masked(void)
+{
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+    HVX_Vector pred_reg;
+
+    /* get the word offsets in a vector pair */
+    offsets =3D *(HVX_VectorPair *)word_offsets;
+
+    /* these values need to be shuffled for the scatter */
+    values =3D *(HVX_Vector *)half_values_masked;
+    values =3D VSHUFF_H(values);
+
+    pred_reg =3D *(HVX_Vector *)half_predicates;
+    pred_reg =3D VSHUFF_H(pred_reg);
+    HVX_VectorPred preds =3D VAND_VAL(pred_reg, ~0);
+
+    VSCATTER_16_32_MASKED(preds, &vtcm.vscatter16_32, region_len, offsets,
+                          values);
+
+    sync_scatter(vtcm.vscatter16_32);
+}
+
+/* gather the elements from the scatter16 buffer */
+void vector_gather_16(void)
+{
+    HVX_Vector *vgather =3D (HVX_Vector *)&vtcm.vgather16;
+    HVX_Vector offsets =3D *(HVX_Vector *)half_offsets;
+
+    VGATHER_16(vgather, &vtcm.vscatter16, region_len, offsets);
+
+    sync_gather(vgather);
+}
+
+static unsigned short gather_16_masked_init(void)
+{
+    char letter =3D '?';
+    return letter | (letter << 8);
+}
+
+void vector_gather_16_masked(void)
+{
+    HVX_Vector *vgather =3D (HVX_Vector *)&vtcm.vgather16;
+    HVX_Vector offsets =3D *(HVX_Vector *)half_offsets;
+    HVX_Vector pred_reg =3D *(HVX_Vector *)half_predicates;
+    HVX_VectorPred preds =3D VAND_VAL(pred_reg, ~0);
+
+    *vgather =3D VSPLAT_H(gather_16_masked_init());
+    VGATHER_16_MASKED(vgather, preds, &vtcm.vscatter16, region_len, offset=
s);
+
+    sync_gather(vgather);
+}
+
+/* gather the elements from the scatter32 buffer */
+void vector_gather_32(void)
+{
+    HVX_Vector *vgatherlo =3D (HVX_Vector *)&vtcm.vgather32;
+    HVX_Vector *vgatherhi =3D
+        (HVX_Vector *)((int)&vtcm.vgather32 + (MATRIX_SIZE * 2));
+    HVX_Vector offsetslo =3D *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi =3D *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+
+    VGATHER_32(vgatherlo, &vtcm.vscatter32, region_len, offsetslo);
+    VGATHER_32(vgatherhi, &vtcm.vscatter32, region_len, offsetshi);
+
+    sync_gather(vgatherhi);
+}
+
+static unsigned int gather_32_masked_init(void)
+{
+    char letter =3D '?';
+    return letter | (letter << 8) | (letter << 16) | (letter << 24);
+}
+
+void vector_gather_32_masked(void)
+{
+    HVX_Vector *vgatherlo =3D (HVX_Vector *)&vtcm.vgather32;
+    HVX_Vector *vgatherhi =3D
+        (HVX_Vector *)((int)&vtcm.vgather32 + (MATRIX_SIZE * 2));
+    HVX_Vector offsetslo =3D *(HVX_Vector *)word_offsets;
+    HVX_Vector offsetshi =3D *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector pred_reglo =3D *(HVX_Vector *)word_predicates;
+    HVX_VectorPred predslo =3D VAND_VAL(pred_reglo, ~0);
+    HVX_Vector pred_reghi =3D *(HVX_Vector *)&word_predicates[MATRIX_SIZE =
/ 2];
+    HVX_VectorPred predshi =3D VAND_VAL(pred_reghi, ~0);
+
+    *vgatherlo =3D VSPLAT_H(gather_32_masked_init());
+    *vgatherhi =3D VSPLAT_H(gather_32_masked_init());
+    VGATHER_32_MASKED(vgatherlo, predslo, &vtcm.vscatter32, region_len,
+                      offsetslo);
+    VGATHER_32_MASKED(vgatherhi, predshi, &vtcm.vscatter32, region_len,
+                      offsetshi);
+
+    sync_gather(vgatherlo);
+    sync_gather(vgatherhi);
+}
+
+/* gather the elements from the scatter16_32 buffer */
+void vector_gather_16_32(void)
+{
+    HVX_Vector *vgather;
+    HVX_VectorPair offsets;
+    HVX_Vector values;
+
+    /* get the vtcm address to gather from */
+    vgather =3D (HVX_Vector *)&vtcm.vgather16_32;
+
+    /* get the word offsets in a vector pair */
+    offsets =3D *(HVX_VectorPair *)word_offsets;
+
+    VGATHER_16_32(vgather, &vtcm.vscatter16_32, region_len, offsets);
+
+    /* deal the elements to get the order back */
+    values =3D *(HVX_Vector *)vgather;
+    values =3D VDEAL_H(values);
+
+    /* write it back to vtcm address */
+    *(HVX_Vector *)vgather =3D values;
+}
+
+void vector_gather_16_32_masked(void)
+{
+    HVX_Vector *vgather;
+    HVX_VectorPair offsets;
+    HVX_Vector pred_reg;
+    HVX_VectorPred preds;
+    HVX_Vector values;
+
+    /* get the vtcm address to gather from */
+    vgather =3D (HVX_Vector *)&vtcm.vgather16_32;
+
+    /* get the word offsets in a vector pair */
+    offsets =3D *(HVX_VectorPair *)word_offsets;
+    pred_reg =3D *(HVX_Vector *)half_predicates;
+    pred_reg =3D VSHUFF_H(pred_reg);
+    preds =3D VAND_VAL(pred_reg, ~0);
+
+   *vgather =3D VSPLAT_H(gather_16_masked_init());
+   VGATHER_16_32_MASKED(vgather, preds, &vtcm.vscatter16_32, region_len,
+                        offsets);
+
+    /* deal the elements to get the order back */
+    values =3D *(HVX_Vector *)vgather;
+    values =3D VDEAL_H(values);
+
+    /* write it back to vtcm address */
+    *(HVX_Vector *)vgather =3D values;
+}
+
+static void check_buffer(const char *name, void *c, void *r, size_t size)
+{
+    char *check =3D (char *)c;
+    char *ref =3D (char *)r;
+    for (int i =3D 0; i < size; i++) {
+        if (check[i] !=3D ref[i]) {
+            printf("ERROR %s [%d]: 0x%x (%c) !=3D 0x%x (%c)\n", name, i,
+                   check[i], check[i], ref[i], ref[i]);
+            err++;
+        }
+    }
+}
+
+/*
+ * These scalar functions are the C equivalents of the vector functions th=
at
+ * use HVX
+ */
+
+/* scatter the 16 bit elements using C */
+void scalar_scatter_16(unsigned short *vscatter16)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vscatter16[half_offsets[i] / 2] =3D half_values[i];
+    }
+}
+
+void check_scatter_16()
+{
+    memset(vscatter16_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16(vscatter16_ref);
+    check_buffer(__func__, vtcm.vscatter16, vscatter16_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* scatter the 16 bit elements using C */
+void scalar_scatter_16_acc(unsigned short *vscatter16)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vscatter16[half_offsets[i] / 2] +=3D half_values_acc[i];
+    }
+}
+
+void check_scatter_16_acc()
+{
+    memset(vscatter16_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16(vscatter16_ref);
+    scalar_scatter_16_acc(vscatter16_ref);
+    check_buffer(__func__, vtcm.vscatter16, vscatter16_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* scatter the 16 bit elements using C */
+void scalar_scatter_16_masked(unsigned short *vscatter16)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; i++) {
+        if (half_predicates[i]) {
+            vscatter16[half_offsets[i] / 2] =3D half_values_masked[i];
+        }
+    }
+
+}
+
+void check_scatter_16_masked()
+{
+    memset(vscatter16_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16(vscatter16_ref);
+    scalar_scatter_16_acc(vscatter16_ref);
+    scalar_scatter_16_masked(vscatter16_ref);
+    check_buffer(__func__, vtcm.vscatter16, vscatter16_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* scatter the 32 bit elements using C */
+void scalar_scatter_32(unsigned int *vscatter32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vscatter32[word_offsets[i] / 4] =3D word_values[i];
+    }
+}
+
+void check_scatter_32()
+{
+    memset(vscatter32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+    scalar_scatter_32(vscatter32_ref);
+    check_buffer(__func__, vtcm.vscatter32, vscatter32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+}
+
+/* scatter the 32 bit elements using C */
+void scalar_scatter_32_acc(unsigned int *vscatter32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vscatter32[word_offsets[i] / 4] +=3D word_values_acc[i];
+    }
+}
+
+void check_scatter_32_acc()
+{
+    memset(vscatter32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+    scalar_scatter_32(vscatter32_ref);
+    scalar_scatter_32_acc(vscatter32_ref);
+    check_buffer(__func__, vtcm.vscatter32, vscatter32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+}
+
+/* scatter the 32 bit elements using C */
+void scalar_scatter_32_masked(unsigned int *vscatter32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; i++) {
+        if (word_predicates[i]) {
+            vscatter32[word_offsets[i] / 4] =3D word_values_masked[i];
+        }
+    }
+}
+
+void check_scatter_32_masked()
+{
+    memset(vscatter32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+    scalar_scatter_32(vscatter32_ref);
+    scalar_scatter_32_acc(vscatter32_ref);
+    scalar_scatter_32_masked(vscatter32_ref);
+    check_buffer(__func__, vtcm.vscatter32, vscatter32_ref,
+                  SCATTER_BUFFER_SIZE * sizeof(unsigned int));
+}
+
+/* scatter the 32 bit elements using C */
+void scalar_scatter_16_32(unsigned short *vscatter16_32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vscatter16_32[word_offsets[i] / 2] =3D half_values[i];
+    }
+}
+
+void check_scatter_16_32()
+{
+    memset(vscatter16_32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16_32(vscatter16_32_ref);
+    check_buffer(__func__, vtcm.vscatter16_32, vscatter16_32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* scatter the 32 bit elements using C */
+void scalar_scatter_16_32_acc(unsigned short *vscatter16_32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vscatter16_32[word_offsets[i] / 2] +=3D half_values_acc[i];
+    }
+}
+
+void check_scatter_16_32_acc()
+{
+    memset(vscatter16_32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16_32(vscatter16_32_ref);
+    scalar_scatter_16_32_acc(vscatter16_32_ref);
+    check_buffer(__func__, vtcm.vscatter16_32, vscatter16_32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+void scalar_scatter_16_32_masked(unsigned short *vscatter16_32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; i++) {
+        if (half_predicates[i]) {
+            vscatter16_32[word_offsets[i] / 2] =3D half_values_masked[i];
+        }
+    }
+}
+
+void check_scatter_16_32_masked()
+{
+    memset(vscatter16_32_ref, FILL_CHAR,
+           SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+    scalar_scatter_16_32(vscatter16_32_ref);
+    scalar_scatter_16_32_acc(vscatter16_32_ref);
+    scalar_scatter_16_32_masked(vscatter16_32_ref);
+    check_buffer(__func__, vtcm.vscatter16_32, vscatter16_32_ref,
+                 SCATTER_BUFFER_SIZE * sizeof(unsigned short));
+}
+
+/* gather the elements from the scatter buffer using C */
+void scalar_gather_16(unsigned short *vgather16)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vgather16[i] =3D vtcm.vscatter16[half_offsets[i] / 2];
+    }
+}
+
+void check_gather_16()
+{
+      memset(vgather16_ref, 0, MATRIX_SIZE * sizeof(unsigned short));
+      scalar_gather_16(vgather16_ref);
+      check_buffer(__func__, vtcm.vgather16, vgather16_ref,
+                   MATRIX_SIZE * sizeof(unsigned short));
+}
+
+void scalar_gather_16_masked(unsigned short *vgather16)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        if (half_predicates[i]) {
+            vgather16[i] =3D vtcm.vscatter16[half_offsets[i] / 2];
+        }
+    }
+}
+
+void check_gather_16_masked()
+{
+    memset(vgather16_ref, gather_16_masked_init(),
+           MATRIX_SIZE * sizeof(unsigned short));
+    scalar_gather_16_masked(vgather16_ref);
+    check_buffer(__func__, vtcm.vgather16, vgather16_ref,
+                 MATRIX_SIZE * sizeof(unsigned short));
+}
+
+/* gather the elements from the scatter buffer using C */
+void scalar_gather_32(unsigned int *vgather32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vgather32[i] =3D vtcm.vscatter32[word_offsets[i] / 4];
+    }
+}
+
+void check_gather_32(void)
+{
+    memset(vgather32_ref, 0, MATRIX_SIZE * sizeof(unsigned int));
+    scalar_gather_32(vgather32_ref);
+    check_buffer(__func__, vtcm.vgather32, vgather32_ref,
+                 MATRIX_SIZE * sizeof(unsigned int));
+}
+
+void scalar_gather_32_masked(unsigned int *vgather32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        if (word_predicates[i]) {
+            vgather32[i] =3D vtcm.vscatter32[word_offsets[i] / 4];
+        }
+    }
+}
+
+
+void check_gather_32_masked(void)
+{
+    memset(vgather32_ref, gather_32_masked_init(),
+           MATRIX_SIZE * sizeof(unsigned int));
+    scalar_gather_32_masked(vgather32_ref);
+    check_buffer(__func__, vtcm.vgather32,
+                 vgather32_ref, MATRIX_SIZE * sizeof(unsigned int));
+}
+
+/* gather the elements from the scatter buffer using C */
+void scalar_gather_16_32(unsigned short *vgather16_32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        vgather16_32[i] =3D vtcm.vscatter16_32[word_offsets[i] / 2];
+    }
+}
+
+void check_gather_16_32(void)
+{
+    memset(vgather16_32_ref, 0, MATRIX_SIZE * sizeof(unsigned short));
+    scalar_gather_16_32(vgather16_32_ref);
+    check_buffer(__func__, vtcm.vgather16_32, vgather16_32_ref,
+                 MATRIX_SIZE * sizeof(unsigned short));
+}
+
+void scalar_gather_16_32_masked(unsigned short *vgather16_32)
+{
+    for (int i =3D 0; i < MATRIX_SIZE; ++i) {
+        if (half_predicates[i]) {
+            vgather16_32[i] =3D vtcm.vscatter16_32[word_offsets[i] / 2];
+        }
+    }
+
+}
+
+void check_gather_16_32_masked(void)
+{
+    memset(vgather16_32_ref, gather_16_masked_init(),
+           MATRIX_SIZE * sizeof(unsigned short));
+    scalar_gather_16_32_masked(vgather16_32_ref);
+    check_buffer(__func__, vtcm.vgather16_32, vgather16_32_ref,
+                 MATRIX_SIZE * sizeof(unsigned short));
+}
+
+/* print scatter16 buffer */
+void print_scatter16_buffer(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16 bit scatter buffer");
+
+        for (int i =3D 0; i < SCATTER_BUFFER_SIZE; i++) {
+            if ((i % MATRIX_SIZE) =3D=3D 0) {
+                printf("\n");
+            }
+            for (int j =3D 0; j < 2; j++) {
+                printf("%c", (char)((vtcm.vscatter16[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the gather 16 buffer */
+void print_gather_result_16(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16 bit gather result\n");
+
+        for (int i =3D 0; i < MATRIX_SIZE; i++) {
+            for (int j =3D 0; j < 2; j++) {
+                printf("%c", (char)((vtcm.vgather16[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the scatter32 buffer */
+void print_scatter32_buffer(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 32 bit scatter buffer");
+
+        for (int i =3D 0; i < SCATTER_BUFFER_SIZE; i++) {
+            if ((i % MATRIX_SIZE) =3D=3D 0) {
+                printf("\n");
+            }
+            for (int j =3D 0; j < 4; j++) {
+                printf("%c", (char)((vtcm.vscatter32[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the gather 32 buffer */
+void print_gather_result_32(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 32 bit gather result\n");
+
+        for (int i =3D 0; i < MATRIX_SIZE; i++) {
+            for (int j =3D 0; j < 4; j++) {
+                printf("%c", (char)((vtcm.vgather32[i] >> j * 8) & 0xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the scatter16_32 buffer */
+void print_scatter16_32_buffer(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16_32 bit scatter buffer");
+
+        for (int i =3D 0; i < SCATTER_BUFFER_SIZE; i++) {
+            if ((i % MATRIX_SIZE) =3D=3D 0) {
+                printf("\n");
+            }
+            for (int j =3D 0; j < 2; j++) {
+                printf("%c",
+                      (unsigned char)((vtcm.vscatter16_32[i] >> j * 8) & 0=
xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+/* print the gather 16_32 buffer */
+void print_gather_result_16_32(void)
+{
+    if (PRINT_DATA) {
+        printf("\n\nPrinting the 16_32 bit gather result\n");
+
+        for (int i =3D 0; i < MATRIX_SIZE; i++) {
+            for (int j =3D 0; j < 2; j++) {
+                printf("%c",
+                       (unsigned char)((vtcm.vgather16_32[i] >> j * 8) & 0=
xff));
+            }
+            printf(" ");
+        }
+        printf("\n");
+    }
+}
+
+int main()
+{
+    prefill_vtcm_scratch();
+
+    /* 16 bit elements with 16 bit offsets */
+    create_offsets_values_preds_16();
+
+    vector_scatter_16();
+    print_scatter16_buffer();
+    check_scatter_16();
+
+    vector_gather_16();
+    print_gather_result_16();
+    check_gather_16();
+
+    vector_gather_16_masked();
+    print_gather_result_16();
+    check_gather_16_masked();
+
+    vector_scatter_16_acc();
+    print_scatter16_buffer();
+    check_scatter_16_acc();
+
+    vector_scatter_16_masked();
+    print_scatter16_buffer();
+    check_scatter_16_masked();
+
+    /* 32 bit elements with 32 bit offsets */
+    create_offsets_values_preds_32();
+
+    vector_scatter_32();
+    print_scatter32_buffer();
+    check_scatter_32();
+
+    vector_gather_32();
+    print_gather_result_32();
+    check_gather_32();
+
+    vector_gather_32_masked();
+    print_gather_result_32();
+    check_gather_32_masked();
+
+    vector_scatter_32_acc();
+    print_scatter32_buffer();
+    check_scatter_32_acc();
+
+    vector_scatter_32_masked();
+    print_scatter32_buffer();
+    check_scatter_32_masked();
+
+    /* 16 bit elements with 32 bit offsets */
+    create_offsets_values_preds_16_32();
+
+    vector_scatter_16_32();
+    print_scatter16_32_buffer();
+    check_scatter_16_32();
+
+    vector_gather_16_32();
+    print_gather_result_16_32();
+    check_gather_16_32();
+
+    vector_gather_16_32_masked();
+    print_gather_result_16_32();
+    check_gather_16_32_masked();
+
+    vector_scatter_16_32_acc();
+    print_scatter16_32_buffer();
+    check_scatter_16_32_acc();
+
+    vector_scatter_16_32_masked();
+    print_scatter16_32_buffer();
+    check_scatter_16_32_masked();
+
+    puts(err ? "FAIL" : "PASS");
+    return err;
+}
diff --git a/tests/tcg/hexagon/Makefile.target b/tests/tcg/hexagon/Makefile=
.target
index 6ac9ac6..bde90e7 100644
--- a/tests/tcg/hexagon/Makefile.target
+++ b/tests/tcg/hexagon/Makefile.target
@@ -47,11 +47,13 @@ HEX_TESTS +=3D brev
 HEX_TESTS +=3D load_unpack
 HEX_TESTS +=3D load_align
 HEX_TESTS +=3D vector_add_int
+HEX_TESTS +=3D scatter_gather
 HEX_TESTS +=3D atomics
 HEX_TESTS +=3D fpstuff
 HEX_TESTS +=3D hvx_misc
=20
 TESTS +=3D $(HEX_TESTS)
=20
+scatter_gather: CFLAGS +=3D -mhvx
 vector_add_int: CFLAGS +=3D -mhvx -fvectorize
 hvx_misc: CFLAGS +=3D -mhvx
--=20
2.7.4