From nobody Mon May 11 00:10:12 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 1E582C433EF
	for <linux-kernel@archiver.kernel.org>; Wed, 20 Apr 2022 23:52:44 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1383309AbiDTXz3 (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 20 Apr 2022 19:55:29 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33478 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232336AbiDTXzY (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Apr 2022 19:55:24 -0400
Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com
 [IPv6:2607:f8b0:4864:20::114a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0FF033DDC7
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:36 -0700 (PDT)
Received: by mail-yw1-x114a.google.com with SMTP id
 00721157ae682-2d7eaa730d9so29355787b3.13
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc:content-transfer-encoding;
        bh=wXmJayhij+TwqNidzBSB+0E2/eNzdXfcp3O+/c0CbLs=;
        b=OaTB0NiuNGz3gZItyS5lGwvFOV7EZfpnX61C01bWWJjrQvw2KVEzFOtQmxRdxaSqrK
         DyZu71fHPSfuIhXETJEDfwfomhHzLen7FXo4xGDixCofaw1VUz53e+6jEJsNCFKbS4R2
         b32RTlbQVSExtwz2j01qt89K5YDAARZUOHLDBkXMIjUvYc2di8K1OGvXpUpBQA9KxVQe
         VqDWByInG2C9PEoeUTqqMWo3pWkEziJDfUyj4iIBaw0drmx47/i43FMmzuAJ665mVVL1
         mu6hSBkRK9Kj23/NVYHS3v83cv/2KEOTKYF7s8dEDCyuNULl9QHRJ2FdImYzP++gH5jq
         o9KA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc:content-transfer-encoding;
        bh=wXmJayhij+TwqNidzBSB+0E2/eNzdXfcp3O+/c0CbLs=;
        b=mdSSx5AEt6KoqVTxn9zXKBzo32JMZEG3idAsEiM0sA80GSb6+f5pQBlqJ4GQ3Sx8Yb
         wOV3IXblHz19p5OuQZPweogIQaCHBxqHUX3ADork6iYC7WF+oJJGUxpVgdCK/C9WYvGR
         h7Y8LnRvWf4k/C+dv4b3yJ4E+X/82KgCvcnmL7MTWSW+edAwv8IncYLglw7QQSlE/vKh
         dML+xR0bz7rVBVMzd+Kw0kaDOQ8XfuElr7KkzySum7RMAv0EjVZ4mCSstLR+fxVx4y79
         gq4b6pVP43i9KO0QEgRqqqmoIgrtLICJaYu1aata61KerAM5x75yDaAwy39JmyZiU6Hg
         YUaQ==
X-Gm-Message-State: AOAM530G8xAJR+qKsH3bf4Xa1fnRi8elTmrevbs68zcXlDdEnwsWusJp
        0k5OT5yZt+3nwxQ3ZU87oNZ2vSdbQC0szNo=
X-Google-Smtp-Source: 
 ABdhPJwdyO1a59Wbmq1dxswXBXqz9W9bswYyvbTDUV+1oi//Kyi7KDzuR7vBGAHyOSEgZ4kGsGhEe+dl4K2LpjM=
X-Received: from tj.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:53a])
 (user=tjmercier job=sendgmr) by 2002:a05:6902:1244:b0:644:b8a5:e195 with SMTP
 id t4-20020a056902124400b00644b8a5e195mr22014142ybu.556.1650498755305; Wed,
 20 Apr 2022 16:52:35 -0700 (PDT)
Date: Wed, 20 Apr 2022 23:52:19 +0000
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
Message-Id: <20220420235228.2767816-2-tjmercier@google.com>
Mime-Version: 1.0
References: <20220420235228.2767816-1-tjmercier@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [RFC v5 1/6] gpu: rfc: Proposal for a GPU cgroup controller
From: "T.J. Mercier" <tjmercier@google.com>
To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org,
        Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
        Maxime Ripard <mripard@kernel.org>,
        Thomas Zimmermann <tzimmermann@suse.de>,
        David Airlie <airlied@linux.ie>,
        Jonathan Corbet <corbet@lwn.net>
Cc: hridya@google.com, christian.koenig@amd.com, jstultz@google.com,
        tkjos@android.com, cmllamas@google.com, surenb@google.com,
        kaleshsingh@google.com, Kenny.Ho@amd.com, mkoutny@suse.com,
        skhan@linuxfoundation.org, kernel-team@android.com,
        dri-devel@lists.freedesktop.org, linux-doc@vger.kernel.org,
        linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

From: Hridya Valsaraju <hridya@google.com>

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-device limits on the total size of buffers
  allocated by a device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@int=
el.com/

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v5 changes
Drop the global GPU cgroup "total" (sum of all device totals) portion
of the design since there is no currently known use for this per
Tejun Heo.

Update for renamed functions/variables.

v3 changes
Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz.

Use more common dual author commit message format per John Stultz.
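
Reviewer note, illustrative only: the gpu.memory.current file documented
in this patch reports one "<name> <bytes>" pair per line. The sketch below
shows how a userspace tool might parse one such line. The names
parse_gpu_memory_line() and NAME_MAX_LEN are hypothetical and not part of
this series; the 64-byte bound is assumed to mirror
GPUCG_BUCKET_NAME_MAX_LEN from patch 2/6.

```c
#include <assert.h>
#include <stdio.h>

#define NAME_MAX_LEN 64 /* assumed to match GPUCG_BUCKET_NAME_MAX_LEN */

/*
 * Parse one "<name> <bytes>" line in the gpu.memory.current format.
 * @name must point to a buffer of at least NAME_MAX_LEN bytes.
 * Returns 0 on success, -1 if the line is malformed.
 * Hypothetical userspace helper, not part of this series.
 */
static int parse_gpu_memory_line(const char *line, char *name,
                                 unsigned long long *bytes)
{
        char extra;

        assert(line && name && bytes);

        /*
         * %63s bounds the key to NAME_MAX_LEN - 1 characters. A third
         * token makes sscanf() return 3, which is rejected here.
         */
        if (sscanf(line, "%63s %llu %c", name, bytes, &extra) != 2)
                return -1;
        return 0;
}
```

For the line "dev1 4194304" from the example in the document, this would
yield the name "dev1" and a size of 4194304 bytes. A line with a missing
value or trailing tokens is rejected.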
---
 Documentation/gpu/rfc/gpu-cgroup.rst | 190 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 194 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/g=
pu-cgroup.rst
new file mode 100644
index 000000000000..0be2a3a9f641
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,190 @@
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
+GPU cgroup controller
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
+
+Goals
+=3D=3D=3D=3D=3D
+This document outlines a plan to create a cgroup v2 controller subsystem
+for the per-cgroup accounting of device and system memory allocated by the=
 GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-device limits on the total size of buffers allocated b=
y a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgro=
up.
+
+Alternatives Considered
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. It implements a mechanism
+for the allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer shar=
ing model,
+a process takes a reference to the entire buffer (hence keeping it alive) =
even if
+it is only accessing parts of it. Therefore, per-page memory tracking for =
DMA-BUF
+memory accounting would only introduce additional overhead without any ben=
efits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9=
506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations an=
d releases.
+2. If the service process is killed or restarted, all accounting so far is=
 lost.
+
+UAPI
+=3D=3D=3D=3D
+When enabled, the new cgroup controller would create the following files i=
n every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory=
 allocations
+in a key-value format where the key is a string representing the device na=
me and the value
+is the size of memory charged to the device in the cgroup in bytes. The de=
vice name
+should be globally unique.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device=
 registers
+with the GPU cgroup controller to participate in resource accounting (see =
section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current size limits=
 on
+memory usage for each allocator/device.
+
+Setting a limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo "dev1 4194304" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D
+
+The cgroup controller would closely follow the design of the RDMA cgroup c=
ontroller
+subsystem where each cgroup maintains a list of resource pools.
+Each resource pool is associated with a device name via a pointer to a str=
uct gpucg_bucket
+and contains a page counter that tracks the current usage and the maximum =
limit set for the device.
+
+The code block below is a preliminary sketch of what the core kernel data =
structures
+and APIs would look like.
+
+.. code-block:: c
+
+        /* The GPU cgroup controller data structure */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        /* A named entity representing bucket of tracked memory. */
+        struct gpucg_bucket {
+                /* list of various resource pools in various cgroups that =
the bucket is part of */
+                struct list_head rpools;
+
+                /* list of all buckets registered for GPU cgroup accountin=
g */
+                struct list_head bucket_node;
+
+                /* string to be used as identifier for accounting and limi=
t setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The bucket whose resource usage is tracked by this reso=
urce pool */
+                struct gpucg_bucket *bucket;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /* list maintained by the gpucg_bucket to keep track of it=
s resource pools */
+                struct list_head bucket_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_bucket - Registers a bucket for memory accountin=
g using the
+         * GPU cgroup controller.
+         *
+         * @bucket: The bucket to register for memory accounting.
+         * @name: Pointer to a null-terminated string to denote the name o=
f the bucket. This name
+         *        should be globally unique, and should not exceed @GPUCG_=
BUCKET_NAME_MAX_LEN bytes.
+         *
+         * @bucket must remain valid. @name will be copied.
+         */
+        int gpucg_register_bucket(struct gpucg_bucket *bucket, const char=
 *name);
+
+        /**
+         * gpucg_charge - charge memory to the specified gpucg and gpucg_b=
ucket.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @bucket: The bucket to charge the memory to.
+         * @size: The size of memory to charge in bytes.
+         *        This size will be rounded up to the nearest page size.
+         *
+         * Return: returns 0 if the charging is successful and otherwise r=
eturns an
+         * error code.
+         */
+        int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket,=
 u64 size);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and g=
pucg_bucket.
+         * The caller must hold a reference to @gpucg obtained through gpu=
cg_get().
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @bucket: The bucket to uncharge the memory from.
+         * @size: The size of memory to uncharge in bytes.
+         *        This size will be rounded up to the nearest page size.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *buck=
et, u64 size);
+
+        /**
+         * gpucg_transfer_charge - Transfer a GPU charge from one cgroup t=
o another.
+         *
+         * @source:	[in]	The GPU cgroup the charge will be transferred fro=
m.
+         * @dest:	[in]	The GPU cgroup the charge will be transferred to.
+         * @bucket:	[in]	The GPU cgroup bucket corresponding to the charge.
+         * @size:	[in]	The size of the memory in bytes.
+         *                      This size will be rounded up to the neares=
t page size.
+         *
+         * Returns 0 on success, or a negative errno code otherwise.
+         */
+        int gpucg_transfer_charge(struct gpucg *source,
+                                  struct gpucg *dest,
+                                  struct gpucg_bucket *bucket,
+                                  u64 size);
+
+
+Future Work
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
+Additional GPU resources can be supported by adding new controller files.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.=
rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
=20
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
--=20
2.36.0.rc0.470.gd361397f0d-goog
From nobody Mon May 11 00:10:12 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 877AAC433FE
	for <linux-kernel@archiver.kernel.org>; Wed, 20 Apr 2022 23:52:48 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1383317AbiDTXzd (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 20 Apr 2022 19:55:33 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33558 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1383306AbiDTXz2 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Apr 2022 19:55:28 -0400
Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com
 [IPv6:2607:f8b0:4864:20::114a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A18A93E0C1
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:40 -0700 (PDT)
Received: by mail-yw1-x114a.google.com with SMTP id
 00721157ae682-2ec12272fb2so29235947b3.6
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=VlK/N65m/5fmcNRwSLWo0mws2lFJ6Ub3DEivUEYg98M=;
        b=tLUmazkmovB9iU+mqAK5knrhBSAPu3QjDm8b1hLdL0BVb1j0pMw8PSvaQf8re5PeSJ
         bUzsFis4ic1vlC1T6VD3o4VBQ95vY+V7WvjYmAArKb3twFoFPabqbKQkMB0uHL/kETLr
         mPick5SnaYURJyuTlp3SKEreeO+FIQKujI+tHqd9woBtbQqj6pkpZ8ALfUxRGSns90CB
         pFXW86i04HAslSUgE8CzJUkioFJUCVbEQ8Fn7San8Y8iebvj+uelJbyuJcrhnXLVUp1M
         ir9v/JKNIObXhEM6k122Ah+TjJn0dl3RHtHq/eYh3xn2Z0P92GtkwLnqJF8RdY4nr3cW
         AoLg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=VlK/N65m/5fmcNRwSLWo0mws2lFJ6Ub3DEivUEYg98M=;
        b=YRN6vSX0Qclk3kU1ID16t3yA77egQboAiRgY59w+WwqMg4lJU9EvN/mEhHelwwAoTl
         4NP/YETr1a/E/euSU4pDS+Lg2m5Nc29/J5aHA8zsxiZ+fCT1aXBHIEZ+xIuSCPB4vK7c
         RUSZDt6mfc/iVm5GDjAOKo4i7d2JtfXw4qIix3PSCcE6Nws1YLyP+MMgQ/MEt4gt+KxF
         43kDh0EwQBlPKQSPeMjCJXkEsqyucQtCA+HP8QbaW+exWfauXGjaGwYTA4+0s1vP7OKC
         UHjeLLmEXR7FYhKCxVXArEwTI/95Scc2+r2sEbqMHl7riKKtVmmkpMe6RgMP3X9xZ4UD
         MHdQ==
X-Gm-Message-State: AOAM532pOJTbwnQh+Flua9syVUWkqZajtKiM2rz5WsRB6ZLcNGnx8E3L
        xFtUUOpeZcHp8wAG+IFEXeFhajwYnngXUR4=
X-Google-Smtp-Source: 
 ABdhPJwgNKzjVGsHBaxiO+vOasI6LOgyAKHvcZIMVyf5+8nsBCJgMdx3an95uJ3DGrx5zydqTMLIpN5BTFBb3Eo=
X-Received: from tj.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:53a])
 (user=tjmercier job=sendgmr) by 2002:a81:2557:0:b0:2ef:3c92:afd4 with SMTP id
 l84-20020a812557000000b002ef3c92afd4mr22686874ywl.408.1650498759906; Wed, 20
 Apr 2022 16:52:39 -0700 (PDT)
Date: Wed, 20 Apr 2022 23:52:20 +0000
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
Message-Id: <20220420235228.2767816-3-tjmercier@google.com>
Mime-Version: 1.0
References: <20220420235228.2767816-1-tjmercier@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [RFC v5 2/6] cgroup: gpu: Add a cgroup controller for allocator
 attribution of GPU memory
From: "T.J. Mercier" <tjmercier@google.com>
To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org,
        Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>
Cc: hridya@google.com, christian.koenig@amd.com, jstultz@google.com,
        tkjos@android.com, cmllamas@google.com, surenb@google.com,
        kaleshsingh@google.com, Kenny.Ho@amd.com, mkoutny@suse.com,
        skhan@linuxfoundation.org, kernel-team@android.com,
        linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Hridya Valsaraju <hridya@google.com>

The cgroup controller provides accounting for GPU and GPU-related
memory allocations. The memory being accounted can be device memory or
memory allocated from pools dedicated to serve GPU-related tasks.

This patch adds APIs to:
- allow a device to register for memory accounting using the GPU cgroup
  controller.
- charge and uncharge allocated memory to a cgroup.

When the cgroup controller is enabled, it would expose information about
the memory allocated by each device (registered for GPU cgroup memory
accounting) for each cgroup.

The API/UAPI can be extended to set per-device/total allocation limits
in the future.

The cgroup controller has been named following the discussion in [1].

[1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.loca=
l/

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v5 changes
Support all strings for gpucg_register_device instead of just string
literals.

Enforce globally unique gpucg_bucket names.

Constrain gpucg_bucket name lengths to 64 bytes.

Obtain just a single css refcount instead of nr_pages for each
charge.

Rename:
gpucg_try_charge -> gpucg_charge
find_cg_rpool_locked -> cg_rpool_find_locked
init_cg_rpool -> cg_rpool_init
get_cg_rpool_locked -> cg_rpool_get_locked
"gpu cgroup controller" -> "GPU controller"
gpucg_device -> gpucg_bucket
usage -> size
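
Reviewer note, illustrative only: the v5 name rules above (globally
unique names, fewer than 64 bytes including the terminator) can be
modelled in userspace as below. model_register_name(), the fixed-size
table and MAX_BUCKETS are hypothetical and exist only for this sketch.
The -EEXIST and -ENOSPC return values are assumptions; the -EINVAL and
-ENAMETOOLONG checks follow gpucg_register_bucket() in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define NAME_MAX_LEN 64 /* same value as GPUCG_BUCKET_NAME_MAX_LEN */
#define MAX_BUCKETS 8   /* arbitrary bound, for this model only */

static char names[MAX_BUCKETS][NAME_MAX_LEN];
static int nr_names;

/*
 * Userspace model of the name checks: returns 0 on success or a
 * negative errno. Not the kernel implementation.
 */
static int model_register_name(const char *name)
{
        int i;

        if (!name)
                return -EINVAL;

        /* The terminator must fit, so at most 63 characters. */
        if (strlen(name) >= NAME_MAX_LEN)
                return -ENAMETOOLONG;

        /* Reject a name that is already registered (assumed errno). */
        for (i = 0; i < nr_names; i++)
                if (strcmp(names[i], name) == 0)
                        return -EEXIST;

        if (nr_names == MAX_BUCKETS)
                return -ENOSPC;

        strcpy(names[nr_names++], name);
        return 0;
}
```

Registering "dev1" twice succeeds the first time and fails the second
time in this model, and a 64-character name is rejected as too long.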

v4 changes
Adjust gpucg_try_charge critical section for future charge transfer
functionality.

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Fix incorrect Kconfig help section indentation per Randy Dunlap.
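
Reviewer note, illustrative only: the charge and uncharge paths in this
patch round the byte size up to whole pages and charge the resource pool
of the cgroup together with those of its ancestors. The userspace model
below sketches that arithmetic. struct model_counter, the model_*
functions and the fixed 4 KiB page size are assumptions made for this
sketch; the kernel code uses struct page_counter and returns -ENOMEM
where the model returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumes 4 KiB pages */
#define MODEL_PAGE_SIZE (1ULL << MODEL_PAGE_SHIFT)

/* Stand-in for struct page_counter: usage in pages plus a parent. */
struct model_counter {
        uint64_t usage;
        uint64_t max;
        struct model_counter *parent;
};

/* Same arithmetic as PAGE_ALIGN(size) >> PAGE_SHIFT. */
static uint64_t bytes_to_pages(uint64_t size)
{
        return (size + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
}

/* Charge c and every ancestor; undo and fail if a limit is exceeded. */
static int model_charge(struct model_counter *c, uint64_t size)
{
        uint64_t pages = bytes_to_pages(size);
        struct model_counter *p, *undo;

        for (p = c; p; p = p->parent) {
                if (p->usage + pages > p->max) {
                        /* Roll back the levels charged so far. */
                        for (undo = c; undo != p; undo = undo->parent)
                                undo->usage -= pages;
                        return -1;
                }
                p->usage += pages;
        }
        return 0;
}

/* Uncharge the same rounded page count from c and every ancestor. */
static void model_uncharge(struct model_counter *c, uint64_t size)
{
        uint64_t pages = bytes_to_pages(size);

        for (; c; c = c->parent)
                c->usage -= pages;
}
```

In this model a 4097-byte charge accounts two pages at every level, and
a charge that would exceed an ancestor's limit leaves all counters
unchanged.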
---
 include/linux/cgroup_gpu.h    | 123 +++++++++++++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |   7 +
 kernel/cgroup/Makefile        |   1 +
 kernel/cgroup/gpu.c           | 324 ++++++++++++++++++++++++++++++++++
 5 files changed, 459 insertions(+)
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c

diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
new file mode 100644
index 000000000000..4dfe633d6ec7
--- /dev/null
+++ b/include/linux/cgroup_gpu.h
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: MIT
+ * Copyright 2019 Advanced Micro Devices, Inc.
+ * Copyright (C) 2022 Google LLC.
+ */
+#ifndef _CGROUP_GPU_H
+#define _CGROUP_GPU_H
+
+#include <linux/cgroup.h>
+#include <linux/list.h>
+
+#define GPUCG_BUCKET_NAME_MAX_LEN 64
+
+#ifdef CONFIG_CGROUP_GPU
+/* The GPU cgroup controller data structure */
+struct gpucg {
+	struct cgroup_subsys_state css;
+
+	/* list of all resource pools that belong to this cgroup */
+	struct list_head rpools;
+};
+
+/* A named entity representing bucket of tracked memory. */
+struct gpucg_bucket {
+	/* list of various resource pools in various cgroups that the bucket is p=
art of */
+	struct list_head rpools;
+
+	/* list of all buckets registered for GPU cgroup accounting */
+	struct list_head bucket_node;
+
+	/* string to be used as identifier for accounting and limit setting */
+	const char *name;
+};
+
+/**
+ * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_sta=
te
+ * @css: the target cgroup_subsys_state
+ *
+ * Returns: gpu cgroup that contains the @css
+ */
+static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct gpucg, css) : NULL;
+}
+
+/**
+ * gpucg_get - get the gpucg reference that a task belongs to
+ * @task: the target task
+ *
+ * This increases the reference count of the css that the @task belongs to.
+ *
+ * Returns: reference to the gpu cgroup the task belongs to.
+ */
+static inline struct gpucg *gpucg_get(struct task_struct *task)
+{
+	if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
+		return NULL;
+	return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
+}
+
+/**
+ * gpucg_put - put a gpucg reference
+ * @gpucg: the target gpucg
+ *
+ * Put a reference obtained via gpucg_get
+ */
+static inline void gpucg_put(struct gpucg *gpucg)
+{
+	if (gpucg)
+		css_put(&gpucg->css);
+}
+
+/**
+ * gpucg_parent - find the parent of a gpu cgroup
+ * @cg: the target gpucg
+ *
+ * This does not increase the reference count of the parent cgroup
+ *
+ * Returns: parent gpu cgroup of @cg
+ */
+static inline struct gpucg *gpucg_parent(struct gpucg *cg)
+{
+	return css_to_gpucg(cg->css.parent);
+}
+
+int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 siz=
e);
+void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 =
size);
+int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name);
+#else /* CONFIG_CGROUP_GPU */
+
+struct gpucg;
+struct gpucg_bucket;
+
+static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
+static inline struct gpucg *gpucg_get(struct task_struct *task)
+{
+	return NULL;
+}
+
+static inline void gpucg_put(struct gpucg *gpucg) {}
+
+static inline struct gpucg *gpucg_parent(struct gpucg *cg)
+{
+	return NULL;
+}
+
+static inline int gpucg_charge(struct gpucg *gpucg,
+			       struct gpucg_bucket *bucket,
+			       u64 size)
+{
+	return 0;
+}
+
+static inline void gpucg_uncharge(struct gpucg *gpucg,
+				  struct gpucg_bucket *bucket,
+				  u64 size) {}
+
+static inline int gpucg_register_bucket(struct gpucg_bucket *bucket, const=
 char *name) { return 0; }
+#endif /* CONFIG_CGROUP_GPU */
+#endif /* _CGROUP_GPU_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..46a2a7b93c41 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
 SUBSYS(misc)
 #endif
=20
+#if IS_ENABLED(CONFIG_CGROUP_GPU)
+SUBSYS(gpu)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index ddcbefe535e9..2e00a190e170 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -984,6 +984,13 @@ config BLK_CGROUP
=20
 	See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more inf=
ormation.
=20
+config CGROUP_GPU
+	bool "GPU controller (EXPERIMENTAL)"
+	select PAGE_COUNTER
+	help
+	  Provides accounting and limit setting for memory allocations by the GPU=
 and
+	  GPU-related subsystems.
+
 config CGROUP_WRITEBACK
 	bool
 	depends on MEMCG && BLK_CGROUP
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..be95a5a532fc 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) +=3D rdma.o
 obj-$(CONFIG_CPUSETS) +=3D cpuset.o
 obj-$(CONFIG_CGROUP_MISC) +=3D misc.o
 obj-$(CONFIG_CGROUP_DEBUG) +=3D debug.o
+obj-$(CONFIG_CGROUP_GPU) +=3D gpu.o
diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
new file mode 100644
index 000000000000..34d0a5b85834
--- /dev/null
+++ b/kernel/cgroup/gpu.c
@@ -0,0 +1,324 @@
+// SPDX-License-Identifier: MIT
+// Copyright 2019 Advanced Micro Devices, Inc.
+// Copyright (C) 2022 Google LLC.
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_gpu.h>
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/page_counter.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+static struct gpucg *root_gpucg __read_mostly;
+
+/*
+ * Protects list of resource pools maintained on per cgroup basis and list
+ * of buckets registered for memory accounting using the GPU cgroup contro=
ller.
+ */
+static DEFINE_MUTEX(gpucg_mutex);
+static LIST_HEAD(gpucg_buckets);
+
+struct gpucg_resource_pool {
+	/* The bucket whose resource usage is tracked by this resource pool */
+	struct gpucg_bucket *bucket;
+
+	/* list of all resource pools for the cgroup */
+	struct list_head cg_node;
+
+	/* list maintained by the gpucg_bucket to keep track of its resource pool=
s */
+	struct list_head bucket_node;
+
+	/* tracks memory usage of the resource pool */
+	struct page_counter total;
+};
+
+static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
+{
+	lockdep_assert_held(&gpucg_mutex);
+
+	list_del(&rpool->cg_node);
+	list_del(&rpool->bucket_node);
+	kfree(rpool);
+}
+
+static void gpucg_css_free(struct cgroup_subsys_state *css)
+{
+	struct gpucg_resource_pool *rpool, *tmp;
+	struct gpucg *gpucg =3D css_to_gpucg(css);
+
+	/* delete all resource pools */
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
+		free_cg_rpool_locked(rpool);
+	mutex_unlock(&gpucg_mutex);
+
+	kfree(gpucg);
+}
+
+static struct cgroup_subsys_state *
+gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct gpucg *gpucg, *parent;
+
+	gpucg =3D kzalloc(sizeof(struct gpucg), GFP_KERNEL);
+	if (!gpucg)
+		return ERR_PTR(-ENOMEM);
+
+	parent =3D css_to_gpucg(parent_css);
+	if (!parent)
+		root_gpucg =3D gpucg;
+
+	INIT_LIST_HEAD(&gpucg->rpools);
+
+	return &gpucg->css;
+}
+
+static struct gpucg_resource_pool *cg_rpool_find_locked(
+	struct gpucg *cg,
+	struct gpucg_bucket *bucket)
+{
+	struct gpucg_resource_pool *rpool;
+
+	lockdep_assert_held(&gpucg_mutex);
+
+	list_for_each_entry(rpool, &cg->rpools, cg_node)
+		if (rpool->bucket =3D=3D bucket)
+			return rpool;
+
+	return NULL;
+}
+
+static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
+						 struct gpucg_bucket *bucket)
+{
+	struct gpucg_resource_pool *rpool =3D kzalloc(sizeof(*rpool),
+							GFP_KERNEL);
+	if (!rpool)
+		return ERR_PTR(-ENOMEM);
+
+	rpool->bucket =3D bucket;
+
+	page_counter_init(&rpool->total, NULL);
+	INIT_LIST_HEAD(&rpool->cg_node);
+	INIT_LIST_HEAD(&rpool->bucket_node);
+	list_add_tail(&rpool->cg_node, &cg->rpools);
+	list_add_tail(&rpool->bucket_node, &bucket->rpools);
+
+	return rpool;
+}
+
+/**
+ * cg_rpool_get_locked - find the resource pool for the specified bucket a=
nd
+ * specified cgroup. If the resource pool does not exist for the cg, it is
+ * created in a hierarchical manner in the cgroup and its ancestor cgroups=
 who
+ * do not already have a resource pool entry for the bucket.
+ *
+ * @cg: The cgroup to find the resource pool for.
+ * @bucket: The bucket associated with the returned resource pool.
+ *
+ * Return: return resource pool entry corresponding to the specified bucke=
t in
+ * the specified cgroup (hierarchically creating them if not existing alre=
ady).
+ *
+ */
+static struct gpucg_resource_pool *
+cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
+{
+	struct gpucg *parent_cg, *p, *stop_cg;
+	struct gpucg_resource_pool *rpool, *tmp_rpool;
+	struct gpucg_resource_pool *parent_rpool =3D NULL, *leaf_rpool =3D NULL;
+
+	rpool =3D cg_rpool_find_locked(cg, bucket);
+	if (rpool)
+		return rpool;
+
+	stop_cg =3D cg;
+	do {
+		rpool =3D cg_rpool_init(stop_cg, bucket);
+		if (IS_ERR(rpool))
+			goto err;
+
+		if (!leaf_rpool)
+			leaf_rpool =3D rpool;
+
+		stop_cg =3D gpucg_parent(stop_cg);
+		if (!stop_cg)
+			break;
+
+		rpool =3D cg_rpool_find_locked(stop_cg, bucket);
+	} while (!rpool);
+
+	/*
+	 * Re-initialize page counters of all rpools created in this invocation
+	 * to enable hierarchical charging.
+	 * stop_cg is the first ancestor cg who already had a resource pool for
+	 * the bucket. It can also be NULL if no ancestors had a pre-existing
+	 * resource pool for the bucket before this invocation.
+	 */
+	rpool =3D leaf_rpool;
+	for (p =3D cg; p !=3D stop_cg; p =3D parent_cg) {
+		parent_cg =3D gpucg_parent(p);
+		if (!parent_cg)
+			break;
+		parent_rpool =3D cg_rpool_find_locked(parent_cg, bucket);
+		page_counter_init(&rpool->total, &parent_rpool->total);
+
+		rpool =3D parent_rpool;
+	}
+
+	return leaf_rpool;
+err:
+	for (p =3D cg; p !=3D stop_cg; p =3D gpucg_parent(p)) {
+		tmp_rpool =3D cg_rpool_find_locked(p, bucket);
+		free_cg_rpool_locked(tmp_rpool);
+	}
+	return rpool;
+}
+
+/**
+ * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
+ * Caller must hold a reference to @gpucg obtained through gpucg_get(). Th=
e size
+ * of the memory is rounded up to be a multiple of the page size.
+ *
+ * @gpucg: The gpu cgroup to charge the memory to.
+ * @bucket: The bucket to charge the memory to.
+ * @size: The size of memory to charge in bytes.
+ *        This size will be rounded up to the nearest page size.
+ *
+ * Return: returns 0 if the charging is successful and otherwise returns an
+ * error code.
+ */
+int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 siz=
e)
+{
+	struct page_counter *counter;
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp;
+	int ret =3D 0;
+
+	nr_pages =3D PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+	mutex_lock(&gpucg_mutex);
+	rp =3D cg_rpool_get_locked(gpucg, bucket);
+	/*
+	 * Continue to hold gpucg_mutex because we use it to block charges while =
transfers are in
+	 * progress to avoid potentially exceeding a limit.
+	 */
+	if (IS_ERR(rp)) {
+		mutex_unlock(&gpucg_mutex);
+		return PTR_ERR(rp);
+	}
+
+	if (page_counter_try_charge(&rp->total, nr_pages, &counter))
+		css_get(&gpucg->css);
+	else
+		ret =3D -ENOMEM;
+	mutex_unlock(&gpucg_mutex);
+
+	return ret;
+}
+
+/**
+ * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_buc=
ket.
+ * The caller must hold a reference to @gpucg obtained through gpucg_get().
+ *
+ * @gpucg: The gpu cgroup to uncharge the memory from.
+ * @bucket: The bucket to uncharge the memory from.
+ * @size: The size of memory to uncharge in bytes.
+ *        This size will be rounded up to the nearest page size.
+ */
+void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 =
size)
+{
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp;
+
+	mutex_lock(&gpucg_mutex);
+	rp =3D cg_rpool_find_locked(gpucg, bucket);
+	/*
+	 * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is fr=
eed and there are
+	 * active refs on gpucg. Uncharges are fine while transfers are in progre=
ss since there is
+	 * no potential to exceed a limit while uncharging and transferring.
+	 */
+	mutex_unlock(&gpucg_mutex);
+
+	if (unlikely(!rp)) {
+		pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
+		return;
+	}
+
+	nr_pages =3D PAGE_ALIGN(size) >> PAGE_SHIFT;
+	page_counter_uncharge(&rp->total, nr_pages);
+	css_put(&gpucg->css);
+}
+
+/**
+ * gpucg_register_bucket - Registers a bucket for memory accounting using =
the
+ * GPU cgroup controller.
+ *
+ * @bucket: The bucket to register for memory accounting.
+ * @name: Pointer to a null-terminated string to denote the name of the bu=
cket. This name should be
+ *        globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LE=
N bytes.
+ *
+ * @bucket must remain valid. @name will be copied.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name)
+{
+	struct gpucg_bucket *b;
+
+	if (!bucket || !name)
+		return -EINVAL;
+
+	if (strlen(name) >=3D GPUCG_BUCKET_NAME_MAX_LEN)
+		return -ENAMETOOLONG;
+
+	INIT_LIST_HEAD(&bucket->bucket_node);
+	INIT_LIST_HEAD(&bucket->rpools);
+	bucket->name =3D kstrdup_const(name, GFP_KERNEL);
+
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry(b, &gpucg_buckets, bucket_node) {
+		if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) =3D=3D 0) {
+			mutex_unlock(&gpucg_mutex);
+			kfree_const(bucket->name);
+			return -EEXIST;
+		}
+	}
+	list_add_tail(&bucket->bucket_node, &gpucg_buckets);
+	mutex_unlock(&gpucg_mutex);
+
+	return 0;
+}
+
+static int gpucg_resource_show(struct seq_file *sf, void *v)
+{
+	struct gpucg_resource_pool *rpool;
+	struct gpucg *cg =3D css_to_gpucg(seq_css(sf));
+
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry(rpool, &cg->rpools, cg_node) {
+		seq_printf(sf, "%s %lu\n", rpool->bucket->name,
+			   page_counter_read(&rpool->total) * PAGE_SIZE);
+	}
+	mutex_unlock(&gpucg_mutex);
+
+	return 0;
+}
+
+struct cftype files[] =3D {
+	{
+		.name =3D "memory.current",
+		.seq_show =3D gpucg_resource_show,
+	},
+	{ }     /* terminate */
+};
+
+struct cgroup_subsys gpu_cgrp_subsys =3D {
+	.css_alloc      =3D gpucg_css_alloc,
+	.css_free       =3D gpucg_css_free,
+	.early_init     =3D false,
+	.legacy_cftypes =3D files,
+	.dfl_cftypes    =3D files,
+};
--=20
2.36.0.rc0.470.gd361397f0d-goog
From nobody Mon May 11 00:10:12 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 874F4C433F5
	for <linux-kernel@archiver.kernel.org>; Wed, 20 Apr 2022 23:52:55 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1383334AbiDTXzk (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 20 Apr 2022 19:55:40 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33644 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1383313AbiDTXzc (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Apr 2022 19:55:32 -0400
Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com
 [IPv6:2607:f8b0:4864:20::1149])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CA64F3E0D4
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:44 -0700 (PDT)
Received: by mail-yw1-x1149.google.com with SMTP id
 00721157ae682-2f18a73fabeso29115677b3.20
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc:content-transfer-encoding;
        bh=5E9Rbs7lnkN3+W98JnkIUAPTQ49TUyPznJB6J7YNkeo=;
        b=SWzxIRgdRQUtqcnezZ1qAW7U1kX7+57LPNNbfRy88rcYumm4IA2OJuYgyyg7UDgn4q
         MCDVqH2XccdjAMgVUESHKUP693m4h0vVR930REBPdZ6wch5vagUCuVlCO3pq1G/WttSm
         AKvzWT9Nlv5PQyonszFAMeFGYDERNSYU+Ih9zKKj6eBos5pKLMLliVS1pcZcxSRG7Zrn
         Oqa7NQcw9kxxBDKs5sWyq9NwhYkTbsb6IZoi2/ioWxpbEzVBB21ZKhOvAmxpG8APKTEc
         XJCoqM2RQuc7J0nxY6KCv4Zd+OgEpsgHqAse8y3xvopI5/qJepc4wtE8OIQeY+rpy9yg
         tZlw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc:content-transfer-encoding;
        bh=5E9Rbs7lnkN3+W98JnkIUAPTQ49TUyPznJB6J7YNkeo=;
        b=4aN42HLsRtpdlsfJh57soeEygvNbToPk/P+RnmeT5FuwNo0/qgWkePwYUxa2egGat4
         bHCda4/MCGn2M835T5dXn5xljEB2H66gZ9DVLs0s5KdJ79Jo10V/F62sjCQDezOMtSUo
         Dqp0AUp3IVPHCUvqQT76SFKfHSn9iLkri5OBvqSHv94KntSltbWXnrH2I+89ktU5DpE8
         l4PfZAyI2m08CiHy8Q34CTudkoWefjaD/D33F97x8G6UmITKuOaP8cvtGTlX3YUfIiiJ
         Kal7/Zt/Ljy38JqAIAbcORIg3qemKee3SJEvcelLNVT1mMC/KU1tdZACC4c4Txg3YIE6
         8ILQ==
X-Gm-Message-State: AOAM530Xxlt6XgetfV8JP8ZPN/ikSyoSdCC5ym/WG7FpnOQJPrNPfijB
        rqszYj1GMEYAiSXkNyPnSsPEQ/acfXvHWwI=
X-Google-Smtp-Source: 
 ABdhPJw5EwHTC7Q1N41+mc2GzL5YAYNhrKpjL8DBLEBrLVaVg9GO5bS/iutUfeBK1JN69nQVWaRhMs4VIs/z4os=
X-Received: from tj.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:53a])
 (user=tjmercier job=sendgmr) by 2002:a25:7795:0:b0:645:682a:d56e with SMTP id
 s143-20020a257795000000b00645682ad56emr3285177ybc.403.1650498763943; Wed, 20
 Apr 2022 16:52:43 -0700 (PDT)
Date: Wed, 20 Apr 2022 23:52:21 +0000
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
Message-Id: <20220420235228.2767816-4-tjmercier@google.com>
Mime-Version: 1.0
References: <20220420235228.2767816-1-tjmercier@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [RFC v5 3/6] dmabuf: heaps: export system_heap buffers with GPU
 cgroup charging
From: "T.J. Mercier" <tjmercier@google.com>
To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org,
        Sumit Semwal <sumit.semwal@linaro.org>,
        "=?UTF-8?q?Christian=20K=C3=B6nig?=" <christian.koenig@amd.com>,
        Benjamin Gaignard <benjamin.gaignard@collabora.com>,
        Liam Mark <lmark@codeaurora.org>,
        Laura Abbott <labbott@redhat.com>,
        Brian Starkey <Brian.Starkey@arm.com>,
        John Stultz <john.stultz@linaro.org>
Cc: hridya@google.com, jstultz@google.com, tkjos@android.com,
        cmllamas@google.com, surenb@google.com, kaleshsingh@google.com,
        Kenny.Ho@amd.com, mkoutny@suse.com, skhan@linuxfoundation.org,
        kernel-team@android.com, linux-media@vger.kernel.org,
        dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
        linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

All DMA heaps now register a new GPU cgroup bucket upon creation, and the
system_heap now exports buffers associated with its GPU cgroup bucket for
tracking purposes.

In order to support GPU cgroup charge transfer on a dma-buf, the current
GPU cgroup information must be stored inside the dma-buf struct. For
tracked buffers, exporters include the struct gpucg and struct
gpucg_bucket pointers in the export info, and dma_buf_export() copies
them into the dma-buf. The gpucg pointer can later be updated if the
charge is migrated to another cgroup.

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v5 changes
Merge "dmabuf: Use the GPU cgroup charge/uncharge APIs" into this patch.

Remove all GPU cgroup code from dma-buf except what's necessary to support
charge transfer. Previously charging was done in export, but for
non-Android graphics use-cases this is not ideal since there may be a
delay between allocation and export, during which time there is no
accounting.

Append "-heap" to gpucg_bucket names.
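
For reviewers, a minimal userspace sketch of how the bucket name is
composed and length-checked in dma_heap_add(). MODEL_NAME_MAX_LEN and
model_bucket_name() are hypothetical; the value 64 is only an assumption
standing in for GPUCG_BUCKET_NAME_MAX_LEN, which is defined elsewhere in
the series.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: stands in for GPUCG_BUCKET_NAME_MAX_LEN. */
#define MODEL_NAME_MAX_LEN 64
#define HEAP_NAME_SUFFIX "-heap"

/*
 * Hypothetical helper: build "<heap_name>-heap" in @out.
 * Returns 0 on success, or -1 if the name plus suffix would not fit
 * together with its terminating NUL (the patch returns -ENAMETOOLONG).
 */
static int model_bucket_name(const char *heap_name,
			     char out[MODEL_NAME_MAX_LEN])
{
	if (strlen(heap_name) + strlen(HEAP_NAME_SUFFIX) >= MODEL_NAME_MAX_LEN)
		return -1;

	snprintf(out, MODEL_NAME_MAX_LEN, "%s%s", heap_name, HEAP_NAME_SUFFIX);
	return 0;
}
```

With these assumptions a heap named "system" gets the bucket name
"system-heap", and a 59-character heap name is rejected because the
suffix would leave no room for the NUL terminator.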

Charge on allocation instead of export. This should more closely mirror
non-Android use-cases where there is potentially a delay between allocation
and export.

Put the charge and uncharge code in the same file (system_heap_allocate,
system_heap_dma_buf_release) instead of splitting them between the heap and
the dma_buf_release.
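
A rough userspace model of the charge-on-allocate, uncharge-on-release
pairing described above may help when reading the system_heap changes.
Everything named model_* is hypothetical, the pool is a flat counter with
no cgroup hierarchy, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one (gpucg, bucket) resource pool. */
struct model_pool {
	uint64_t pages; /* pages currently charged */
	uint64_t limit; /* maximum pages that may be charged */
};

/* Round bytes up to whole pages, like PAGE_ALIGN(size) >> PAGE_SHIFT. */
static uint64_t model_bytes_to_pages(uint64_t size)
{
	return (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

static int model_charge(struct model_pool *pool, uint64_t size)
{
	uint64_t nr_pages = model_bytes_to_pages(size);

	if (pool->pages + nr_pages > pool->limit)
		return -1; /* the real code returns -ENOMEM */
	pool->pages += nr_pages;
	return 0;
}

static void model_uncharge(struct model_pool *pool, uint64_t size)
{
	uint64_t nr_pages = model_bytes_to_pages(size);

	assert(pool->pages >= nr_pages);
	pool->pages -= nr_pages;
}

/* Charge first, then allocate; drop the charge if allocation fails. */
static void *model_allocate(struct model_pool *pool, uint64_t size)
{
	void *buf;

	if (model_charge(pool, size))
		return NULL;

	buf = malloc(size);
	if (!buf)
		model_uncharge(pool, size);
	return buf;
}

/* Release mirrors allocate: free the memory, then drop the charge. */
static void model_release(struct model_pool *pool, void *buf, uint64_t size)
{
	free(buf);
	model_uncharge(pool, size);
}
```

In this model a 5000-byte allocation is charged as two pages, and an
allocation that would exceed the limit fails without leaving any charge
behind, which is the property the error labels in system_heap_allocate()
are meant to preserve.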

Move no-op code to header file to match other files in the series.

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian K=C3=B6nig.
---
 drivers/dma-buf/dma-buf.c           | 19 +++++++++++++
 drivers/dma-buf/dma-heap.c          | 39 +++++++++++++++++++++++++++
 drivers/dma-buf/heaps/system_heap.c | 28 +++++++++++++++++---
 include/linux/dma-buf.h             | 41 +++++++++++++++++++++++------
 include/linux/dma-heap.h            | 15 +++++++++++
 5 files changed, 130 insertions(+), 12 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index df23239b04fc..bc89c44bd9b9 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -462,6 +462,24 @@ static struct file *dma_buf_getfile(struct dma_buf *dm=
abuf, int flags)
  * &dma_buf_ops.
  */
=20
+#ifdef CONFIG_CGROUP_GPU
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf=
_export_info *exp)
+{
+	dmabuf->gpucg =3D exp->gpucg;
+	dmabuf->gpucg_bucket =3D exp->gpucg_bucket;
+}
+
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+				struct gpucg *gpucg,
+				struct gpucg_bucket *gpucg_bucket)
+{
+	exp_info->gpucg =3D gpucg;
+	exp_info->gpucg_bucket =3D gpucg_bucket;
+}
+#else
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf=
_export_info *exp) {}
+#endif
+
 /**
  * dma_buf_export - Creates a new dma_buf, and associates an anon file
  * with this buffer, so it can be exported.
@@ -527,6 +545,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_exp=
ort_info *exp_info)
 	init_waitqueue_head(&dmabuf->poll);
 	dmabuf->cb_in.poll =3D dmabuf->cb_out.poll =3D &dmabuf->poll;
 	dmabuf->cb_in.active =3D dmabuf->cb_out.active =3D 0;
+	dma_buf_set_gpucg(dmabuf, exp_info);
=20
 	if (!resv) {
 		resv =3D (struct dma_resv *)&dmabuf[1];
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index 8f5848aa144f..b81015548314 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,10 +7,12 @@
  */
=20
 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/debugfs.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/err.h>
+#include <linux/kconfig.h>
 #include <linux/xarray.h>
 #include <linux/list.h>
 #include <linux/slab.h>
@@ -21,6 +23,7 @@
 #include <uapi/linux/dma-heap.h>
=20
 #define DEVNAME "dma_heap"
+#define HEAP_NAME_SUFFIX "-heap"
=20
 #define NUM_HEAP_MINORS 128
=20
@@ -31,6 +34,7 @@
  * @heap_devt		heap device node
  * @list		list head connecting to list of heaps
  * @heap_cdev		heap char device
+ * @gpucg_bucket	gpu cgroup bucket for memory accounting
  *
  * Represents a heap of memory from which buffers can be made.
  */
@@ -41,6 +45,9 @@ struct dma_heap {
 	dev_t heap_devt;
 	struct list_head list;
 	struct cdev heap_cdev;
+#ifdef CONFIG_CGROUP_GPU
+	struct gpucg_bucket gpucg_bucket;
+#endif
 };
=20
 static LIST_HEAD(heap_list);
@@ -216,6 +223,19 @@ const char *dma_heap_get_name(struct dma_heap *heap)
 	return heap->name;
 }
=20
+/**
+ * dma_heap_get_gpucg_bucket() - get struct gpucg_bucket for the heap.
+ * @heap: DMA-Heap to get the gpucg_bucket struct for.
+ *
+ * Returns:
+ * The gpucg_bucket struct for the heap. NULL if the GPU cgroup controller=
 is
+ * not enabled.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap)
+{
+	return &heap->gpucg_bucket;
+}
+
 struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 {
 	struct dma_heap *heap, *h, *err_ret;
@@ -228,6 +248,12 @@ struct dma_heap *dma_heap_add(const struct dma_heap_ex=
port_info *exp_info)
 		return ERR_PTR(-EINVAL);
 	}
=20
+	if (IS_ENABLED(CONFIG_CGROUP_GPU) && strlen(exp_info->name) + strlen(HEAP=
_NAME_SUFFIX) >=3D
+		GPUCG_BUCKET_NAME_MAX_LEN) {
+		pr_err("dma_heap: Name is too long for GPU cgroup\n");
+		return ERR_PTR(-ENAMETOOLONG);
+	}
+
 	if (!exp_info->ops || !exp_info->ops->allocate) {
 		pr_err("dma_heap: Cannot add heap with invalid ops struct\n");
 		return ERR_PTR(-EINVAL);
@@ -253,6 +279,19 @@ struct dma_heap *dma_heap_add(const struct dma_heap_ex=
port_info *exp_info)
 	heap->ops =3D exp_info->ops;
 	heap->priv =3D exp_info->priv;
=20
+	if (IS_ENABLED(CONFIG_CGROUP_GPU)) {
+		char gpucg_bucket_name[GPUCG_BUCKET_NAME_MAX_LEN];
+
+		snprintf(gpucg_bucket_name, sizeof(gpucg_bucket_name), "%s%s",
+			 exp_info->name, HEAP_NAME_SUFFIX);
+
+		ret =3D gpucg_register_bucket(dma_heap_get_gpucg_bucket(heap), gpucg_buc=
ket_name);
+		if (ret < 0) {
+			err_ret =3D ERR_PTR(ret);
+			goto err0;
+		}
+	}
+
 	/* Find unused minor number */
 	ret =3D xa_alloc(&dma_heap_minors, &minor, heap,
 		       XA_LIMIT(0, NUM_HEAP_MINORS - 1), GFP_KERNEL);
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/sy=
stem_heap.c
index fcf836ba9c1f..27f686faef00 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -297,6 +297,11 @@ static void system_heap_dma_buf_release(struct dma_buf=
 *dmabuf)
 	}
 	sg_free_table(table);
 	kfree(buffer);
+
+	if (dmabuf->gpucg && dmabuf->gpucg_bucket) {
+		gpucg_uncharge(dmabuf->gpucg, dmabuf->gpucg_bucket, dmabuf->size);
+		gpucg_put(dmabuf->gpucg);
+	}
 }
=20
 static const struct dma_buf_ops system_heap_buf_ops =3D {
@@ -346,11 +351,21 @@ static struct dma_buf *system_heap_allocate(struct dm=
a_heap *heap,
 	struct scatterlist *sg;
 	struct list_head pages;
 	struct page *page, *tmp_page;
-	int i, ret =3D -ENOMEM;
+	struct gpucg *gpucg;
+	struct gpucg_bucket *gpucg_bucket;
+	int i, ret;
+
+	gpucg =3D gpucg_get(current);
+	gpucg_bucket =3D dma_heap_get_gpucg_bucket(heap);
+	ret =3D gpucg_charge(gpucg, gpucg_bucket, len);
+	if (ret)
+		goto put_gpucg;
=20
 	buffer =3D kzalloc(sizeof(*buffer), GFP_KERNEL);
-	if (!buffer)
-		return ERR_PTR(-ENOMEM);
+	if (!buffer) {
+		ret =3D -ENOMEM;
+		goto uncharge_gpucg;
+	}
=20
 	INIT_LIST_HEAD(&buffer->attachments);
 	mutex_init(&buffer->lock);
@@ -396,6 +411,8 @@ static struct dma_buf *system_heap_allocate(struct dma_=
heap *heap,
 	exp_info.size =3D buffer->len;
 	exp_info.flags =3D fd_flags;
 	exp_info.priv =3D buffer;
+	dma_buf_exp_info_set_gpucg(&exp_info, gpucg, gpucg_bucket);
+
 	dmabuf =3D dma_buf_export(&exp_info);
 	if (IS_ERR(dmabuf)) {
 		ret =3D PTR_ERR(dmabuf);
@@ -414,7 +431,10 @@ static struct dma_buf *system_heap_allocate(struct dma=
_heap *heap,
 	list_for_each_entry_safe(page, tmp_page, &pages, lru)
 		__free_pages(page, compound_order(page));
 	kfree(buffer);
-
+uncharge_gpucg:
+	gpucg_uncharge(gpucg, gpucg_bucket, len);
+put_gpucg:
+	gpucg_put(gpucg);
 	return ERR_PTR(ret);
 }
=20
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 2097760e8e95..8e7c55c830b3 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -13,6 +13,7 @@
 #ifndef __DMA_BUF_H__
 #define __DMA_BUF_H__
=20
+#include <linux/cgroup_gpu.h>
 #include <linux/iosys-map.h>
 #include <linux/file.h>
 #include <linux/err.h>
@@ -303,7 +304,7 @@ struct dma_buf {
 	/**
 	 * @size:
 	 *
-	 * Size of the buffer; invariant over the lifetime of the buffer.
+	 * Size of the buffer in bytes; invariant over the lifetime of the buffer.
 	 */
 	size_t size;
=20
@@ -453,6 +454,14 @@ struct dma_buf {
 		struct dma_buf *dmabuf;
 	} *sysfs_entry;
 #endif
+
+#ifdef CONFIG_CGROUP_GPU
+	/** @gpucg: Pointer to the GPU cgroup this buffer currently belongs to. */
+	struct gpucg *gpucg;
+
+	/** @gpucg_bucket: Pointer to the GPU cgroup bucket whence this buffer or=
iginates. */
+	struct gpucg_bucket *gpucg_bucket;
+#endif
 };
=20
 /**
@@ -526,13 +535,15 @@ struct dma_buf_attachment {
=20
 /**
  * struct dma_buf_export_info - holds information needed to export a dma_b=
uf
- * @exp_name:	name of the exporter - useful for debugging.
- * @owner:	pointer to exporter module - used for refcounting kernel module
- * @ops:	Attach allocator-defined dma buf ops to the new buffer
- * @size:	Size of the buffer - invariant over the lifetime of the buffer
- * @flags:	mode flags for the file
- * @resv:	reservation-object, NULL to allocate default one
- * @priv:	Attach private data of allocator to this buffer
+ * @exp_name:		name of the exporter - useful for debugging.
+ * @owner:		pointer to exporter module - used for refcounting kernel module
+ * @ops:		Attach allocator-defined dma buf ops to the new buffer
+ * @size:		Size of the buffer in bytes - invariant over the lifetime of th=
e buffer
+ * @flags:		mode flags for the file
+ * @resv:		reservation-object, NULL to allocate default one
+ * @priv:		Attach private data of allocator to this buffer
+ * @gpucg:		Pointer to GPU cgroup this buffer is charged to, or NULL if no=
t charged
+ * @gpucg_bucket:	Pointer to GPU cgroup bucket this buffer comes from, or =
NULL if not charged
  *
  * This structure holds the information required to export the buffer. Used
  * with dma_buf_export() only.
@@ -545,6 +556,10 @@ struct dma_buf_export_info {
 	int flags;
 	struct dma_resv *resv;
 	void *priv;
+#ifdef CONFIG_CGROUP_GPU
+	struct gpucg *gpucg;
+	struct gpucg_bucket *gpucg_bucket;
+#endif
 };
=20
 /**
@@ -630,4 +645,14 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_stru=
ct *,
 		 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
+
+#ifdef CONFIG_CGROUP_GPU
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+				struct gpucg *gpucg,
+				struct gpucg_bucket *gpucg_bucket);
+#else/* CONFIG_CGROUP_GPU */
+static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *=
exp_info,
+					      struct gpucg *gpucg,
+					      struct gpucg_bucket *gpucg_bucket) {}
+#endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */
diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
index 0c05561cad6e..6321e7636538 100644
--- a/include/linux/dma-heap.h
+++ b/include/linux/dma-heap.h
@@ -10,6 +10,7 @@
 #define _DMA_HEAPS_H
=20
 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/types.h>
=20
 struct dma_heap;
@@ -59,6 +60,20 @@ void *dma_heap_get_drvdata(struct dma_heap *heap);
  */
 const char *dma_heap_get_name(struct dma_heap *heap);
=20
+#ifdef CONFIG_CGROUP_GPU
+/**
+ * dma_heap_get_gpucg_bucket() - get a pointer to the struct gpucg_bucket =
for the heap.
+ * @heap: DMA-Heap to retrieve gpucg_bucket for
+ *
+ * Returns:
+ * The gpucg_bucket struct for the heap.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap);
+#else /* CONFIG_CGROUP_GPU */
+static inline struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_he=
ap *heap)
+{ return NULL; }
+#endif /* CONFIG_CGROUP_GPU */
+
 /**
  * dma_heap_add - adds a heap to dmabuf heaps
  * @exp_info:		information needed to register this heap
--=20
2.36.0.rc0.470.gd361397f0d-goog
From nobody Mon May 11 00:10:12 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id B18E8C433EF
	for <linux-kernel@archiver.kernel.org>; Wed, 20 Apr 2022 23:53:01 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1383338AbiDTXzq (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 20 Apr 2022 19:55:46 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33796 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1383327AbiDTXzj (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Apr 2022 19:55:39 -0400
Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com
 [IPv6:2607:f8b0:4864:20::b4a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C68C43E0E2
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:48 -0700 (PDT)
Received: by mail-yb1-xb4a.google.com with SMTP id
 i5-20020a258b05000000b006347131d40bso2865813ybl.17
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:48 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc:content-transfer-encoding;
        bh=MFjiN2c4422vfH3G5XH/W0LypfixtCAuep+NsTNNNpQ=;
        b=O9DK+nzovZ+3UKgC7FIOwT0aMuF2SvsCaF3ZXeSoWy1EhpK0ePx8O/H08V2UqVfgNE
         UN5jw6LkKrD8qnmQ8yNpabAi8LYhV7IVlIhIyxjzWt3DwY0xQYzoGMBwBTocnz0s4rQ5
         ZOKGBkrXwliivGK3Dk1OOuz1iBAnoM5/GpdgoZDvda2LQ+79y+gHEcN4hEcic5si8/tq
         BgIQcchgyZXlLHtJLZrCzUf6RpuLOSrX5iJi0+Wk0Z3RcDlTPXgAj1jg8XJPOdnclJ4g
         1cwxGcpXpjpo6GYI8wlIE6qmSSkrnjGd9jEmnSx6WLGCOCS3QWAGjtKo+pr14uahGroF
         uPhQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc:content-transfer-encoding;
        bh=MFjiN2c4422vfH3G5XH/W0LypfixtCAuep+NsTNNNpQ=;
        b=WXPa+PiCpR0C3jizRvRKCQExlkmHHtAYjOqp0g+3SoCeIqXmNEKXlHnUyYRx9kCIGO
         p3qHkglG7BGNdaYWRZpF3aheTZtJxatrcaMMidurogC84MZy9iF0dtfUdE6DbLS5kUUO
         UyHgYLz/WWv11vEgbVSFbnNVGUasZFX+Z1BAT8Q41oLB93ZIcJAv/F+tvLrzT/xMh97E
         9wWUU74+Sx6/tYa/sqRK5C7ZMC/Z5egPI0za1W4O96PlR0KZOu5c5EFaZj7ZItzhQjrR
         iJ3r9Ns5tSxt+KnqVHG5UaM885zUZ9KnwsJrHGVUgMXjfXgiJ6NmIcpyJNs+9nJGHoIC
         5hAw==
X-Gm-Message-State: AOAM531egS3/hV745kRhoLh1wsXaszFirZNcXsORvoOkh9y+EiaTaFrz
        OPicxvUxw/LKId1gJGUkebgsxB5yf1DyaFY=
X-Google-Smtp-Source: 
 ABdhPJwgj/q1pkBqUx+EtmezTxZVaTnub+E1mqztCIeOzLZpLnxcjiZoZHoIs2LVhmVe02tDSbzgfaHA7Of00Js=
X-Received: from tj.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:53a])
 (user=tjmercier job=sendgmr) by 2002:a0d:f0c3:0:b0:2f4:d291:9dde with SMTP id
 z186-20020a0df0c3000000b002f4d2919ddemr1882789ywe.437.1650498767961; Wed, 20
 Apr 2022 16:52:47 -0700 (PDT)
Date: Wed, 20 Apr 2022 23:52:22 +0000
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
Message-Id: <20220420235228.2767816-5-tjmercier@google.com>
Mime-Version: 1.0
References: <20220420235228.2767816-1-tjmercier@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [RFC v5 4/6] dmabuf: Add gpu cgroup charge transfer function
From: "T.J. Mercier" <tjmercier@google.com>
To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org,
        Sumit Semwal <sumit.semwal@linaro.org>,
        "=?UTF-8?q?Christian=20K=C3=B6nig?=" <christian.koenig@amd.com>,
        Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>
Cc: hridya@google.com, jstultz@google.com, tkjos@android.com,
        cmllamas@google.com, surenb@google.com, kaleshsingh@google.com,
        Kenny.Ho@amd.com, mkoutny@suse.com, skhan@linuxfoundation.org,
        kernel-team@android.com, linux-media@vger.kernel.org,
        dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
        linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

The dma_buf_transfer_charge function provides a way for a process to
transfer the charge for a buffer to a different process. This is
essential for cases where a central allocator process performs
allocations for various subsystems, hands the fd over to the client that
requested the memory, and then drops all of its references to the
allocated memory.

Originally-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v5 changes
Fix commit message which still contained the old name for
dma_buf_transfer_charge per Michal Koutn=C3=BD.

Modify the dma_buf_transfer_charge API to accept a task_struct instead
of a gpucg. This avoids requiring the caller to manage the refcount
of the gpucg upon failure and confusing ownership transfer logic.

v4 changes
Adjust ordering of charge/uncharge during transfer to avoid potentially
hitting cgroup limit per Michal Koutn=C3=BD.
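
To illustrate why the uncharge comes first, here is a small userspace
model of hierarchical counters. The model_* names are hypothetical and
this is only a sketch of the ordering, not the page_counter
implementation; it assumes, as with page_counter, that a charge is
applied to the counter and to every ancestor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hierarchical counter; parent is NULL at the root. */
struct model_counter {
	uint64_t usage;
	uint64_t max;
	struct model_counter *parent;
};

/* Charge @counter and all ancestors, or charge nothing on failure. */
static bool model_try_charge(struct model_counter *counter, uint64_t nr_pages)
{
	struct model_counter *c, *undo;

	for (c = counter; c; c = c->parent) {
		if (c->usage + nr_pages > c->max) {
			/* Roll back the levels charged so far. */
			for (undo = counter; undo != c; undo = undo->parent)
				undo->usage -= nr_pages;
			return false;
		}
		c->usage += nr_pages;
	}
	return true;
}

static void model_uncharge(struct model_counter *counter, uint64_t nr_pages)
{
	struct model_counter *c;

	for (c = counter; c; c = c->parent) {
		assert(c->usage >= nr_pages);
		c->usage -= nr_pages;
	}
}

/*
 * Uncharge the source first so a shared ancestor never sees the pages
 * counted twice, then charge the destination. If that fails, restore
 * the source charge. The real code holds gpucg_mutex across all of
 * this, so the restore cannot lose a race with another charge.
 */
static int model_transfer(struct model_counter *src,
			  struct model_counter *dst, uint64_t nr_pages)
{
	bool restored;

	model_uncharge(src, nr_pages);
	if (model_try_charge(dst, nr_pages))
		return 0;

	restored = model_try_charge(src, nr_pages);
	assert(restored);
	(void)restored;
	return -1; /* the real code returns -ENOMEM */
}
```

With a parent limited to 4 pages and 3 pages charged through one child,
charging the sibling before uncharging would momentarily need 6 pages at
the parent and fail, while the uncharge-first order succeeds.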

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian K=C3=B6nig.
---
 drivers/dma-buf/dma-buf.c  | 57 +++++++++++++++++++++++++++++++++++
 include/linux/cgroup_gpu.h | 14 +++++++++
 include/linux/dma-buf.h    |  6 ++++
 kernel/cgroup/gpu.c        | 62 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 139 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index bc89c44bd9b9..f3fb844925e2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1341,6 +1341,63 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct i=
osys_map *map)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap, DMA_BUF);
=20
+/**
+ * dma_buf_transfer_charge - Change the GPU cgroup to which the provided d=
ma_buf is charged.
+ * @dmabuf:	[in]	buffer whose charge will be migrated to a different GPU c=
group
+ * @target:	[in]	the task_struct of the destination process for the GPU cg=
roup charge
+ *
+ * Only tasks that belong to the same cgroup the buffer is currently charg=
ed to
+ * may call this function, otherwise it will return -EPERM.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *ta=
rget)
+{
+	struct gpucg *current_gpucg, *target_gpucg, *to_release;
+	int ret;
+
+	if (!dmabuf->gpucg || !dmabuf->gpucg_bucket) {
+		/* This dmabuf is not tracked under GPU cgroup accounting */
+		return 0;
+	}
+
+	current_gpucg =3D gpucg_get(current);
+	target_gpucg =3D gpucg_get(target);
+	to_release =3D target_gpucg;
+
+	/* If the source and destination cgroups are the same, don't do anything.=
 */
+	if (current_gpucg =3D=3D target_gpucg) {
+		ret =3D 0;
+		goto skip_transfer;
+	}
+
+	/*
+	 * Verify that the cgroup of the process requesting the transfer
+	 * is the same as the one the buffer is currently charged to.
+	 */
+	mutex_lock(&dmabuf->lock);
+	if (current_gpucg !=3D dmabuf->gpucg) {
+		ret =3D -EPERM;
+		goto err;
+	}
+
+	ret =3D gpucg_transfer_charge(
+		dmabuf->gpucg, target_gpucg, dmabuf->gpucg_bucket, dmabuf->size);
+	if (ret)
+		goto err;
+
+	to_release =3D dmabuf->gpucg;
+	dmabuf->gpucg =3D target_gpucg;
+
+err:
+	mutex_unlock(&dmabuf->lock);
+skip_transfer:
+	gpucg_put(current_gpucg);
+	gpucg_put(to_release);
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_transfer_charge, DMA_BUF);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
index 4dfe633d6ec7..f5973ef9f926 100644
--- a/include/linux/cgroup_gpu.h
+++ b/include/linux/cgroup_gpu.h
@@ -83,7 +83,13 @@ static inline struct gpucg *gpucg_parent(struct gpucg *c=
g)
 }
=20
 int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 siz=
e);
+
 void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 =
size);
+
+int gpucg_transfer_charge(struct gpucg *source,
+			  struct gpucg *dest,
+			  struct gpucg_bucket *bucket,
+			  u64 size);
 int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name);
 #else /* CONFIG_CGROUP_GPU */
=20
@@ -118,6 +124,14 @@ static inline void gpucg_uncharge(struct gpucg *gpucg,
 				  struct gpucg_bucket *bucket,
 				  u64 size) {}
=20
+static inline int gpucg_transfer_charge(struct gpucg *source,
+					struct gpucg *dest,
+					struct gpucg_bucket *bucket,
+					u64 size)
+{
+	return 0;
+}
+
 static inline int gpucg_register_bucket(struct gpucg_bucket *bucket, const=
 char *name) {}
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* _CGROUP_GPU_H */
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8e7c55c830b3..438ad8577b76 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -18,6 +18,7 @@
 #include <linux/file.h>
 #include <linux/err.h>
 #include <linux/scatterlist.h>
+#include <linux/sched.h>
 #include <linux/list.h>
 #include <linux/dma-mapping.h>
 #include <linux/fs.h>
@@ -650,9 +651,14 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct ios=
ys_map *map);
 void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
 				struct gpucg *gpucg,
 				struct gpucg_bucket *gpucg_bucket);
+
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *ta=
rget);
 #else/* CONFIG_CGROUP_GPU */
 static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *=
exp_info,
 					      struct gpucg *gpucg,
 					      struct gpucg_bucket *gpucg_bucket) {}
+
+static inline int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct t=
ask_struct *target)
+{ return 0; }
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */
diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
index 34d0a5b85834..7dfbe0fd7e45 100644
--- a/kernel/cgroup/gpu.c
+++ b/kernel/cgroup/gpu.c
@@ -252,6 +252,68 @@ void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_=
bucket *bucket, u64 size)
 	css_put(&gpucg->css);
 }
=20
+/**
+ * gpucg_transfer_charge - Transfer a GPU charge from one cgroup to anothe=
r.
+ *
+ * @source:	[in]	The GPU cgroup the charge will be transferred from.
+ * @dest:	[in]	The GPU cgroup the charge will be transferred to.
+ * @bucket:	[in]	The GPU cgroup bucket corresponding to the charge.
+ * @size:	[in]	The size of the memory in bytes.
+ *                      This size will be rounded up to the nearest page s=
ize.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int gpucg_transfer_charge(struct gpucg *source,
+			  struct gpucg *dest,
+			  struct gpucg_bucket *bucket,
+			  u64 size)
+{
+	struct page_counter *counter;
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp_source, *rp_dest;
+	int ret =3D 0;
+
+	nr_pages =3D PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+	mutex_lock(&gpucg_mutex);
+	rp_source =3D cg_rpool_find_locked(source, bucket);
+	if (unlikely(!rp_source)) {
+		ret =3D -ENOENT;
+		goto exit_early;
+	}
+
+	rp_dest =3D cg_rpool_get_locked(dest, bucket);
+	if (IS_ERR(rp_dest)) {
+		ret =3D PTR_ERR(rp_dest);
+		goto exit_early;
+	}
+
+	/*
+	 * First uncharge from the pool it's currently charged to. This ordering =
avoids double
+	 * charging while the transfer is in progress, which could cause us to hi=
t a limit.
+	 * If the try_charge fails for this transfer, we need to be able to rever=
se this uncharge,
+	 * so we continue to hold the gpucg_mutex here.
+	 */
+	page_counter_uncharge(&rp_source->total, nr_pages);
+	css_put(&source->css);
+
+	/* Now attempt the new charge */
+	if (page_counter_try_charge(&rp_dest->total, nr_pages, &counter)) {
+		css_get(&dest->css);
+	} else {
+		/*
+		 * The new charge failed, so reverse the uncharge from above. This shoul=
d always
+		 * succeed since charges on source are blocked by gpucg_mutex.
+		 */
+		WARN_ON(!page_counter_try_charge(&rp_source->total, nr_pages, &counter));
+		css_get(&source->css);
+		ret =3D -ENOMEM;
+	}
+exit_early:
+	mutex_unlock(&gpucg_mutex);
+	return ret;
+}
+
 /**
  * gpucg_register_bucket - Registers a bucket for memory accounting using =
the
  * GPU cgroup controller.
--=20
2.36.0.rc0.470.gd361397f0d-goog
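
The ordering used by gpucg_transfer_charge() above (uncharge the source,
try to charge the destination, restore the source on failure) can be
modelled in userspace. The following is only a sketch under stated
assumptions: struct sketch_counter and every sketch_* name are
hypothetical stand-ins, the counter is flat (the real page_counter is
hierarchical), the page size is assumed to be 4096 bytes, and there is no
locking, whereas the kernel code holds gpucg_mutex for the whole sequence.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Flat stand-in for a page_counter: no hierarchy, no locking. */
struct sketch_counter {
	uint64_t usage;	/* pages currently charged */
	uint64_t max;	/* limit in pages */
};

/* Round a byte count up to whole pages, like PAGE_ALIGN(size) >> PAGE_SHIFT. */
static uint64_t sketch_bytes_to_pages(uint64_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Charge nr_pages if the limit allows it; returns true on success. */
static bool sketch_try_charge(struct sketch_counter *c, uint64_t nr_pages)
{
	if (c->usage > c->max || nr_pages > c->max - c->usage)
		return false;
	c->usage += nr_pages;
	return true;
}

static void sketch_uncharge(struct sketch_counter *c, uint64_t nr_pages)
{
	assert(c->usage >= nr_pages);
	c->usage -= nr_pages;
}

/* Returns 0 on success or -ENOMEM if the destination is over its limit. */
static int sketch_transfer(struct sketch_counter *src,
			   struct sketch_counter *dst, uint64_t size)
{
	uint64_t nr_pages = sketch_bytes_to_pages(size);
	bool restored;

	/* Uncharge first so the pages are never counted twice. */
	sketch_uncharge(src, nr_pages);

	if (sketch_try_charge(dst, nr_pages))
		return 0;

	/* Destination is full: put the charge back on the source. */
	restored = sketch_try_charge(src, nr_pages);
	assert(restored);
	(void)restored;
	return -ENOMEM;
}
```

In this model the rollback cannot fail because nothing else touches the
source between the uncharge and the re-charge. The kernel version relies
on gpucg_mutex for the same guarantee.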
From nobody Mon May 11 00:10:12 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 6D77CC433EF
	for <linux-kernel@archiver.kernel.org>; Wed, 20 Apr 2022 23:53:09 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1383341AbiDTXzw (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 20 Apr 2022 19:55:52 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33900 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1383345AbiDTXzl (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Apr 2022 19:55:41 -0400
Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com
 [IPv6:2607:f8b0:4864:20::1149])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 74EEC3E0CA
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:52 -0700 (PDT)
Received: by mail-yw1-x1149.google.com with SMTP id
 00721157ae682-2ec0490dc1bso29488977b3.5
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:52 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc:content-transfer-encoding;
        bh=/gvN+QpYV7OLh/aEKqlMjgKuKem4NnQTY6rvHp91u8M=;
        b=sMA3qf7elQFHyscq19H1EpWDhaRa/I7ci48fDvM7QclHAcQqhkQtppdGXfJls2ija/
         MvtUn9QqCmeX2jXoh9aXOsqLWBhS95CDoqK3LEcBjVCrsZn3rhJ6gF2q3wsCJy6141rn
         HXihJHcxQFcvvrhcZnXaBAOeuPSzPOu5OP6IP7nryfGBwy/4jH9/4rQvdFweYWS4ZJ7o
         iFBWJJboWcdIl0qxM5PnlYsSTP+g7LGwTte4o0TcoqvscLQAxZbvTu5sODCDXJ8Q1tgA
         rw3i8pgp7ryIMuZezHAtqRxaB/qJTOcE2DcxLzUs4No3oDDwuF4FO1tWii3lpPVJrOBb
         r2rg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc:content-transfer-encoding;
        bh=/gvN+QpYV7OLh/aEKqlMjgKuKem4NnQTY6rvHp91u8M=;
        b=VCbDRte46m4pzY5zT5mftotum/E01Fz3ScMKVyxQupjKgKC9QeUvSmz/+OCqf3cx3r
         UrOpwMsInX8n/Q6aRFAluJWHW7mqiQTxPw0bccVklwU8Gv3V7VIpdRn2303SWDnzE6qW
         t1wUXm1WVbOm+pr4VlyoJLykU4JlvolRF14DkcUR9y6Kkh6uInAwv0obL81rZBLH8ety
         jSkHA3QaF9LVciaVQ52cwvn0CBVPnF040NZBYNAhcvmZCTL4ogmWzTOV+kUSi3SJuvOl
         fX7DrYQaJnRtaELLMbPOPm3sEW1sYeQehjzxQ2ykt9g6C57r6xKyfyTPVn4X7bVpMWzX
         jzMA==
X-Gm-Message-State: AOAM5314xoBTrjqrj/8OdW7xeo26Ya/Na5k9K3Oqc6J5AJ37hjF6+prj
        7PB1SyBUp/3dqGVF2x6xbRMnLRw9hsLFwvs=
X-Google-Smtp-Source: 
 ABdhPJxTnlpo/iNagwkjPHbucxYH+CtEj8ohxJdE5GbeBLJ3G2GWRSAkV6DdlP8lN0r7l8RCXlv3YW3DGl+NeWs=
X-Received: from tj.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:53a])
 (user=tjmercier job=sendgmr) by 2002:a25:2d4d:0:b0:641:d14e:ff85 with SMTP id
 s13-20020a252d4d000000b00641d14eff85mr21685745ybe.128.1650498771646; Wed, 20
 Apr 2022 16:52:51 -0700 (PDT)
Date: Wed, 20 Apr 2022 23:52:23 +0000
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
Message-Id: <20220420235228.2767816-6-tjmercier@google.com>
Mime-Version: 1.0
References: <20220420235228.2767816-1-tjmercier@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [RFC v5 5/6] binder: Add flags to relinquish ownership of fds
From: "T.J. Mercier" <tjmercier@google.com>
To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org,
        Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        "=?UTF-8?q?Arve=20Hj=C3=B8nnev=C3=A5g?=" <arve@android.com>,
        Todd Kjos <tkjos@android.com>,
        Martijn Coenen <maco@android.com>,
        Joel Fernandes <joel@joelfernandes.org>,
        Christian Brauner <brauner@kernel.org>,
        Hridya Valsaraju <hridya@google.com>,
        Suren Baghdasaryan <surenb@google.com>,
        Sumit Semwal <sumit.semwal@linaro.org>,
        "=?UTF-8?q?Christian=20K=C3=B6nig?=" <christian.koenig@amd.com>
Cc: jstultz@google.com, cmllamas@google.com, kaleshsingh@google.com,
        Kenny.Ho@amd.com, mkoutny@suse.com, skhan@linuxfoundation.org,
        kernel-team@android.com, linux-kernel@vger.kernel.org,
        linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
        linaro-mm-sig@lists.linaro.org
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

From: Hridya Valsaraju <hridya@google.com>

Introduce the flags BINDER_FD_FLAG_SENDER_NO_NEED and
BINDER_FDA_FLAG_SENDER_NO_NEED. A process sending an individual fd or an
fd array to another process over binder IPC can set them to relinquish
ownership of the fds for memory accounting purposes. If the flag is set
when the fd or fd array is translated and the fd refers to a DMA-BUF,
the buffer is uncharged from the sender's cgroup and charged to the
receiving process's cgroup instead. With CONFIG_CGROUP_GPU enabled,
setting the flag on an fd that is not a DMA-BUF is rejected with -EINVAL.

It is up to the sending process to ensure that it closes the fds
regardless of whether the transfer failed or succeeded.
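
As an illustration of the sender side, a minimal sketch of filling in an
fd object with the flag set might look like the code below. It is not
taken from any real userspace: struct sketch_fd_object is a simplified,
hypothetical stand-in for struct binder_fd_object as changed by this
patch, sketch_fill_fd_object() is an invented helper, and the flag value
is copied from this patch because mainline uapi headers do not define it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag value as defined by this patch; not present in mainline headers. */
#define BINDER_FD_FLAG_SENDER_NO_NEED 0x2000u

/*
 * Hypothetical stand-in for struct binder_fd_object after this patch.
 * The real struct starts with a struct binder_object_header and uses
 * binder_uintptr_t for the pointer-sized fields.
 */
struct sketch_fd_object {
	uint32_t hdr_type;
	uint32_t flags;		/* named pad_flags before this patch */
	union {
		uint64_t pad_binder;
		uint32_t fd;
	};
	uint64_t cookie;
};

/* Fill an fd object; pass relinquish != 0 to hand the charge over. */
static void sketch_fill_fd_object(struct sketch_fd_object *obj,
				  uint32_t type, int fd, int relinquish)
{
	assert(obj != NULL);
	assert(fd >= 0);

	memset(obj, 0, sizeof(*obj));
	obj->hdr_type = type;
	obj->fd = (uint32_t)fd;
	if (relinquish)
		obj->flags |= BINDER_FD_FLAG_SENDER_NO_NEED;
}
```

After the transaction is submitted the sender would still close its own
copy of the fd, whether or not the transfer succeeded.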

Most graphics shared memory allocations in Android are done by the
graphics allocator HAL process. On requests from clients, the HAL process
allocates memory and sends the fds to the clients over binder IPC.
The graphics allocator HAL will not retain any references to the
buffers. When the HAL sets *_FLAG_SENDER_NO_NEED for fd arrays holding
DMA-BUF fds, or individual fd objects, the gpu cgroup controller will
be able to correctly charge the buffers to the client processes instead
of the graphics allocator HAL.

Since this is a new feature exposed to userspace, the kernel and userspace
must both support it for transfer accounting to work. In all cases the
allocation and transport of DMA buffers via binder will succeed, but the
charge is only transferred when the kernel supports the feature and
userspace uses it. The possible scenarios are detailed below:

1. new kernel + old userspace
The kernel supports the feature but userspace does not use it. The old
userspace won't mount the new cgroup controller, so accounting is not
performed and the charge is not transferred.

2. old kernel + new userspace
The kernel does not support the new cgroup controller, so accounting is
not performed and the charge is not transferred.

3. old kernel + old userspace
Same as #2.

4. new kernel + new userspace
The cgroup controller is mounted, and the feature is supported and used.
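
Userspace can tell these cases apart by checking whether the controller
is listed in cgroup.controllers, which is what the selftest in this series
does with the name "gpu". A rough sketch of such a check on the file's
contents follows. controller_listed() is a hypothetical helper, and the
whole-token match is a choice made for this sketch so that a longer name
containing "gpu" does not match.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if name appears as a whole whitespace-separated token in
 * list, e.g. the contents of a cgroup.controllers file.
 */
static bool controller_listed(const char *list, const char *name)
{
	size_t name_len;
	const char *p = list;

	assert(list != NULL && name != NULL);

	name_len = strlen(name);
	if (name_len == 0)
		return false;

	while (*p) {
		size_t tok_len;

		/* Skip separators, then measure the next token. */
		p += strspn(p, " \t\n");
		tok_len = strcspn(p, " \t\n");
		if (tok_len == name_len && strncmp(p, name, name_len) == 0)
			return true;
		p += tok_len;
	}
	return false;
}
```

A caller would read cgroup.controllers from the cgroup root into a buffer
and pass it in. When the controller is not listed, scenarios 1 to 3 apply
and no charge is transferred.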

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Reviewed-by: Carlos Llamas <cmllamas@google.com>
---
v5 changes
Support both binder_fd_array_object and binder_fd_object. This is
necessary because new versions of Android will use binder_fd_object
instead of binder_fd_array_object, and we need to support both.

Use the new, simpler dma_buf_transfer_charge API.

v3 changes
Remove android from title per Todd Kjos.

Use more common dual author commit message format per John Stultz.

Include details on behavior for all combinations of kernel/userspace
versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian K=C3=B6nig.
---
 drivers/android/binder.c            | 27 +++++++++++++++++++++++----
 drivers/dma-buf/dma-buf.c           |  4 ++--
 include/linux/dma-buf.h             |  2 +-
 include/uapi/linux/android/binder.h | 23 +++++++++++++++++++----
 4 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 8351c5638880..b07d50fe1c80 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -42,6 +42,7 @@
=20
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
=20
+#include <linux/dma-buf.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
@@ -2170,7 +2171,7 @@ static int binder_translate_handle(struct flat_binder=
_object *fp,
 	return ret;
 }
=20
-static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
+static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flag=
s,
 			       struct binder_transaction *t,
 			       struct binder_thread *thread,
 			       struct binder_transaction *in_reply_to)
@@ -2208,6 +2209,23 @@ static int binder_translate_fd(u32 fd, binder_size_t=
 fd_offset,
 		goto err_security;
 	}
=20
+	if (IS_ENABLED(CONFIG_CGROUP_GPU) && (flags & BINDER_FD_FLAG_SENDER_NO_NE=
ED)) {
+		if (is_dma_buf_file(file)) {
+			struct dma_buf *dmabuf =3D file->private_data;
+
+			ret =3D dma_buf_transfer_charge(dmabuf, target_proc->tsk);
+			if (ret)
+				pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
+					proc->pid, thread->pid, target_proc->pid);
+		} else {
+			binder_user_error(
+				"%d:%d got transaction with SENDER_NO_NEED for non-dmabuf fd, %d\n",
+				proc->pid, thread->pid, fd);
+			ret =3D -EINVAL;
+			goto err_noneed;
+		}
+	}
+
 	/*
 	 * Add fixup record for this transaction. The allocation
 	 * of the fd in the target needs to be done from a
@@ -2226,6 +2244,7 @@ static int binder_translate_fd(u32 fd, binder_size_t =
fd_offset,
 	return ret;
=20
 err_alloc:
+err_noneed:
 err_security:
 	fput(file);
 err_fget:
@@ -2528,7 +2547,7 @@ static int binder_translate_fd_array(struct list_head=
 *pf_head,
=20
 		ret =3D copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd=
));
 		if (!ret)
-			ret =3D binder_translate_fd(fd, offset, t, thread,
+			ret =3D binder_translate_fd(fd, offset, fda->flags, t, thread,
 						  in_reply_to);
 		if (ret)
 			return ret > 0 ? -EINVAL : ret;
@@ -3179,8 +3198,8 @@ static void binder_transaction(struct binder_proc *pr=
oc,
 			struct binder_fd_object *fp =3D to_binder_fd_object(hdr);
 			binder_size_t fd_offset =3D object_offset +
 				(uintptr_t)&fp->fd - (uintptr_t)fp;
-			int ret =3D binder_translate_fd(fp->fd, fd_offset, t,
-						      thread, in_reply_to);
+			int ret =3D binder_translate_fd(fp->fd, fd_offset, fp->flags,
+						      t, thread, in_reply_to);
=20
 			fp->pad_binder =3D 0;
 			if (ret < 0 ||
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f3fb844925e2..36ed6cd4ddcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -31,7 +31,6 @@
=20
 #include "dma-buf-sysfs-stats.h"
=20
-static inline int is_dma_buf_file(struct file *);
=20
 struct dma_buf_list {
 	struct list_head head;
@@ -400,10 +399,11 @@ static const struct file_operations dma_buf_fops =3D {
 /*
  * is_dma_buf_file - Check if struct file* is associated with dma_buf
  */
-static inline int is_dma_buf_file(struct file *file)
+int is_dma_buf_file(struct file *file)
 {
 	return file->f_op =3D=3D &dma_buf_fops;
 }
+EXPORT_SYMBOL_NS_GPL(is_dma_buf_file, DMA_BUF);
=20
 static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 438ad8577b76..2b9812758fee 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -614,7 +614,7 @@ dma_buf_attachment_is_dynamic(struct dma_buf_attachment=
 *attach)
 {
 	return !!attach->importer_ops;
 }
-
+int is_dma_buf_file(struct file *file);
 struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 					  struct device *dev);
 struct dma_buf_attachment *
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/andro=
id/binder.h
index 11157fae8a8e..b263cbb603ea 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -91,14 +91,14 @@ struct flat_binder_object {
 /**
  * struct binder_fd_object - describes a filedescriptor to be fixed up.
  * @hdr:	common header structure
- * @pad_flags:	padding to remain compatible with old userspace code
+ * @flags:	One or more BINDER_FD_FLAG_* flags
  * @pad_binder:	padding to remain compatible with old userspace code
  * @fd:		file descriptor
  * @cookie:	opaque data, used by user-space
  */
 struct binder_fd_object {
 	struct binder_object_header	hdr;
-	__u32				pad_flags;
+	__u32				flags;
 	union {
 		binder_uintptr_t	pad_binder;
 		__u32			fd;
@@ -107,6 +107,17 @@ struct binder_fd_object {
 	binder_uintptr_t		cookie;
 };
=20
+enum {
+	/**
+	 * @BINDER_FD_FLAG_SENDER_NO_NEED
+	 *
+	 * When set, the sender of a binder_fd_object wishes to relinquish owners=
hip of the fd for
+	 * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is =
uncharged from the
+	 * sender's cgroup and charged to the receiving process's cgroup instead.
+	 */
+	BINDER_FD_FLAG_SENDER_NO_NEED =3D 0x2000,
+};
+
 /* struct binder_buffer_object - object describing a userspace buffer
  * @hdr:		common header structure
  * @flags:		one or more BINDER_BUFFER_* flags
@@ -141,7 +152,7 @@ enum {
=20
 /* struct binder_fd_array_object - object describing an array of fds in a =
buffer
  * @hdr:		common header structure
- * @pad:		padding to ensure correct alignment
+ * @flags:		One or more BINDER_FDA_FLAG_* flags
  * @num_fds:		number of file descriptors in the buffer
  * @parent:		index in offset array to buffer holding the fd array
  * @parent_offset:	start offset of fd array in the buffer
@@ -162,12 +173,16 @@ enum {
  */
 struct binder_fd_array_object {
 	struct binder_object_header	hdr;
-	__u32				pad;
+	__u32				flags;
 	binder_size_t			num_fds;
 	binder_size_t			parent;
 	binder_size_t			parent_offset;
 };
=20
+enum {
+	BINDER_FDA_FLAG_SENDER_NO_NEED =3D BINDER_FD_FLAG_SENDER_NO_NEED,
+};
+
 /*
  * On 64-bit platforms where user code may run in 32-bits the driver must
  * translate the buffer (and local binder) addresses appropriately.
--=20
2.36.0.rc0.470.gd361397f0d-goog
From nobody Mon May 11 00:10:12 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C2BC8C433EF
	for <linux-kernel@archiver.kernel.org>; Wed, 20 Apr 2022 23:53:17 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1383313AbiDTX4C (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 20 Apr 2022 19:56:02 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33796 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1383344AbiDTXzs (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Apr 2022 19:55:48 -0400
Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com
 [IPv6:2607:f8b0:4864:20::b4a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 03CBF3DDC4
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:57 -0700 (PDT)
Received: by mail-yb1-xb4a.google.com with SMTP id
 b11-20020a5b008b000000b00624ea481d55so2884001ybp.19
        for <linux-kernel@vger.kernel.org>;
 Wed, 20 Apr 2022 16:52:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=SHckm8gWB+C8sWZfu1d4XvMUSU7hfDqBV0cKv+qwB2g=;
        b=f2tx1CKEjfz0S8U1u7eZDDYBTIk7xcqnnE2IBqreG9F/M/7GyOv+pz9dLCWnQ3LLBD
         YD+++bdRmpT6+9F7dMbce8LnYDLCLLuCaY1FBjMeCfF2JB/tZR6ffL8XxAcHAF/e17G/
         4k17w2dP7I1kPoyKJi0X0SeQneOWcaUV8MCt1mrZsR+fetsCJsJi741hBeV92eKE61Gb
         uEs7/48Tor18dHBSZNqTlKM5CpXdsWUgeeDbl+gH7W+LfSwE/17HtoaBpKtQj3RU5kNH
         WPDlGErrXr6d65nGIAJsa5C9PEWIafC5K8Ox3HPrya3/L+riS3+HanZrFCj4nlGbeDxK
         qfyg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=SHckm8gWB+C8sWZfu1d4XvMUSU7hfDqBV0cKv+qwB2g=;
        b=mkLMohb0TV5YEBPnobHXOMsEyMkIwGXHQkj+PrdCAe5mbj6R7CopbkyASgDRL3z/rc
         j2cPxpdnuNbk/LTmXWeDUPkrYWjZZXCG08Ft/efGakcP+Swrp+sRsBJNOIUxCmXcOCV+
         NoH1zplbqBEBmBe8xJCSUHl8oghrgweHjC/wpFC5zIpJVlaJyB4x3BTUoS3MNSwljjFL
         UPa8KONBw0YGVRsb7gDmLcJZI2/Z/dAEEGVFH+C8WXr3urvoLfJcqLdFFycolI995pmK
         bfnt9Rd0jvBFLd20vm95S6661U187vYovXOmJnyUErmRRI1+THE+r9l6n4EQYHA0RYKT
         ADqA==
X-Gm-Message-State: AOAM530AJoBx9+HfOIsXdclVFd+11ES5B4Xzibmjye2np1XEQzticrha
        L4tqiaQG6Okg+SmEld6R46GzC/ORfg4Yu5Y=
X-Google-Smtp-Source: 
 ABdhPJzfi5kO1YkVumHpXwb5efjxrBvEA0VisjaAXgBSSParlVdw1d/Sv0fRyoT8k9WTxLqVc+nIvVQU7OSlREo=
X-Received: from tj.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:53a])
 (user=tjmercier job=sendgmr) by 2002:a81:57c6:0:b0:2f4:d5b6:dc94 with SMTP id
 l189-20020a8157c6000000b002f4d5b6dc94mr874214ywb.90.1650498776211; Wed, 20
 Apr 2022 16:52:56 -0700 (PDT)
Date: Wed, 20 Apr 2022 23:52:24 +0000
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
Message-Id: <20220420235228.2767816-7-tjmercier@google.com>
Mime-Version: 1.0
References: <20220420235228.2767816-1-tjmercier@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [RFC v5 6/6] selftests: Add binder cgroup gpu memory transfer tests
From: "T.J. Mercier" <tjmercier@google.com>
To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org,
        Shuah Khan <shuah@kernel.org>
Cc: hridya@google.com, christian.koenig@amd.com, jstultz@google.com,
        tkjos@android.com, cmllamas@google.com, surenb@google.com,
        kaleshsingh@google.com, Kenny.Ho@amd.com, mkoutny@suse.com,
        skhan@linuxfoundation.org, kernel-team@android.com,
        linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

These tests verify that the cgroup GPU memory charge is transferred
correctly when a dmabuf is passed between processes in two different
cgroups and the sender specifies BINDER_FD_FLAG_SENDER_NO_NEED or
BINDER_FDA_FLAG_SENDER_NO_NEED in the binder transaction data
containing the dmabuf file descriptor.

Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v5 changes
Tests for both binder_fd_array_object and binder_fd_object.

Return error code instead of struct binder{fs}_ctx.

Use ifdef __ANDROID__ to choose platform-dependent temp path instead of a
runtime fallback.

Ensure binderfs_mntpt ends with a trailing '/' character instead of
prepending it where used.

v4 changes
Skip test if not run as root per Shuah Khan.

Add better logging for abnormal child termination per Shuah Khan.
---
 .../selftests/drivers/android/binder/Makefile |   8 +
 .../drivers/android/binder/binder_util.c      | 250 +++++++++
 .../drivers/android/binder/binder_util.h      |  32 ++
 .../selftests/drivers/android/binder/config   |   4 +
 .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
 5 files changed, 820 insertions(+)
 create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_u=
til.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_u=
til.h
 create mode 100644 tools/testing/selftests/drivers/android/binder/config
 create mode 100644 tools/testing/selftests/drivers/android/binder/test_dma=
buf_cgroup_transfer.c

diff --git a/tools/testing/selftests/drivers/android/binder/Makefile b/tool=
s/testing/selftests/drivers/android/binder/Makefile
new file mode 100644
index 000000000000..726439d10675
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS +=3D -Wall
+
+TEST_GEN_PROGS =3D test_dmabuf_cgroup_transfer
+
+include ../../../lib.mk
+
+$(OUTPUT)/test_dmabuf_cgroup_transfer: ../../../cgroup/cgroup_util.c binde=
r_util.c
diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.c b=
/tools/testing/selftests/drivers/android/binder/binder_util.c
new file mode 100644
index 000000000000..cdd97cb0bb60
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/binder_util.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "binder_util.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+
+#include <linux/limits.h>
+#include <linux/android/binder.h>
+#include <linux/android/binderfs.h>
+
+static const size_t BINDER_MMAP_SIZE =3D 64 * 1024;
+
+static void binderfs_unmount(const char *mountpoint)
+{
+	if (umount2(mountpoint, MNT_DETACH))
+		fprintf(stderr, "Failed to unmount binderfs at %s: %s\n",
+			mountpoint, strerror(errno));
+	else
+		fprintf(stderr, "Binderfs unmounted: %s\n", mountpoint);
+
+	if (rmdir(mountpoint))
+		fprintf(stderr, "Failed to remove binderfs mount %s: %s\n",
+			mountpoint, strerror(errno));
+	else
+		fprintf(stderr, "Binderfs mountpoint destroyed: %s\n", mountpoint);
+}
+
+int create_binderfs(struct binderfs_ctx *ctx, const char *name)
+{
+	int fd, ret, saved_errno;
+	struct binderfs_device device =3D { 0 };
+
+	/*
+	 * P_tmpdir is set to "/tmp/" on Android platforms where Binder is most c=
ommonly used, but
+	 * this path does not actually exist on Android. For Android we'll try us=
ing
+	 * "/data/local/tmp" and P_tmpdir for non-Android platforms.
+	 *
+	 * This mount point should have a trailing '/' character, but mkdtemp req=
uires that the last
+	 * six characters (before the first null terminator) must be "XXXXXX". Ma=
nually append an
+	 * additional null character in the string literal to allocate a characte=
r array of the
+	 * correct final size, which we will replace with a '/' after successful =
completion of the
+	 * mkdtemp call.
+	 */
+#ifdef __ANDROID__
+	char binderfs_mntpt[] =3D "/data/local/tmp/binderfs_XXXXXX\0";
+#else
+	/* P_tmpdir may or may not contain a trailing '/' separator. We always ap=
pend one here. */
+	char binderfs_mntpt[] =3D P_tmpdir "/binderfs_XXXXXX\0";
+#endif
+	static const char BINDER_CONTROL_NAME[] =3D "binder-control";
+	char device_path[strlen(binderfs_mntpt) + 1 + strlen(BINDER_CONTROL_NAME)=
 + 1];
+
+	if (mkdtemp(binderfs_mntpt) =3D=3D NULL) {
+		fprintf(stderr, "Failed to create binderfs mountpoint at %s: %s.\n",
+			binderfs_mntpt, strerror(errno));
+		return -1;
+	}
+	binderfs_mntpt[strlen(binderfs_mntpt)] =3D '/';
+	fprintf(stderr, "Binderfs mountpoint created at %s\n", binderfs_mntpt);
+
+	if (mount(NULL, binderfs_mntpt, "binder", 0, 0)) {
+		perror("Could not mount binderfs");
+		rmdir(binderfs_mntpt);
+		return -1;
+	}
+	fprintf(stderr, "Binderfs mounted at %s\n", binderfs_mntpt);
+
+	strncpy(device.name, name, sizeof(device.name) - 1);
+	snprintf(device_path, sizeof(device_path), "%s%s", binderfs_mntpt, BINDER=
_CONTROL_NAME);
+	fd =3D open(device_path, O_RDONLY | O_CLOEXEC);
+	if (fd < 0) {
+		fprintf(stderr, "Failed to open %s device\n", BINDER_CONTROL_NAME);
+		binderfs_unmount(binderfs_mntpt);
+		return -1;
+	}
+
+	ret =3D ioctl(fd, BINDER_CTL_ADD, &device);
+	saved_errno =3D errno;
+	close(fd);
+	errno =3D saved_errno;
+	if (ret) {
+		perror("Failed to allocate new binder device");
+		binderfs_unmount(binderfs_mntpt);
+		return -1;
+	}
+
+	fprintf(stderr, "Allocated new binder device with major %d, minor %d, and=
 name %s at %s\n",
+		device.major, device.minor, device.name, binderfs_mntpt);
+
+	ctx->name =3D strdup(name);
+	ctx->mountpoint =3D strdup(binderfs_mntpt);
+
+	return 0;
+}
+
+void destroy_binderfs(struct binderfs_ctx *ctx)
+{
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), "%s%s", ctx->mountpoint, ctx->name);
+
+	if (unlink(path))
+		fprintf(stderr, "Failed to unlink binder device %s: %s\n", path, strerro=
r(errno));
+	else
+		fprintf(stderr, "Destroyed binder %s at %s\n", ctx->name, ctx->mountpoin=
t);
+
+	binderfs_unmount(ctx->mountpoint);
+
+	free(ctx->name);
+	free(ctx->mountpoint);
+}
+
+int open_binder(const struct binderfs_ctx *bfs_ctx, struct binder_ctx *ctx)
+{
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), "%s%s", bfs_ctx->mountpoint, bfs_ctx->name);
+	ctx->fd =3D open(path, O_RDWR | O_NONBLOCK | O_CLOEXEC);
+	if (ctx->fd < 0) {
+		fprintf(stderr, "Error opening binder device %s: %s\n", path, strerror(e=
rrno));
+		return -1;
+	}
+
+	ctx->memory =3D mmap(NULL, BINDER_MMAP_SIZE, PROT_READ, MAP_SHARED, ctx->=
fd, 0);
+	if (ctx->memory =3D=3D MAP_FAILED) {
+		perror("Error mapping binder memory");
+		close(ctx->fd);
+		ctx->fd =3D -1;
+		return -1;
+	}
+
+	return 0;
+}
+
+void close_binder(struct binder_ctx *ctx)
+{
+	if (munmap(ctx->memory, BINDER_MMAP_SIZE))
+		perror("Failed to unmap binder memory");
+	ctx->memory =3D NULL;
+
+	if (close(ctx->fd))
+		perror("Failed to close binder");
+	ctx->fd =3D -1;
+}
+
+int become_binder_context_manager(int binder_fd)
+{
+	return ioctl(binder_fd, BINDER_SET_CONTEXT_MGR, 0);
+}
+
+int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writ=
esize,
+			 void *readbuf, binder_size_t readsize)
+{
+	int err;
+	struct binder_write_read bwr =3D {
+		.write_buffer =3D (binder_uintptr_t)writebuf,
+		.write_size =3D writesize,
+		.read_buffer =3D (binder_uintptr_t)readbuf,
+		.read_size =3D readsize
+	};
+
+	do {
+		if (ioctl(binder_fd, BINDER_WRITE_READ, &bwr) >=3D 0)
+			err =3D 0;
+		else
+			err =3D -errno;
+	} while (err =3D=3D -EINTR);
+
+	if (err < 0) {
+		perror("BINDER_WRITE_READ");
+		return -1;
+	}
+
+	if (bwr.write_consumed < writesize) {
+		fprintf(stderr, "Binder did not consume full write buffer %llu %llu\n",
+			bwr.write_consumed, writesize);
+		return -1;
+	}
+
+	return bwr.read_consumed;
+}
+
+static const char *reply_string(int cmd)
+{
+	switch (cmd) {
+	case BR_ERROR:
+		return "BR_ERROR";
+	case BR_OK:
+		return "BR_OK";
+	case BR_TRANSACTION_SEC_CTX:
+		return "BR_TRANSACTION_SEC_CTX";
+	case BR_TRANSACTION:
+		return "BR_TRANSACTION";
+	case BR_REPLY:
+		return "BR_REPLY";
+	case BR_ACQUIRE_RESULT:
+		return "BR_ACQUIRE_RESULT";
+	case BR_DEAD_REPLY:
+		return "BR_DEAD_REPLY";
+	case BR_TRANSACTION_COMPLETE:
+		return "BR_TRANSACTION_COMPLETE";
+	case BR_INCREFS:
+		return "BR_INCREFS";
+	case BR_ACQUIRE:
+		return "BR_ACQUIRE";
+	case BR_RELEASE:
+		return "BR_RELEASE";
+	case BR_DECREFS:
+		return "BR_DECREFS";
+	case BR_ATTEMPT_ACQUIRE:
+		return "BR_ATTEMPT_ACQUIRE";
+	case BR_NOOP:
+		return "BR_NOOP";
+	case BR_SPAWN_LOOPER:
+		return "BR_SPAWN_LOOPER";
+	case BR_FINISHED:
+		return "BR_FINISHED";
+	case BR_DEAD_BINDER:
+		return "BR_DEAD_BINDER";
+	case BR_CLEAR_DEATH_NOTIFICATION_DONE:
+		return "BR_CLEAR_DEATH_NOTIFICATION_DONE";
+	case BR_FAILED_REPLY:
+		return "BR_FAILED_REPLY";
+	case BR_FROZEN_REPLY:
+		return "BR_FROZEN_REPLY";
+	case BR_ONEWAY_SPAM_SUSPECT:
+		return "BR_ONEWAY_SPAM_SUSPECT";
+	default:
+		return "Unknown";
+	}
+}
+
+int expect_binder_reply(int32_t actual, int32_t expected)
+{
+	if (actual !=3D expected) {
+		fprintf(stderr, "Expected %s but received %s\n",
+			reply_string(expected), reply_string(actual));
+		return -1;
+	}
+	return 0;
+}
+
diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.h b=
/tools/testing/selftests/drivers/android/binder/binder_util.h
new file mode 100644
index 000000000000..adc2b20e8d0a
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/binder_util.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef SELFTEST_BINDER_UTIL_H
+#define SELFTEST_BINDER_UTIL_H
+
+#include <stdint.h>
+
+#include <linux/android/binder.h>
+
+struct binderfs_ctx {
+	char *name;
+	char *mountpoint;
+};
+
+struct binder_ctx {
+	int fd;
+	void *memory;
+};
+
+int create_binderfs(struct binderfs_ctx *ctx, const char *name);
+void destroy_binderfs(struct binderfs_ctx *ctx);
+
+int open_binder(const struct binderfs_ctx *bfs_ctx, struct binder_ctx *ctx=
);
+void close_binder(struct binder_ctx *ctx);
+
+int become_binder_context_manager(int binder_fd);
+
+int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writ=
esize,
+			 void *readbuf, binder_size_t readsize);
+
+int expect_binder_reply(int32_t actual, int32_t expected);
+#endif
diff --git a/tools/testing/selftests/drivers/android/binder/config b/tools/=
testing/selftests/drivers/android/binder/config
new file mode 100644
index 000000000000..fcc5f8f693b3
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/config
@@ -0,0 +1,4 @@
+CONFIG_CGROUP_GPU=3Dy
+CONFIG_ANDROID=3Dy
+CONFIG_ANDROID_BINDERFS=3Dy
+CONFIG_ANDROID_BINDER_IPC=3Dy
diff --git a/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgr=
oup_transfer.c b/tools/testing/selftests/drivers/android/binder/test_dmabuf=
_cgroup_transfer.c
new file mode 100644
index 000000000000..75ca49d1a2e4
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_tra=
nsfer.c
@@ -0,0 +1,526 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * These tests verify that the cgroup GPU memory charge is transferred cor=
rectly when a dmabuf is
+ * passed between processes in two different cgroups and the sender specif=
ies
+ * BINDER_FD_FLAG_SENDER_NO_NEED or BINDER_FDA_FLAG_SENDER_NO_NEED in the =
binder transaction data
+ * containing the dmabuf file descriptor.
+ *
+ * The parent test process becomes the binder context manager, then forks =
a child who initiates a
+ * transaction with the context manager by specifying a target of 0. The c=
ontext manager reply
+ * contains a dmabuf file descriptor (or an array of one file descriptor) =
which was allocated by the
+ * parent, but should be charged to the child cgroup after the binder tran=
saction.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/epoll.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include "binder_util.h"
+#include "../../../cgroup/cgroup_util.h"
+#include "../../../kselftest.h"
+#include "../../../kselftest_harness.h"
+
+#include <linux/limits.h>
+#include <linux/dma-heap.h>
+#include <linux/android/binder.h>
+
+#define UNUSED(x) ((void)(x))
+
+static const unsigned int BINDER_CODE =3D 8675309; /* Any number will work=
 here */
+
+struct cgroup_ctx {
+	char *root;
+	char *source;
+	char *dest;
+};
+
+void destroy_cgroups(struct __test_metadata *_metadata, struct cgroup_ctx =
*ctx)
+{
+	if (ctx->source !=3D NULL) {
+		TH_LOG("Destroying cgroup: %s", ctx->source);
+		rmdir(ctx->source);
+		free(ctx->source);
+	}
+
+	if (ctx->dest !=3D NULL) {
+		TH_LOG("Destroying cgroup: %s", ctx->dest);
+		rmdir(ctx->dest);
+		free(ctx->dest);
+	}
+
+	free(ctx->root);
+	ctx->root =3D ctx->source =3D ctx->dest =3D NULL;
+}
+
+struct cgroup_ctx create_cgroups(struct __test_metadata *_metadata)
+{
+	struct cgroup_ctx ctx =3D {0};
+	char root[PATH_MAX], *tmp;
+	static const char template[] =3D "/gpucg_XXXXXX";
+
+	if (cg_find_unified_root(root, sizeof(root))) {
+		TH_LOG("Could not find cgroups root");
+		return ctx;
+	}
+
+	if (cg_read_strstr(root, "cgroup.controllers", "gpu")) {
+		TH_LOG("Could not find GPU controller");
+		return ctx;
+	}
+
+	if (cg_write(root, "cgroup.subtree_control", "+gpu")) {
+		TH_LOG("Could not enable GPU controller");
+		return ctx;
+	}
+
+	ctx.root =3D strdup(root);
+
+	snprintf(root, sizeof(root), "%s%s", ctx.root, template);
+	tmp =3D mkdtemp(root);
+	if (tmp =3D=3D NULL) {
+		TH_LOG("%s - Could not create source cgroup", strerror(errno));
+		destroy_cgroups(_metadata, &ctx);
+		return ctx;
+	}
+	ctx.source =3D strdup(tmp);
+
+	snprintf(root, sizeof(root), "%s/%s", ctx.root, template);
+	tmp =3D mkdtemp(root);
+	if (tmp =3D=3D NULL) {
+		TH_LOG("%s - Could not create destination cgroup", strerror(errno));
+		destroy_cgroups(_metadata, &ctx);
+		return ctx;
+	}
+	ctx.dest =3D strdup(tmp);
+
+	TH_LOG("Created cgroups: %s %s", ctx.source, ctx.dest);
+
+	return ctx;
+}
+
+int dmabuf_heap_alloc(int fd, size_t len, int *dmabuf_fd)
+{
+	struct dma_heap_allocation_data data =3D {
+		.len =3D len,
+		.fd =3D 0,
+		.fd_flags =3D O_RDONLY | O_CLOEXEC,
+		.heap_flags =3D 0,
+	};
+	int ret;
+
+	if (!dmabuf_fd)
+		return -EINVAL;
+
+	ret =3D ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
+	if (ret < 0)
+		return ret;
+	*dmabuf_fd =3D (int)data.fd;
+	return ret;
+}
+
+/* The system heap is known to export dmabufs with support for cgroup trac=
king */
+int alloc_dmabuf_from_system_heap(struct __test_metadata *_metadata, size_=
t bytes)
+{
+	int heap_fd =3D -1, dmabuf_fd =3D -1;
+	static const char * const heap_path =3D "/dev/dma_heap/system";
+
+	heap_fd =3D open(heap_path, O_RDONLY);
+	if (heap_fd < 0) {
+		TH_LOG("%s - open %s failed!", strerror(errno), heap_path);
+		return -1;
+	}
+
+	if (dmabuf_heap_alloc(heap_fd, bytes, &dmabuf_fd))
+		TH_LOG("dmabuf allocation failed! - %s", strerror(errno));
+	close(heap_fd);
+
+	return dmabuf_fd;
+}
+
+int binder_request_dmabuf(int binder_fd)
+{
+	int ret;
+
+	/*
+	 * We just send an empty binder_buffer_object to initiate a transaction
+	 * with the context manager, who should respond with a single dmabuf
+	 * inside a binder_fd_array_object or a binder_fd_object.
+	 */
+
+	struct binder_buffer_object bbo =3D {
+		.hdr.type =3D BINDER_TYPE_PTR,
+		.flags =3D 0,
+		.buffer =3D 0,
+		.length =3D 0,
+		.parent =3D 0, /* No parent */
+		.parent_offset =3D 0 /* No parent */
+	};
+
+	binder_size_t offsets[] =3D {0};
+
+	struct {
+		int32_t cmd;
+		struct binder_transaction_data btd;
+	} __attribute__((packed)) bc =3D {
+		.cmd =3D BC_TRANSACTION,
+		.btd =3D {
+			.target =3D { 0 },
+			.cookie =3D 0,
+			.code =3D BINDER_CODE,
+			.flags =3D TF_ACCEPT_FDS, /* We expect a FD/FDA in the reply */
+			.data_size =3D sizeof(bbo),
+			.offsets_size =3D sizeof(offsets),
+			.data.ptr =3D {
+				(binder_uintptr_t)&bbo,
+				(binder_uintptr_t)offsets
+			}
+		},
+	};
+
+	struct {
+		int32_t reply_noop;
+	} __attribute__((packed)) br;
+
+	ret =3D do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
+	if (ret < (int)sizeof(br)) {
+		fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
+		return -1;
+	}
+	if (expect_binder_reply(br.reply_noop, BR_NOOP))
+		return -1;
+	return 0;
+}
+
+int send_dmabuf_reply_fda(int binder_fd, struct binder_transaction_data *t=
r, int dmabuf_fd)
+{
+	int ret;
+	/*
+	 * The trailing 0 is to achieve the necessary alignment for the binder
+	 * buffer_size.
+	 */
+	int fdarray[] =3D { dmabuf_fd, 0 };
+
+	struct binder_buffer_object bbo =3D {
+		.hdr.type =3D BINDER_TYPE_PTR,
+		.flags =3D 0,
+		.buffer =3D (binder_uintptr_t)fdarray,
+		.length =3D sizeof(fdarray),
+		.parent =3D 0, /* No parent */
+		.parent_offset =3D 0 /* No parent */
+	};
+
+	struct binder_fd_array_object bfdao =3D {
+		.hdr.type =3D BINDER_TYPE_FDA,
+		.flags =3D BINDER_FDA_FLAG_SENDER_NO_NEED,
+		.num_fds =3D 1,
+		.parent =3D 0, /* The binder_buffer_object */
+		.parent_offset =3D 0 /* FDs follow immediately */
+	};
+
+	uint64_t sz =3D sizeof(fdarray);
+	uint8_t data[sizeof(sz) + sizeof(bbo) + sizeof(bfdao)];
+	binder_size_t offsets[] =3D {sizeof(sz), sizeof(sz)+sizeof(bbo)};
+
+	memcpy(data,                            &sz, sizeof(sz));
+	memcpy(data + sizeof(sz),               &bbo, sizeof(bbo));
+	memcpy(data + sizeof(sz) + sizeof(bbo), &bfdao, sizeof(bfdao));
+
+	struct {
+		int32_t cmd;
+		struct binder_transaction_data_sg btd;
+	} __attribute__((packed)) bc =3D {
+		.cmd =3D BC_REPLY_SG,
+		.btd.transaction_data =3D {
+			.target =3D { tr->target.handle },
+			.cookie =3D tr->cookie,
+			.code =3D BINDER_CODE,
+			.flags =3D 0,
+			.data_size =3D sizeof(data),
+			.offsets_size =3D sizeof(offsets),
+			.data.ptr =3D {
+				(binder_uintptr_t)data,
+				(binder_uintptr_t)offsets
+			}
+		},
+		.btd.buffers_size =3D sizeof(fdarray)
+	};
+
+	struct {
+		int32_t reply_noop;
+	} __attribute__((packed)) br;
+
+	ret =3D do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
+	if (ret < (int)sizeof(br)) {
+		fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
+		return -1;
+	}
+	if (expect_binder_reply(br.reply_noop, BR_NOOP))
+		return -1;
+	return 0;
+}
+
+int send_dmabuf_reply_fd(int binder_fd, struct binder_transaction_data *tr=
, int dmabuf_fd)
+{
+	int ret;
+
+	struct binder_fd_object bfdo =3D {
+		.hdr.type =3D BINDER_TYPE_FD,
+		.flags =3D BINDER_FD_FLAG_SENDER_NO_NEED,
+		.fd =3D dmabuf_fd
+	};
+
+	binder_size_t offset =3D 0;
+
+	struct {
+		int32_t cmd;
+		struct binder_transaction_data btd;
+	} __attribute__((packed)) bc =3D {
+		.cmd =3D BC_REPLY,
+		.btd =3D {
+			.target =3D { tr->target.handle },
+			.cookie =3D tr->cookie,
+			.code =3D BINDER_CODE,
+			.flags =3D 0,
+			.data_size =3D sizeof(bfdo),
+			.offsets_size =3D sizeof(offset),
+			.data.ptr =3D {
+				(binder_uintptr_t)&bfdo,
+				(binder_uintptr_t)&offset
+			}
+		}
+	};
+
+	struct {
+		int32_t reply_noop;
+	} __attribute__((packed)) br;
+
+	ret =3D do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
+	if (ret < (int)sizeof(br)) {
+		fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
+		return -1;
+	}
+	if (expect_binder_reply(br.reply_noop, BR_NOOP))
+		return -1;
+	return 0;
+}
+
+struct binder_transaction_data *binder_wait_for_transaction(int binder_fd,
+							    uint32_t *readbuf,
+							    size_t readsize)
+{
+	static const int MAX_EVENTS =3D 1, EPOLL_WAIT_TIME_MS =3D 3 * 1000;
+	struct binder_reply {
+		int32_t reply0;
+		int32_t reply1;
+		struct binder_transaction_data btd;
+	} *br;
+	struct binder_transaction_data *ret =3D NULL;
+	struct epoll_event events[MAX_EVENTS];
+	int epoll_fd, num_events, readcount;
+	uint32_t bc[] =3D { BC_ENTER_LOOPER };
+
+	do_binder_write_read(binder_fd, &bc, sizeof(bc), NULL, 0);
+
+	epoll_fd =3D epoll_create1(EPOLL_CLOEXEC);
+	if (epoll_fd =3D=3D -1) {
+		perror("epoll_create1");
+		return NULL;
+	}
+
+	events[0].events =3D EPOLLIN;
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &events[0])) {
+		perror("epoll_ctl add");
+		goto err_close;
+	}
+
+	num_events =3D epoll_wait(epoll_fd, events, MAX_EVENTS, EPOLL_WAIT_TIME_M=
S);
+	if (num_events < 0) {
+		perror("epoll_wait");
+		goto err_ctl;
+	} else if (num_events =3D=3D 0) {
+		fprintf(stderr, "No events\n");
+		goto err_ctl;
+	}
+
+	readcount =3D do_binder_write_read(binder_fd, NULL, 0, readbuf, readsize);
+	fprintf(stderr, "Read %d bytes from binder\n", readcount);
+
+	if (readcount < (int)sizeof(struct binder_reply)) {
+		fprintf(stderr, "read_consumed not large enough\n");
+		goto err_ctl;
+	}
+
+	br =3D (struct binder_reply *)readbuf;
+	if (expect_binder_reply(br->reply0, BR_NOOP))
+		goto err_ctl;
+
+	if (br->reply1 =3D=3D BR_TRANSACTION) {
+		if (br->btd.code =3D=3D BINDER_CODE)
+			ret =3D &br->btd;
+		else
+			fprintf(stderr, "Received transaction with unexpected code: %u\n",
+				br->btd.code);
+	} else {
+		expect_binder_reply(br->reply1, BR_TRANSACTION_COMPLETE);
+	}
+
+err_ctl:
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, NULL))
+		perror("epoll_ctl del");
+err_close:
+	close(epoll_fd);
+	return ret;
+}
+
+static int child_request_dmabuf_transfer(const char *cgroup, void *arg)
+{
+	UNUSED(cgroup);
+	int ret =3D -1;
+	uint32_t readbuf[32];
+	struct binderfs_ctx bfs_ctx =3D *(struct binderfs_ctx *)arg;
+	struct binder_ctx b_ctx;
+
+	fprintf(stderr, "Child PID: %d\n", getpid());
+
+	if (open_binder(&bfs_ctx, &b_ctx)) {
+		fprintf(stderr, "Child unable to open binder\n");
+		return -1;
+	}
+
+	if (binder_request_dmabuf(b_ctx.fd))
+		goto err;
+
+	/* The child must stay alive until the binder reply is received */
+	if (binder_wait_for_transaction(b_ctx.fd, readbuf, sizeof(readbuf)) =3D=
=3D NULL)
+		ret =3D 0;
+
+	/*
+	 * We don't close the received dmabuf here so that the parent can
+	 * inspect the cgroup gpu memory charges to verify the charge transfer
+	 * completed successfully.
+	 */
+err:
+	close_binder(&b_ctx);
+	fprintf(stderr, "Child done\n");
+	return ret;
+}
+
+static const char * const GPUMEM_FILENAME =3D "gpu.memory.current";
+static const size_t ONE_MiB =3D 1024 * 1024;
+
+FIXTURE(fix) {
+	int dmabuf_fd;
+	struct binderfs_ctx bfs_ctx;
+	struct binder_ctx b_ctx;
+	struct cgroup_ctx cg_ctx;
+	struct binder_transaction_data *tr;
+	pid_t child_pid;
+};
+
+FIXTURE_SETUP(fix)
+{
+	long memsize;
+	static uint32_t readbuf[32]; /* self->tr points into this buffer */
+	struct flat_binder_object *fbo;
+	struct binder_buffer_object *bbo;
+
+	if (geteuid() !=3D 0)
+		ksft_exit_skip("Need to be root to mount binderfs\n");
+
+	if (create_binderfs(&self->bfs_ctx, "testbinder"))
+		ksft_exit_skip("The Android binderfs filesystem is not available\n");
+
+	self->cg_ctx =3D create_cgroups(_metadata);
+	if (self->cg_ctx.root =3D=3D NULL) {
+		destroy_binderfs(&self->bfs_ctx);
+		ksft_exit_skip("cgroup v2 with the gpu controller is unavailable\n");
+	}
+
+	ASSERT_EQ(cg_enter_current(self->cg_ctx.source), 0) {
+		TH_LOG("Could not move parent to cgroup: %s", self->cg_ctx.source);
+	}
+
+	self->dmabuf_fd =3D alloc_dmabuf_from_system_heap(_metadata, ONE_MiB);
+	ASSERT_GE(self->dmabuf_fd, 0);
+	TH_LOG("Allocated dmabuf");
+
+	memsize =3D cg_read_key_long(self->cg_ctx.source, GPUMEM_FILENAME, "syste=
m-heap");
+	ASSERT_EQ(memsize, ONE_MiB) {
+		TH_LOG("GPU memory used after allocation: %ld but it should be %lu",
+		       memsize, (unsigned long)ONE_MiB);
+	}
+
+	ASSERT_EQ(open_binder(&self->bfs_ctx, &self->b_ctx), 0) {
+		TH_LOG("Parent unable to open binder");
+	}
+	TH_LOG("Opened binder at %s/%s", self->bfs_ctx.mountpoint, self->bfs_ctx.=
name);
+
+	ASSERT_EQ(become_binder_context_manager(self->b_ctx.fd), 0) {
+		TH_LOG("Cannot become context manager: %s", strerror(errno));
+	}
+
+	self->child_pid =3D cg_run_nowait(
+		self->cg_ctx.dest, child_request_dmabuf_transfer, &self->bfs_ctx);
+	ASSERT_GT(self->child_pid, 0) {
+		TH_LOG("Error forking: %s", strerror(errno));
+	}
+
+	self->tr =3D binder_wait_for_transaction(self->b_ctx.fd, readbuf, sizeof(=
readbuf));
+	ASSERT_NE(self->tr, NULL) {
+		TH_LOG("Error receiving transaction request from child");
+	}
+	fbo =3D (struct flat_binder_object *)self->tr->data.ptr.buffer;
+	ASSERT_EQ(fbo->hdr.type, BINDER_TYPE_PTR) {
+		TH_LOG("Did not receive a buffer object from child");
+	}
+	bbo =3D (struct binder_buffer_object *)fbo;
+	ASSERT_EQ(bbo->length, 0) {
+		TH_LOG("Did not receive an empty buffer object from child");
+	}
+
+	TH_LOG("Received transaction from child");
+}
+
+FIXTURE_TEARDOWN(fix)
+{
+	close_binder(&self->b_ctx);
+	close(self->dmabuf_fd);
+	destroy_cgroups(_metadata, &self->cg_ctx);
+	destroy_binderfs(&self->bfs_ctx);
+}
+
+
+void verify_transfer_success(struct _test_data_fix *self, struct __test_me=
tadata *_metadata)
+{
+	ASSERT_EQ(cg_read_key_long(self->cg_ctx.dest, GPUMEM_FILENAME, "system-he=
ap"), ONE_MiB) {
+		TH_LOG("Destination cgroup does not have system-heap charge!");
+	}
+	ASSERT_EQ(cg_read_key_long(self->cg_ctx.source, GPUMEM_FILENAME, "system-=
heap"), 0) {
+		TH_LOG("Source cgroup still has system-heap charge!");
+	}
+	TH_LOG("Charge transfer succeeded!");
+}
+
+TEST_F(fix, individual_fd)
+{
+	ASSERT_EQ(send_dmabuf_reply_fd(self->b_ctx.fd, self->tr, =
self->dmabuf_fd), 0);
+	verify_transfer_success(self, _metadata);
+}
+
+TEST_F(fix, fd_array)
+{
+	ASSERT_EQ(send_dmabuf_reply_fda(self->b_ctx.fd, self->tr, =
self->dmabuf_fd), 0);
+	verify_transfer_success(self, _metadata);
+}
+
+TEST_HARNESS_MAIN
--=20
2.36.0.rc0.470.gd361397f0d-goog