[PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()

Waiman Long posted 3 patches 2 years, 1 month ago
There is a newer version of this series
[PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Posted by Waiman Long 2 years, 1 month ago
The current design of cgroup_rstat_cpu_pop_updated() is to traverse
the updated tree in a way to pop out the leaf nodes first before
their parents. This can cause traversal of multiple nodes before a
leaf node can be found and popped out. IOW, a given node in the tree
can be visited multiple times before the whole operation is done. So
it is not very efficient and the code can be hard to read.

With the introduction of cgroup_rstat_updated_list() to build a list
of cgroups to be flushed first before any flushing operation is being
done, we can optimize the way the updated tree nodes are being popped
by pushing the parents first to the tail end of the list before their
children. In this way, most updated tree nodes will be visited only
once with the exception of the subtree root as we still need to go
back to its parent and popped it out of its updated_children list.
This also makes the code easier to read.

A parallel kernel build on a 2-socket x86-64 server is used as the
benchmarking tool for measuring the lock hold time. Below were the lock
hold time frequency distribution before and after the patch:

     Hold time        Before patch       After patch
     ---------        ------------       -----------
       0-01 us        13,738,708         14,594,545
      01-05 us         1,177,194            439,926
      05-10 us             4,984              5,960
      10-15 us             3,562              3,543
      15-20 us             1,314              1,397
      20-25 us                18                 25
      25-30 us                12                 12

It can be seen that the patch pushes the lock hold time towards the
lower end.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/rstat.c | 132 ++++++++++++++++++++++--------------------
 1 file changed, 70 insertions(+), 62 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1f300bf4dc40..d2b709cfeb2a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -74,64 +74,90 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 }
 
 /**
- * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
- * @pos: current position
- * @root: root of the tree to traversal
+ * cgroup_rstat_push_children - push children cgroups into the given list
+ * @head: current head of the list (= parent cgroup)
+ * @prstatc: cgroup_rstat_cpu of the parent cgroup
  * @cpu: target cpu
+ * Return: A new singly linked list of cgroups to be flush
  *
- * Walks the updated rstat_cpu tree on @cpu from @root.  %NULL @pos starts
- * the traversal and %NULL return indicates the end.  During traversal,
- * each returned cgroup is unlinked from the tree.  Must be called with the
- * matching cgroup_rstat_cpu_lock held.
+ * Recursively traverse down the cgroup_rstat_cpu updated tree and push
+ * parent first before its children. The parent is pushed by the caller.
+ * The recursion depth is the depth of the current updated tree.
+ */
+static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
+				struct cgroup_rstat_cpu *prstatc, int cpu)
+{
+	struct cgroup *child, *parent;
+	struct cgroup_rstat_cpu *crstatc;
+
+	parent = head;
+	child = prstatc->updated_children;
+	prstatc->updated_children = parent;
+
+	/* updated_next is parent cgroup terminated */
+	while (child != parent) {
+		child->rstat_flush_next = head;
+		head = child;
+		crstatc = cgroup_rstat_cpu(child, cpu);
+		if (crstatc->updated_children != parent)
+			head = cgroup_rstat_push_children(head, crstatc, cpu);
+		child = crstatc->updated_next;
+		crstatc->updated_next = NULL;
+	}
+	return head;
+}
+
+/**
+ * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
+ * @root: root of the cgroup subtree to traverse
+ * @cpu: target cpu
+ * Return: A singly linked list of cgroups to be flushed
+ *
+ * Walks the updated rstat_cpu tree on @cpu from @root.  During traversal,
+ * each returned cgroup is unlinked from the updated tree.  Must be called
+ * with the matching cgroup_rstat_cpu_lock held.
  *
  * The only ordering guarantee is that, for a parent and a child pair
- * covered by a given traversal, if a child is visited, its parent is
- * guaranteed to be visited afterwards.
+ * covered by a given traversal, the child is before its parent in
+ * the list.
+ *
+ * Note that updated_children is self terminated while updated_next is
+ * parent cgroup terminated except the cgroup root which can be self
+ * terminated.
  */
-static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
-						   struct cgroup *root, int cpu)
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 {
-	struct cgroup_rstat_cpu *rstatc;
-	struct cgroup *parent;
-
-	if (pos == root)
-		return NULL;
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
+	struct cgroup *head = NULL, *parent;
+	unsigned long flags;
 
 	/*
-	 * We're gonna walk down to the first leaf and visit/remove it.  We
-	 * can pick whatever unvisited node as the starting point.
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
 	 */
-	if (!pos) {
-		pos = root;
-		/* return NULL if this subtree is not on-list */
-		if (!cgroup_rstat_cpu(pos, cpu)->updated_next)
-			return NULL;
-	} else {
-		pos = cgroup_parent(pos);
-	}
+	raw_spin_lock_irqsave(cpu_lock, flags);
 
-	/* walk down to the first leaf */
-	while (true) {
-		rstatc = cgroup_rstat_cpu(pos, cpu);
-		if (rstatc->updated_children == pos)
-			break;
-		pos = rstatc->updated_children;
-	}
+	/* Return NULL if this subtree is not on-list */
+	if (!rstatc->updated_next)
+		goto unlock_ret;
 
 	/*
-	 * Unlink @pos from the tree.  As the updated_children list is
+	 * Unlink @root from its parent. As the updated_children list is
 	 * singly linked, we have to walk it to find the removal point.
-	 * However, due to the way we traverse, @pos will be the first
-	 * child in most cases. The only exception is @root.
 	 */
-	parent = cgroup_parent(pos);
+	parent = cgroup_parent(root);
 	if (parent) {
 		struct cgroup_rstat_cpu *prstatc;
 		struct cgroup **nextp;
 
 		prstatc = cgroup_rstat_cpu(parent, cpu);
 		nextp = &prstatc->updated_children;
-		while (*nextp != pos) {
+		while (*nextp != root) {
 			struct cgroup_rstat_cpu *nrstatc;
 
 			nrstatc = cgroup_rstat_cpu(*nextp, cpu);
@@ -142,31 +168,13 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	}
 
 	rstatc->updated_next = NULL;
-	return pos;
-}
-
-/* Return a list of updated cgroups to be flushed */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
-{
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-	struct cgroup *head, *tail, *next;
-	unsigned long flags;
 
-	/*
-	 * The _irqsave() is needed because cgroup_rstat_lock is
-	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-	 * this lock with the _irq() suffix only disables interrupts on
-	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-	 * interrupts on both configurations. The _irqsave() ensures
-	 * that interrupts are always disabled and later restored.
-	 */
-	raw_spin_lock_irqsave(cpu_lock, flags);
-	head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
-	while (tail) {
-		next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
-		tail->rstat_flush_next = next;
-		tail = next;
-	}
+	/* Push @root to the list first before pushing the children */
+	head = root;
+	root->rstat_flush_next = NULL;
+	if (rstatc->updated_children != root)
+		head = cgroup_rstat_push_children(head, rstatc, cpu);
+unlock_ret:
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
 	return head;
 }
-- 
2.39.3
Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Posted by Yosry Ahmed 2 years, 1 month ago
On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@redhat.com> wrote:
>
> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
> the updated tree in a way to pop out the leaf nodes first before
> their parents. This can cause traversal of multiple nodes before a
> leaf node can be found and popped out. IOW, a given node in the tree
> can be visited multiple times before the whole operation is done. So
> it is not very efficient and the code can be hard to read.
>
> With the introduction of cgroup_rstat_updated_list() to build a list
> of cgroups to be flushed first before any flushing operation is being
> done, we can optimize the way the updated tree nodes are being popped
> by pushing the parents first to the tail end of the list before their
> children. In this way, most updated tree nodes will be visited only
> once with the exception of the subtree root as we still need to go
> back to its parent and popped it out of its updated_children list.
> This also makes the code easier to read.
>
> A parallel kernel build on a 2-socket x86-64 server is used as the
> benchmarking tool for measuring the lock hold time. Below were the lock
> hold time frequency distribution before and after the patch:
>
>      Hold time        Before patch       After patch
>      ---------        ------------       -----------
>        0-01 us        13,738,708         14,594,545
>       01-05 us         1,177,194            439,926
>       05-10 us             4,984              5,960
>       10-15 us             3,562              3,543
>       15-20 us             1,314              1,397
>       20-25 us                18                 25
>       25-30 us                12                 12
>
> It can be seen that the patch pushes the lock hold time towards the
> lower end.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---

I don't know why git decided to show this diff in the most confusing
way possible.

>  kernel/cgroup/rstat.c | 132 ++++++++++++++++++++++--------------------
>  1 file changed, 70 insertions(+), 62 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 1f300bf4dc40..d2b709cfeb2a 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -74,64 +74,90 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>  }
>
>  /**
> - * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
> - * @pos: current position
> - * @root: root of the tree to traversal
> + * cgroup_rstat_push_children - push children cgroups into the given list
> + * @head: current head of the list (= parent cgroup)
> + * @prstatc: cgroup_rstat_cpu of the parent cgroup
>   * @cpu: target cpu
> + * Return: A new singly linked list of cgroups to be flush
>   *
> - * Walks the updated rstat_cpu tree on @cpu from @root.  %NULL @pos starts
> - * the traversal and %NULL return indicates the end.  During traversal,
> - * each returned cgroup is unlinked from the tree.  Must be called with the
> - * matching cgroup_rstat_cpu_lock held.
> + * Recursively traverse down the cgroup_rstat_cpu updated tree and push
> + * parent first before its children. The parent is pushed by the caller.

I think it might be useful here (and elsewhere in the patch) where
"push" is being used to elaborate that we push to the beginning in a
stack-like fashion.

> + * The recursion depth is the depth of the current updated tree.
> + */
> +static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
> +                               struct cgroup_rstat_cpu *prstatc, int cpu)
> +{
> +       struct cgroup *child, *parent;
> +       struct cgroup_rstat_cpu *crstatc;
> +
> +       parent = head;
> +       child = prstatc->updated_children;
> +       prstatc->updated_children = parent;
> +
> +       /* updated_next is parent cgroup terminated */
> +       while (child != parent) {
> +               child->rstat_flush_next = head;
> +               head = child;
> +               crstatc = cgroup_rstat_cpu(child, cpu);
> +               if (crstatc->updated_children != parent)

I think cgroup->updated_children is set to the cgroup itself if it's
empty, right? Shouldn't this be crstatc->updated_children != child?

> +                       head = cgroup_rstat_push_children(head, crstatc, cpu);
> +               child = crstatc->updated_next;
> +               crstatc->updated_next = NULL;
> +       }
> +       return head;
> +}
> +
> +/**
> + * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
> + * @root: root of the cgroup subtree to traverse
> + * @cpu: target cpu
> + * Return: A singly linked list of cgroups to be flushed
> + *
> + * Walks the updated rstat_cpu tree on @cpu from @root.  During traversal,
> + * each returned cgroup is unlinked from the updated tree.  Must be called
> + * with the matching cgroup_rstat_cpu_lock held.

This function takes care of holding the lock actually. I think that
sentence should be applied to cgroup_rstat_push_children() above?

>   *
>   * The only ordering guarantee is that, for a parent and a child pair
> - * covered by a given traversal, if a child is visited, its parent is
> - * guaranteed to be visited afterwards.
> + * covered by a given traversal, the child is before its parent in
> + * the list.
> + *
> + * Note that updated_children is self terminated while updated_next is
> + * parent cgroup terminated except the cgroup root which can be self
> + * terminated.

IIUC updated_children and updated_next is the same list.
updated_children is the head, and updated_next is how the list items
are linked. This comment makes it seem like they are two different
lists.

I am actually wondering if it's worth using the singly linked list
here. We are saving 8 bytes percpu, but the semantics are fairly
confusing. Wouldn't this be easier to reason about if you just use
list_head?

updated_children would be replaced with LIST_HEAD (or similar), and
the list would be NULL terminated instead of terminated by self/parent
cgroup. IIUC the reason it's not NULL-terminated now is because we use
cgroup->updated_next to check quickly if a cgroup is on the list or
not. If we use list_heads, we can just use list_emtpy() IIUC.

We can also simplify the semantics of unlinking @root from the updated
tree below, it would just be list_del() IIUC, which is actually more
performant as well. It seems like overall we would simplify a lot of
things. When forming the updated_list, we can just walk the tree and
splice the lists in the correct order.

It seems to me that saving 8 bytes percpu is not worth the complexity
of the custom list semantics here. Am I missing something here?
Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Posted by Waiman Long 2 years, 1 month ago
On 11/6/23 15:07, Yosry Ahmed wrote:
> On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@redhat.com> wrote:
>> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
>> the updated tree in a way to pop out the leaf nodes first before
>> their parents. This can cause traversal of multiple nodes before a
>> leaf node can be found and popped out. IOW, a given node in the tree
>> can be visited multiple times before the whole operation is done. So
>> it is not very efficient and the code can be hard to read.
>>
>> With the introduction of cgroup_rstat_updated_list() to build a list
>> of cgroups to be flushed first before any flushing operation is being
>> done, we can optimize the way the updated tree nodes are being popped
>> by pushing the parents first to the tail end of the list before their
>> children. In this way, most updated tree nodes will be visited only
>> once with the exception of the subtree root as we still need to go
>> back to its parent and popped it out of its updated_children list.
>> This also makes the code easier to read.
>>
>> A parallel kernel build on a 2-socket x86-64 server is used as the
>> benchmarking tool for measuring the lock hold time. Below were the lock
>> hold time frequency distribution before and after the patch:
>>
>>       Hold time        Before patch       After patch
>>       ---------        ------------       -----------
>>         0-01 us        13,738,708         14,594,545
>>        01-05 us         1,177,194            439,926
>>        05-10 us             4,984              5,960
>>        10-15 us             3,562              3,543
>>        15-20 us             1,314              1,397
>>        20-25 us                18                 25
>>        25-30 us                12                 12
>>
>> It can be seen that the patch pushes the lock hold time towards the
>> lower end.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
> I don't know why git decided to show this diff in the most confusing
> way possible.
I agree. The diff is really hard to look at. It will be easier to apply 
the patch & looks at the actual rstat.c file.
>
>>   kernel/cgroup/rstat.c | 132 ++++++++++++++++++++++--------------------
>>   1 file changed, 70 insertions(+), 62 deletions(-)
>>
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index 1f300bf4dc40..d2b709cfeb2a 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -74,64 +74,90 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>>   }
>>
>>   /**
>> - * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
>> - * @pos: current position
>> - * @root: root of the tree to traversal
>> + * cgroup_rstat_push_children - push children cgroups into the given list
>> + * @head: current head of the list (= parent cgroup)
>> + * @prstatc: cgroup_rstat_cpu of the parent cgroup
>>    * @cpu: target cpu
>> + * Return: A new singly linked list of cgroups to be flush
>>    *
>> - * Walks the updated rstat_cpu tree on @cpu from @root.  %NULL @pos starts
>> - * the traversal and %NULL return indicates the end.  During traversal,
>> - * each returned cgroup is unlinked from the tree.  Must be called with the
>> - * matching cgroup_rstat_cpu_lock held.
>> + * Recursively traverse down the cgroup_rstat_cpu updated tree and push
>> + * parent first before its children. The parent is pushed by the caller.
> I think it might be useful here (and elsewhere in the patch) where
> "push" is being used to elaborate that we push to the beginning in a
> stack-like fashion.
Right, I am thinking about a stack when I use the word "push". I will 
clarify that in the comment.
>
>> + * The recursion depth is the depth of the current updated tree.
>> + */
>> +static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
>> +                               struct cgroup_rstat_cpu *prstatc, int cpu)
>> +{
>> +       struct cgroup *child, *parent;
>> +       struct cgroup_rstat_cpu *crstatc;
>> +
>> +       parent = head;
>> +       child = prstatc->updated_children;
>> +       prstatc->updated_children = parent;
>> +
>> +       /* updated_next is parent cgroup terminated */
>> +       while (child != parent) {
>> +               child->rstat_flush_next = head;
>> +               head = child;
>> +               crstatc = cgroup_rstat_cpu(child, cpu);
>> +               if (crstatc->updated_children != parent)
> I think cgroup->updated_children is set to the cgroup itself if it's
> empty, right? Shouldn't this be crstatc->updated_children != child?
My mistake. Will fix it in the next version.
>
>> +                       head = cgroup_rstat_push_children(head, crstatc, cpu);
>> +               child = crstatc->updated_next;
>> +               crstatc->updated_next = NULL;
>> +       }
>> +       return head;
>> +}
>> +
>> +/**
>> + * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
>> + * @root: root of the cgroup subtree to traverse
>> + * @cpu: target cpu
>> + * Return: A singly linked list of cgroups to be flushed
>> + *
>> + * Walks the updated rstat_cpu tree on @cpu from @root.  During traversal,
>> + * each returned cgroup is unlinked from the updated tree.  Must be called
>> + * with the matching cgroup_rstat_cpu_lock held.
> This function takes care of holding the lock actually. I think that
> sentence should be applied to cgroup_rstat_push_children() above?
It is left over from before this patch. Will remove that.
>
>>    *
>>    * The only ordering guarantee is that, for a parent and a child pair
>> - * covered by a given traversal, if a child is visited, its parent is
>> - * guaranteed to be visited afterwards.
>> + * covered by a given traversal, the child is before its parent in
>> + * the list.
>> + *
>> + * Note that updated_children is self terminated while updated_next is
>> + * parent cgroup terminated except the cgroup root which can be self
>> + * terminated.
> IIUC updated_children and updated_next is the same list.
> updated_children is the head, and updated_next is how the list items
> are linked. This comment makes it seem like they are two different
> lists.
Thanks for the comment. I will rework the comment to clarify that a bit 
more.
>
> I am actually wondering if it's worth using the singly linked list
> here. We are saving 8 bytes percpu, but the semantics are fairly
> confusing. Wouldn't this be easier to reason about if you just use
> list_head?
>
> updated_children would be replaced with LIST_HEAD (or similar), and
> the list would be NULL terminated instead of terminated by self/parent
> cgroup. IIUC the reason it's not NULL-terminated now is because we use
> cgroup->updated_next to check quickly if a cgroup is on the list or
> not. If we use list_heads, we can just use list_emtpy() IIUC.
>
> We can also simplify the semantics of unlinking @root from the updated
> tree below, it would just be list_del() IIUC, which is actually more
> performant as well. It seems like overall we would simplify a lot of
> things. When forming the updated_list, we can just walk the tree and
> splice the lists in the correct order.
>
> It seems to me that saving 8 bytes percpu is not worth the complexity
> of the custom list semantics here. Am I missing something here?

It will cost an additional 16 bytes of percpu memory if converted to 
list_heads. Like other lists, there will be sibling and children 
list_heads. There are also 2 pointers to update instead of one. Anyway, 
I don't have an objection to convert them to list_heads if agreed by Tejun.

Cheers,
Longman


Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Posted by Yosry Ahmed 2 years, 1 month ago
On Mon, Nov 6, 2023 at 12:37 PM Waiman Long <longman@redhat.com> wrote:
>
> On 11/6/23 15:07, Yosry Ahmed wrote:
> > On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@redhat.com> wrote:
> >> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
> >> the updated tree in a way to pop out the leaf nodes first before
> >> their parents. This can cause traversal of multiple nodes before a
> >> leaf node can be found and popped out. IOW, a given node in the tree
> >> can be visited multiple times before the whole operation is done. So
> >> it is not very efficient and the code can be hard to read.
> >>
> >> With the introduction of cgroup_rstat_updated_list() to build a list
> >> of cgroups to be flushed first before any flushing operation is being
> >> done, we can optimize the way the updated tree nodes are being popped
> >> by pushing the parents first to the tail end of the list before their
> >> children. In this way, most updated tree nodes will be visited only
> >> once with the exception of the subtree root as we still need to go
> >> back to its parent and popped it out of its updated_children list.
> >> This also makes the code easier to read.
> >>
> >> A parallel kernel build on a 2-socket x86-64 server is used as the
> >> benchmarking tool for measuring the lock hold time. Below were the lock
> >> hold time frequency distribution before and after the patch:
> >>
> >>       Hold time        Before patch       After patch
> >>       ---------        ------------       -----------
> >>         0-01 us        13,738,708         14,594,545
> >>        01-05 us         1,177,194            439,926
> >>        05-10 us             4,984              5,960
> >>        10-15 us             3,562              3,543
> >>        15-20 us             1,314              1,397
> >>        20-25 us                18                 25
> >>        25-30 us                12                 12
> >>
> >> It can be seen that the patch pushes the lock hold time towards the
> >> lower end.
> >>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> > I don't know why git decided to show this diff in the most confusing
> > way possible.
> I agree. The diff is really hard to look at. It will be easier to apply
> the patch & looks at the actual rstat.c file.

Would the diff be simpler if patches 1 & 2 were squashed?

[..]
> >
> >>    *
> >>    * The only ordering guarantee is that, for a parent and a child pair
> >> - * covered by a given traversal, if a child is visited, its parent is
> >> - * guaranteed to be visited afterwards.
> >> + * covered by a given traversal, the child is before its parent in
> >> + * the list.
> >> + *
> >> + * Note that updated_children is self terminated while updated_next is
> >> + * parent cgroup terminated except the cgroup root which can be self
> >> + * terminated.
> > IIUC updated_children and updated_next is the same list.
> > updated_children is the head, and updated_next is how the list items
> > are linked. This comment makes it seem like they are two different
> > lists.
> Thanks for the comment. I will rework the comment to clarify that a bit
> more.
> >
> > I am actually wondering if it's worth using the singly linked list
> > here. We are saving 8 bytes percpu, but the semantics are fairly
> > confusing. Wouldn't this be easier to reason about if you just use
> > list_head?
> >
> > updated_children would be replaced with LIST_HEAD (or similar), and
> > the list would be NULL terminated instead of terminated by self/parent
> > cgroup. IIUC the reason it's not NULL-terminated now is because we use
> > cgroup->updated_next to check quickly if a cgroup is on the list or
> > not. If we use list_heads, we can just use list_emtpy() IIUC.
> >
> > We can also simplify the semantics of unlinking @root from the updated
> > tree below, it would just be list_del() IIUC, which is actually more
> > performant as well. It seems like overall we would simplify a lot of
> > things. When forming the updated_list, we can just walk the tree and
> > splice the lists in the correct order.
> >
> > It seems to me that saving 8 bytes percpu is not worth the complexity
> > of the custom list semantics here. Am I missing something here?
>
> It will cost an additional 16 bytes of percpu memory if converted to
> list_heads. Like other lists, there will be sibling and children
> list_heads. There are also 2 pointers to update instead of one. Anyway,
> I don't have an objection to convert them to list_heads if agreed by Tejun.

Yes you are right. It's definitely not free, but it's also not super
costly. It's just that every time I look at the rstat code I need to
remind myself of how updated_next and updated_children work. I will
let Tejun decide.
Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Posted by Waiman Long 2 years, 1 month ago
On 11/6/23 16:17, Yosry Ahmed wrote:
> On Mon, Nov 6, 2023 at 12:37 PM Waiman Long <longman@redhat.com> wrote:
>> On 11/6/23 15:07, Yosry Ahmed wrote:
>>> On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@redhat.com> wrote:
>>>> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
>>>> the updated tree in a way to pop out the leaf nodes first before
>>>> their parents. This can cause traversal of multiple nodes before a
>>>> leaf node can be found and popped out. IOW, a given node in the tree
>>>> can be visited multiple times before the whole operation is done. So
>>>> it is not very efficient and the code can be hard to read.
>>>>
>>>> With the introduction of cgroup_rstat_updated_list() to build a list
>>>> of cgroups to be flushed first before any flushing operation is being
>>>> done, we can optimize the way the updated tree nodes are being popped
>>>> by pushing the parents first to the tail end of the list before their
>>>> children. In this way, most updated tree nodes will be visited only
>>>> once with the exception of the subtree root as we still need to go
>>>> back to its parent and popped it out of its updated_children list.
>>>> This also makes the code easier to read.
>>>>
>>>> A parallel kernel build on a 2-socket x86-64 server is used as the
>>>> benchmarking tool for measuring the lock hold time. Below were the lock
>>>> hold time frequency distribution before and after the patch:
>>>>
>>>>        Hold time        Before patch       After patch
>>>>        ---------        ------------       -----------
>>>>          0-01 us        13,738,708         14,594,545
>>>>         01-05 us         1,177,194            439,926
>>>>         05-10 us             4,984              5,960
>>>>         10-15 us             3,562              3,543
>>>>         15-20 us             1,314              1,397
>>>>         20-25 us                18                 25
>>>>         25-30 us                12                 12
>>>>
>>>> It can be seen that the patch pushes the lock hold time towards the
>>>> lower end.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> ---
>>> I don't know why git decided to show this diff in the most confusing
>>> way possible.
>> I agree. The diff is really hard to look at. It will be easier to apply
>> the patch & looks at the actual rstat.c file.
> Would the diff be simpler if patches 1 & 2 were squashed?
Maybe, but I prefer to keep them separate for now.
>
> [..]
>>>>     *
>>>>     * The only ordering guarantee is that, for a parent and a child pair
>>>> - * covered by a given traversal, if a child is visited, its parent is
>>>> - * guaranteed to be visited afterwards.
>>>> + * covered by a given traversal, the child is before its parent in
>>>> + * the list.
>>>> + *
>>>> + * Note that updated_children is self terminated while updated_next is
>>>> + * parent cgroup terminated except the cgroup root which can be self
>>>> + * terminated.
>>> IIUC updated_children and updated_next is the same list.
>>> updated_children is the head, and updated_next is how the list items
>>> are linked. This comment makes it seem like they are two different
>>> lists.
>> Thanks for the comment. I will rework the comment to clarify that a bit
>> more.
>>> I am actually wondering if it's worth using the singly linked list
>>> here. We are saving 8 bytes percpu, but the semantics are fairly
>>> confusing. Wouldn't this be easier to reason about if you just use
>>> list_head?
>>>
>>> updated_children would be replaced with LIST_HEAD (or similar), and
>>> the list would be NULL terminated instead of terminated by self/parent
>>> cgroup. IIUC the reason it's not NULL-terminated now is because we use
>>> cgroup->updated_next to check quickly if a cgroup is on the list or
>>> not. If we use list_heads, we can just use list_emtpy() IIUC.
>>>
>>> We can also simplify the semantics of unlinking @root from the updated
>>> tree below, it would just be list_del() IIUC, which is actually more
>>> performant as well. It seems like overall we would simplify a lot of
>>> things. When forming the updated_list, we can just walk the tree and
>>> splice the lists in the correct order.
>>>
>>> It seems to me that saving 8 bytes percpu is not worth the complexity
>>> of the custom list semantics here. Am I missing something here?
>> It will cost an additional 16 bytes of percpu memory if converted to
>> list_heads. Like other lists, there will be sibling and children
>> list_heads. There are also 2 pointers to update instead of one. Anyway,
>> I don't have an objection to convert them to list_heads if agreed by Tejun.
> Yes you are right. It's definitely not free, but it's also not super
> costly. It's just that every time I look at the rstat code I need to
> remind myself of how updated_next and updated_children work. I will
> let Tejun decide.

After further thought, changing it to list_head may not be possible with 
the current design since the actual linkage is like:

update_next -> cgroup + cpu --> update_next --> ...

So unless we change the design to link cgroup_rstat_cpu directly to each 
other for a given CPU, not via a cgroup intermediary, we will not be 
able to use list_head and the associated list_add() & list_del() macros. 
Direct linkage, however, requires a cgroup back pointer. So the real 
cost will be 24 bytes instead.

Cheers,
Longman

Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Posted by Yosry Ahmed 2 years, 1 month ago
> >>>>     *
> >>>>     * The only ordering guarantee is that, for a parent and a child pair
> >>>> - * covered by a given traversal, if a child is visited, its parent is
> >>>> - * guaranteed to be visited afterwards.
> >>>> + * covered by a given traversal, the child is before its parent in
> >>>> + * the list.
> >>>> + *
> >>>> + * Note that updated_children is self terminated while updated_next is
> >>>> + * parent cgroup terminated except the cgroup root which can be self
> >>>> + * terminated.
> >>> IIUC updated_children and updated_next is the same list.
> >>> updated_children is the head, and updated_next is how the list items
> >>> are linked. This comment makes it seem like they are two different
> >>> lists.
> >> Thanks for the comment. I will rework the comment to clarify that a bit
> >> more.
> >>> I am actually wondering if it's worth using the singly linked list
> >>> here. We are saving 8 bytes percpu, but the semantics are fairly
> >>> confusing. Wouldn't this be easier to reason about if you just use
> >>> list_head?
> >>>
> >>> updated_children would be replaced with LIST_HEAD (or similar), and
> >>> the list would be NULL terminated instead of terminated by self/parent
> >>> cgroup. IIUC the reason it's not NULL-terminated now is because we use
> >>> cgroup->updated_next to check quickly if a cgroup is on the list or
> >>> not. If we use list_heads, we can just use list_emtpy() IIUC.
> >>>
> >>> We can also simplify the semantics of unlinking @root from the updated
> >>> tree below, it would just be list_del() IIUC, which is actually more
> >>> performant as well. It seems like overall we would simplify a lot of
> >>> things. When forming the updated_list, we can just walk the tree and
> >>> splice the lists in the correct order.
> >>>
> >>> It seems to me that saving 8 bytes percpu is not worth the complexity
> >>> of the custom list semantics here. Am I missing something here?
> >> It will cost an additional 16 bytes of percpu memory if converted to
> >> list_heads. Like other lists, there will be sibling and children
> >> list_heads. There are also 2 pointers to update instead of one. Anyway,
> >> I don't have an objection to convert them to list_heads if agreed by Tejun.
> > Yes you are right. It's definitely not free, but it's also not super
> > costly. It's just that every time I look at the rstat code I need to
> > remind myself of how updated_next and updated_children work. I will
> > let Tejun decide.
>
> After further thought, changing it to list_head may not be possible with
> the current design since the actual linkage is like:
>
> update_next -> cgroup + cpu --> update_next --> ...
>
> So unless we change the design to link cgroup_rstat_cpu directly to each
> other for a given CPU, not via a cgroup intermediary, we will not be
> able to use list_head and the associated list_add() & list_del() macros.
> Direct linkage, however, requires a cgroup back pointer. So the real
> cost will be 24 bytes instead.

Yes you are right. Perhaps it's not worth it and it may not be as
simple as I thought. Please dismiss this suggestion, we'll have to
rely on comments for now to keep things clear.