[v7] Add support for qca8k mdio rw in Ethernet packet

[RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Ansuel Smith 4 years, 5 months ago

Hi, this is ready but require some additional test on a wider userbase.

The main reason for this is that we notice some routing problem in the
switch and it seems assisted learning is needed. Considering mdio is
quite slow due to the indirect write using this Ethernet alternative way
seems to be quicker.

The qca8k switch supports a special way to pass mdio read/write request
using specially crafted Ethernet packet.
This works by putting some defined data in the Ethernet header where the
mac source and dst should be placed. The Ethernet type header is set to qca
header and is set to a mdio read/write type.
This is used to communicate to the switch that this is a special packet
and should be parsed differently.

Currently we use Ethernet packet for
- MIB counter
- mdio read/write configuration
- phy read/write for each port

Current implementation of this use completion API to wait for the packet
to be processed by the tagger and has a timeout that fallback to the
legacy mdio way and mutex to enforce one transaction at time.

We now have connect()/disconnect() ops for the tagger. They are used to
allocate priv data in the dsa priv. The header still has to be put in
global include to make it usable by a dsa driver.
They are called when the tag is connect to the dst and the data is freed
using discconect on tagger change.

(if someone wonder why the bind function is put at in the general setup
function it's because tag is set in the cpu port where the notifier is
still not available and we require the notifier to sen the
tag_proto_connect() event.

We now have a tag_proto_connect() for the dsa driver used to put
additional data in the tagger priv (that is actually the dsa priv).
This is called using a switch event DSA_NOTIFIER_TAG_PROTO_CONNECT.
Current use for this is adding handler for the Ethernet packet to keep
the tagger code as dumb as possible.

The tagger priv implement only the handler for the special packet. All the
other stuff is placed in the qca8k_priv and the tagger has to access
it under lock.

We use the new API from Vladimir to track if the master port is
operational or not. We had to track many thing to reach a usable state.
Checking if the port is UP is not enough and tracking a NETDEV_CHANGE is
also not enough since it use also for other task. The correct way was
both track for interface UP and if a qdisc was assigned to the
interface. That tells us the port (and the tagger indirectly) is ready
to accept and process packet.

I tested this with multicpu port and with port6 set as the unique port and
it's sad.
It seems they implemented this feature in a bad way and this is only
supported with cpu port0. When cpu port6 is the unique port, the switch
doesn't send ack packet. With multicpu port, packet ack are not duplicated
and only cpu port0 sends them. This is the same for the MIB counter.
For this reason this feature is enabled only when cpu port0 is enabled and
operational.




Sorry if I missed any comments. This series is 2 month old so I think it
would be good to recheck everything. In the mean time this was tested on
and no regression are observed related to random port drop.

v7:
- Rebase on net-next changes
- Add bulk patches to speedup this even more
v6:
- Fix some error in ethtool handler caused by rebase/cleanup
v5:
- Adapt to new API fixes
- Fix a wrong logic for noop
- Add additional lock for master_state change
- Limit mdio Ethernet to cpu port0 (switch limitation)
- Add priority to these special packet
- Move mdio cache to qca8k_priv
v4:
- Remove duplicate patch sent by mistake.
v3:
- Include MIB with Ethernet packet.
- Include phy read/write with Ethernet packet.
- Reorganize code with new API.
- Introuce master tracking by Vladimir
v2:
- Address all suggestion from Vladimir.
  Try to generilize this with connect/disconnect function from the
  tagger and tag_proto_connect for the driver.

Ansuel Smith (14):
  net: dsa: tag_qca: convert to FIELD macro
  net: dsa: tag_qca: move define to include linux/dsa
  net: dsa: tag_qca: enable promisc_on_master flag
  net: dsa: tag_qca: add define for handling mgmt Ethernet packet
  net: dsa: tag_qca: add define for handling MIB packet
  net: dsa: tag_qca: add support for handling mgmt and MIB Ethernet
    packet
  net: dsa: qca8k: add tracking state of master port
  net: dsa: qca8k: add support for mgmt read/write in Ethernet packet
  net: dsa: qca8k: add support for mib autocast in Ethernet packet
  net: dsa: qca8k: add support for phy read/write with mgmt Ethernet
  net: dsa: qca8k: move page cache to driver priv
  net: dsa: qca8k: cache lo and hi for mdio write
  net: da: qca8k: add support for larger read/write size with mgmt
    Ethernet
  net: dsa: qca8k: introduce qca8k_bulk_read/write function

Vladimir Oltean (2):
  net: dsa: provide switch operations for tracking the master state
  net: dsa: replay master state events in
    dsa_tree_{setup,teardown}_master

 drivers/net/dsa/qca8k.c     | 709 +++++++++++++++++++++++++++++++++---
 drivers/net/dsa/qca8k.h     |  46 ++-
 include/linux/dsa/tag_qca.h |  82 +++++
 include/net/dsa.h           |  17 +
 net/dsa/dsa2.c              |  74 +++-
 net/dsa/dsa_priv.h          |  13 +
 net/dsa/slave.c             |  32 ++
 net/dsa/switch.c            |  15 +
 net/dsa/tag_qca.c           |  83 +++--
 9 files changed, 993 insertions(+), 78 deletions(-)
 create mode 100644 include/linux/dsa/tag_qca.h

-- 
2.33.1

Re: [RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Ansuel Smith 4 years, 5 months ago

On Sun, Jan 23, 2022 at 02:33:21AM +0100, Ansuel Smith wrote:
> Hi, this is ready but require some additional test on a wider userbase.
> 
> The main reason for this is that we notice some routing problem in the
> switch and it seems assisted learning is needed. Considering mdio is
> quite slow due to the indirect write using this Ethernet alternative way
> seems to be quicker.
> 
> The qca8k switch supports a special way to pass mdio read/write request
> using specially crafted Ethernet packet.
> This works by putting some defined data in the Ethernet header where the
> mac source and dst should be placed. The Ethernet type header is set to qca
> header and is set to a mdio read/write type.
> This is used to communicate to the switch that this is a special packet
> and should be parsed differently.
> 
> Currently we use Ethernet packet for
> - MIB counter
> - mdio read/write configuration
> - phy read/write for each port
> 
> Current implementation of this use completion API to wait for the packet
> to be processed by the tagger and has a timeout that fallback to the
> legacy mdio way and mutex to enforce one transaction at time.
> 
> We now have connect()/disconnect() ops for the tagger. They are used to
> allocate priv data in the dsa priv. The header still has to be put in
> global include to make it usable by a dsa driver.
> They are called when the tag is connect to the dst and the data is freed
> using discconect on tagger change.
> 
> (if someone wonder why the bind function is put at in the general setup
> function it's because tag is set in the cpu port where the notifier is
> still not available and we require the notifier to sen the
> tag_proto_connect() event.
> 
> We now have a tag_proto_connect() for the dsa driver used to put
> additional data in the tagger priv (that is actually the dsa priv).
> This is called using a switch event DSA_NOTIFIER_TAG_PROTO_CONNECT.
> Current use for this is adding handler for the Ethernet packet to keep
> the tagger code as dumb as possible.
> 
> The tagger priv implement only the handler for the special packet. All the
> other stuff is placed in the qca8k_priv and the tagger has to access
> it under lock.
> 
> We use the new API from Vladimir to track if the master port is
> operational or not. We had to track many thing to reach a usable state.
> Checking if the port is UP is not enough and tracking a NETDEV_CHANGE is
> also not enough since it use also for other task. The correct way was
> both track for interface UP and if a qdisc was assigned to the
> interface. That tells us the port (and the tagger indirectly) is ready
> to accept and process packet.
> 
> I tested this with multicpu port and with port6 set as the unique port and
> it's sad.
> It seems they implemented this feature in a bad way and this is only
> supported with cpu port0. When cpu port6 is the unique port, the switch
> doesn't send ack packet. With multicpu port, packet ack are not duplicated
> and only cpu port0 sends them. This is the same for the MIB counter.
> For this reason this feature is enabled only when cpu port0 is enabled and
> operational.
> 
> 
> 
> 
> Sorry if I missed any comments. This series is 2 month old so I think it
> would be good to recheck everything. In the mean time this was tested on
> and no regression are observed related to random port drop.
> 
> v7:
> - Rebase on net-next changes
> - Add bulk patches to speedup this even more
> v6:
> - Fix some error in ethtool handler caused by rebase/cleanup
> v5:
> - Adapt to new API fixes
> - Fix a wrong logic for noop
> - Add additional lock for master_state change
> - Limit mdio Ethernet to cpu port0 (switch limitation)
> - Add priority to these special packet
> - Move mdio cache to qca8k_priv
> v4:
> - Remove duplicate patch sent by mistake.
> v3:
> - Include MIB with Ethernet packet.
> - Include phy read/write with Ethernet packet.
> - Reorganize code with new API.
> - Introuce master tracking by Vladimir
> v2:
> - Address all suggestion from Vladimir.
>   Try to generilize this with connect/disconnect function from the
>   tagger and tag_proto_connect for the driver.
> 
> Ansuel Smith (14):
>   net: dsa: tag_qca: convert to FIELD macro
>   net: dsa: tag_qca: move define to include linux/dsa
>   net: dsa: tag_qca: enable promisc_on_master flag
>   net: dsa: tag_qca: add define for handling mgmt Ethernet packet
>   net: dsa: tag_qca: add define for handling MIB packet
>   net: dsa: tag_qca: add support for handling mgmt and MIB Ethernet
>     packet
>   net: dsa: qca8k: add tracking state of master port
>   net: dsa: qca8k: add support for mgmt read/write in Ethernet packet
>   net: dsa: qca8k: add support for mib autocast in Ethernet packet
>   net: dsa: qca8k: add support for phy read/write with mgmt Ethernet
>   net: dsa: qca8k: move page cache to driver priv
>   net: dsa: qca8k: cache lo and hi for mdio write
>   net: da: qca8k: add support for larger read/write size with mgmt
>     Ethernet
>   net: dsa: qca8k: introduce qca8k_bulk_read/write function
> 
> Vladimir Oltean (2):
>   net: dsa: provide switch operations for tracking the master state
>   net: dsa: replay master state events in
>     dsa_tree_{setup,teardown}_master
> 
>  drivers/net/dsa/qca8k.c     | 709 +++++++++++++++++++++++++++++++++---
>  drivers/net/dsa/qca8k.h     |  46 ++-
>  include/linux/dsa/tag_qca.h |  82 +++++
>  include/net/dsa.h           |  17 +
>  net/dsa/dsa2.c              |  74 +++-
>  net/dsa/dsa_priv.h          |  13 +
>  net/dsa/slave.c             |  32 ++
>  net/dsa/switch.c            |  15 +
>  net/dsa/tag_qca.c           |  83 +++--
>  9 files changed, 993 insertions(+), 78 deletions(-)
>  create mode 100644 include/linux/dsa/tag_qca.h
> 
> -- 
> 2.33.1
> 

Hi,
sorry for the delay in sending v8, it's ready but I'm far from home and
I still need to check some mdio improvement with pointer handling.

Anyway I have some concern aboutall the skb alloc.
I wonder if that part can be improved at the cost of some additional
space used.

The idea Is to use the cache stuff also for the eth skb (or duplicate
it?) And use something like build_skb and recycle the skb space
everytime...
This comes from the fact that packet size is ALWAYS the same and it
seems stupid to allocate and free it everytime. Considering we also
enforce a one way transaction (we send packet and we wait for response)
this makes the allocation process even more stupid.

So I wonder if we would have some perf improvement/less load by
declaring the mgmt eth space and build an skb that always use that
preallocate space and just modify data.

I would really love some feedback considering qca8k is also used in very
low spec ath79 device where we need to reduce the load in every way
possible. Also if anyone have more ideas on how to improve this to make
it less heavy cpu side, feel free to point it out even if it would
mean that my implemenation is complete sh*t.

(The use of caching the address would permit us to reduce the write to
this preallocated space even more or ideally to send the same skb)

-- 
	Ansuel

Re: [RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Florian Fainelli 4 years, 5 months ago

On 1/30/2022 5:59 AM, Ansuel Smith wrote:
>>
> 
> Hi,
> sorry for the delay in sending v8, it's ready but I'm far from home and
> I still need to check some mdio improvement with pointer handling.
> 
> Anyway I have some concern aboutall the skb alloc.
> I wonder if that part can be improved at the cost of some additional
> space used.
> 
> The idea Is to use the cache stuff also for the eth skb (or duplicate
> it?) And use something like build_skb and recycle the skb space
> everytime...
> This comes from the fact that packet size is ALWAYS the same and it
> seems stupid to allocate and free it everytime. Considering we also
> enforce a one way transaction (we send packet and we wait for response)
> this makes the allocation process even more stupid.
> 
> So I wonder if we would have some perf improvement/less load by
> declaring the mgmt eth space and build an skb that always use that
> preallocate space and just modify data.
> 
> I would really love some feedback considering qca8k is also used in very
> low spec ath79 device where we need to reduce the load in every way
> possible. Also if anyone have more ideas on how to improve this to make
> it less heavy cpu side, feel free to point it out even if it would
> mean that my implemenation is complete sh*t.
> 
> (The use of caching the address would permit us to reduce the write to
> this preallocated space even more or ideally to send the same skb)

I would say first things first: get this patch series included since it 
is very close from being suitable for inclusion in net-next. Then you 
can profile the I/O accesses over the management Ethernet frames and 
devise a strategy to optimize them to make as little CPU cycles 
intensive as possible.

build_skb() is not exactly a magic bullet that will solve all 
performance problems, you still need the non-data portion of the skb to 
be allocated, and also keep in mind that you need tail room at the end 
of the data buffer in order for struct skb_shared_info to be written. 
This means that the hardware is not allowed to write at the end of the 
data buffer, or you must reduce the maximum RX length accordingly to 
prevent that. Your frames are small enough here this is unlikely to be 
an issue.

Since the MDIO layer does not really allow more than one outstanding 
transaction per MDIO device at a time, you might be just fine with just 
have a front and back skb set of buffers and alternating between these two.
-- 
Florian

Re: [RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Ansuel Smith 4 years, 4 months ago

On Sun, Jan 30, 2022 at 09:07:16AM -0800, Florian Fainelli wrote:
> 
> 
> On 1/30/2022 5:59 AM, Ansuel Smith wrote:
> > > 
> > 
> > Hi,
> > sorry for the delay in sending v8, it's ready but I'm far from home and
> > I still need to check some mdio improvement with pointer handling.
> > 
> > Anyway I have some concern aboutall the skb alloc.
> > I wonder if that part can be improved at the cost of some additional
> > space used.
> > 
> > The idea Is to use the cache stuff also for the eth skb (or duplicate
> > it?) And use something like build_skb and recycle the skb space
> > everytime...
> > This comes from the fact that packet size is ALWAYS the same and it
> > seems stupid to allocate and free it everytime. Considering we also
> > enforce a one way transaction (we send packet and we wait for response)
> > this makes the allocation process even more stupid.
> > 
> > So I wonder if we would have some perf improvement/less load by
> > declaring the mgmt eth space and build an skb that always use that
> > preallocate space and just modify data.
> > 
> > I would really love some feedback considering qca8k is also used in very
> > low spec ath79 device where we need to reduce the load in every way
> > possible. Also if anyone have more ideas on how to improve this to make
> > it less heavy cpu side, feel free to point it out even if it would
> > mean that my implemenation is complete sh*t.
> > 
> > (The use of caching the address would permit us to reduce the write to
> > this preallocated space even more or ideally to send the same skb)
> 
> I would say first things first: get this patch series included since it is
> very close from being suitable for inclusion in net-next. Then you can
> profile the I/O accesses over the management Ethernet frames and devise a
> strategy to optimize them to make as little CPU cycles intensive as
> possible.
>

Don't know if it's correct to continue this disccusion here.

> build_skb() is not exactly a magic bullet that will solve all performance
> problems, you still need the non-data portion of the skb to be allocated,
> and also keep in mind that you need tail room at the end of the data buffer
> in order for struct skb_shared_info to be written. This means that the
> hardware is not allowed to write at the end of the data buffer, or you must
> reduce the maximum RX length accordingly to prevent that. Your frames are
> small enough here this is unlikely to be an issue.
> 

I did some test with a build_skb() implemenation and I just discovered
that It wouldn't work... Problem of build_skb() is that the driver will
release the data and that's exactly what I want to skip (one allocated
memory space that is reused for every skb)

Wonder if it would be acceptable to allocate a skb when master became
operational and use always that.
When this preallocated skb has to be used, the required data is changed
and the users of the skb is increased so that it's not free. In theory
all the skb shared data and head should be the same as what changes of
the packet is just the data and nothing else.
It looks like an hack but that is the only way I found to skip the
skb_free when the packet is processed. (increasing the skb users)

> Since the MDIO layer does not really allow more than one outstanding
> transaction per MDIO device at a time, you might be just fine with just have
> a front and back skb set of buffers and alternating between these two.

Another way as you suggested would be have 2 buffer and use build_skb to
use build the sbk around the allocated buffer. But still my main concern
is if the use of manually increasing the skb user is accepted to skip
any skb free from happening.

Hope I'm not too annoying with these kind of question.

> -- 
> Florian

-- 
	Ansuel

Re: [RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Vladimir Oltean 4 years, 4 months ago

On Thu, Feb 03, 2022 at 06:59:13PM +0100, Ansuel Smith wrote:
> On Sun, Jan 30, 2022 at 09:07:16AM -0800, Florian Fainelli wrote:
> > On 1/30/2022 5:59 AM, Ansuel Smith wrote:
> > > Hi,
> > > sorry for the delay in sending v8, it's ready but I'm far from home and
> > > I still need to check some mdio improvement with pointer handling.
> > > 
> > > Anyway I have some concern aboutall the skb alloc.
> > > I wonder if that part can be improved at the cost of some additional
> > > space used.
> > > 
> > > The idea Is to use the cache stuff also for the eth skb (or duplicate
> > > it?) And use something like build_skb and recycle the skb space
> > > everytime...
> > > This comes from the fact that packet size is ALWAYS the same and it
> > > seems stupid to allocate and free it everytime. Considering we also
> > > enforce a one way transaction (we send packet and we wait for response)
> > > this makes the allocation process even more stupid.
> > > 
> > > So I wonder if we would have some perf improvement/less load by
> > > declaring the mgmt eth space and build an skb that always use that
> > > preallocate space and just modify data.
> > > 
> > > I would really love some feedback considering qca8k is also used in very
> > > low spec ath79 device where we need to reduce the load in every way
> > > possible. Also if anyone have more ideas on how to improve this to make
> > > it less heavy cpu side, feel free to point it out even if it would
> > > mean that my implemenation is complete sh*t.
> > > 
> > > (The use of caching the address would permit us to reduce the write to
> > > this preallocated space even more or ideally to send the same skb)
> > 
> > I would say first things first: get this patch series included since it is
> > very close from being suitable for inclusion in net-next. Then you can
> > profile the I/O accesses over the management Ethernet frames and devise a
> > strategy to optimize them to make as little CPU cycles intensive as
> > possible.
> >
> 
> Don't know if it's correct to continue this disccusion here.
> 
> > build_skb() is not exactly a magic bullet that will solve all performance
> > problems, you still need the non-data portion of the skb to be allocated,
> > and also keep in mind that you need tail room at the end of the data buffer
> > in order for struct skb_shared_info to be written. This means that the
> > hardware is not allowed to write at the end of the data buffer, or you must
> > reduce the maximum RX length accordingly to prevent that. Your frames are
> > small enough here this is unlikely to be an issue.
> > 
> 
> I did some test with a build_skb() implemenation and I just discovered
> that It wouldn't work... Problem of build_skb() is that the driver will
> release the data and that's exactly what I want to skip (one allocated
> memory space that is reused for every skb)
> 
> Wonder if it would be acceptable to allocate a skb when master became
> operational and use always that.
> When this preallocated skb has to be used, the required data is changed
> and the users of the skb is increased so that it's not free. In theory
> all the skb shared data and head should be the same as what changes of
> the packet is just the data and nothing else.
> It looks like an hack but that is the only way I found to skip the
> skb_free when the packet is processed. (increasing the skb users)
>
> > Since the MDIO layer does not really allow more than one outstanding
> > transaction per MDIO device at a time, you might be just fine with just have
> > a front and back skb set of buffers and alternating between these two.
> 
> Another way as you suggested would be have 2 buffer and use build_skb to
> use build the sbk around the allocated buffer. But still my main concern
> is if the use of manually increasing the skb user is accepted to skip
> any skb free from happening.
> 
> Hope I'm not too annoying with these kind of question.

To my knowledge, when you call dev_queue_xmit(), the skb is no longer
yours, end of story, it doesn't matter whether you increase the refcount
on it or not. The DSA master may choose to do whatever it wishes with
that buffer after its TX completion interrupt fires: it may not call
napi_consume_skb() but directly recycle that buffer in its pool of RX
buffers, as part of some weird buffer recycling scheme. So you'll think
that the buffer is yours, but it isn't, because the driver hasn't
returned it to the allocator, and your writes for the next packet may be
concurrent with some RX DMA transactions. I don't have a mainline
example to give you, but I've seen the pattern, and I don't think it's
illegal (although of course, I stand to be corrected if necessary).

Re: [RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Jakub Kicinski 4 years, 4 months ago

On Thu, 3 Feb 2022 20:21:28 +0200 Vladimir Oltean wrote:
> To my knowledge, when you call dev_queue_xmit(), the skb is no longer
> yours, end of story, it doesn't matter whether you increase the refcount
> on it or not. The DSA master may choose to do whatever it wishes with
> that buffer after its TX completion interrupt fires: it may not call
> napi_consume_skb() but directly recycle that buffer in its pool of RX
> buffers, as part of some weird buffer recycling scheme. So you'll think
> that the buffer is yours, but it isn't, because the driver hasn't
> returned it to the allocator, and your writes for the next packet may be
> concurrent with some RX DMA transactions. I don't have a mainline
> example to give you, but I've seen the pattern, and I don't think it's
> illegal (although of course, I stand to be corrected if necessary).

Are we talking about holding onto the Tx skb here or also recycling 
the Rx one? Sorry for another out of context comment in advance..

AFAIK in theory shared skbs are supposed to be untouched or unshared
explicitly by the driver on Tx. pktgen takes advantage of it.
We have IFF_TX_SKB_SHARING. 

In practice everyone gets opted into SKB_SHARING because ether_setup()
sets the flag. A lot of drivers are not aware of the requirement and
will assume full ownership (and for example use skb->cb[]) :/

I don't think there is any Tx completion -> Rx pool recycling scheme
inside the drivers (if that's what you described).

Re: [RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Ansuel Smith 4 years, 4 months ago

On Thu, Feb 03, 2022 at 12:10:27PM -0800, Jakub Kicinski wrote:
> On Thu, 3 Feb 2022 20:21:28 +0200 Vladimir Oltean wrote:
> > To my knowledge, when you call dev_queue_xmit(), the skb is no longer
> > yours, end of story, it doesn't matter whether you increase the refcount
> > on it or not. The DSA master may choose to do whatever it wishes with
> > that buffer after its TX completion interrupt fires: it may not call
> > napi_consume_skb() but directly recycle that buffer in its pool of RX
> > buffers, as part of some weird buffer recycling scheme. So you'll think
> > that the buffer is yours, but it isn't, because the driver hasn't
> > returned it to the allocator, and your writes for the next packet may be
> > concurrent with some RX DMA transactions. I don't have a mainline
> > example to give you, but I've seen the pattern, and I don't think it's
> > illegal (although of course, I stand to be corrected if necessary).
> 
> Are we talking about holding onto the Tx skb here or also recycling 
> the Rx one? Sorry for another out of context comment in advance..
>

Here we are talking about tx skb. (We can't really operate on the
received skb handled by the tagger)
What I want to improve is the fact that we alloc 3 very small skb that
will have only some part of them modified and the rest equal on all 3
skb. So my idea is find a way to preallocate a space and build a skb
around it. In short as we force a 1:1 comunication where we send an skb
and we wait for the response to be processed, we can reuse the same skb
everytime and just modify the data in it. I'm asking if anyone have some
hint/direction without proposing a patch that is a big massive hack that
will be NACK in 0.1 ms LOL

> AFAIK in theory shared skbs are supposed to be untouched or unshared
> explicitly by the driver on Tx. pktgen takes advantage of it.
> We have IFF_TX_SKB_SHARING. 
> 

Will check how this flags is used. If you have any hint about this feel
free to suggest it.

> In practice everyone gets opted into SKB_SHARING because ether_setup()
> sets the flag. A lot of drivers are not aware of the requirement and
> will assume full ownership (and for example use skb->cb[]) :/
> 

Consider that currently this will be used by stmmac driver, brcm driver
and someother atheros driver.

> I don't think there is any Tx completion -> Rx pool recycling scheme
> inside the drivers (if that's what you described).

-- 
	Ansuel

Re: [RFC PATCH v7 00/16] Add support for qca8k mdio rw in Ethernet packet

Posted by Vladimir Oltean 4 years, 4 months ago

On Thu, Feb 03, 2022 at 12:10:27PM -0800, Jakub Kicinski wrote:
> On Thu, 3 Feb 2022 20:21:28 +0200 Vladimir Oltean wrote:
> > To my knowledge, when you call dev_queue_xmit(), the skb is no longer
> > yours, end of story, it doesn't matter whether you increase the refcount
> > on it or not. The DSA master may choose to do whatever it wishes with
> > that buffer after its TX completion interrupt fires: it may not call
> > napi_consume_skb() but directly recycle that buffer in its pool of RX
> > buffers, as part of some weird buffer recycling scheme. So you'll think
> > that the buffer is yours, but it isn't, because the driver hasn't
> > returned it to the allocator, and your writes for the next packet may be
> > concurrent with some RX DMA transactions. I don't have a mainline
> > example to give you, but I've seen the pattern, and I don't think it's
> > illegal (although of course, I stand to be corrected if necessary).
>
> Are we talking about holding onto the Tx skb here or also recycling
> the Rx one? Sorry for another out of context comment in advance..

We're talking about the possibility that the DSA master holds onto the
TX skb, for the purpose of saving a netdev_alloc_skb() call later in the
RX path.

> AFAIK in theory shared skbs are supposed to be untouched or unshared
> explicitly by the driver on Tx. pktgen takes advantage of it.
> We have IFF_TX_SKB_SHARING.
>
> In practice everyone gets opted into SKB_SHARING because ether_setup()
> sets the flag. A lot of drivers are not aware of the requirement and
> will assume full ownership (and for example use skb->cb[]) :/
>
> I don't think there is any Tx completion -> Rx pool recycling scheme
> inside the drivers (if that's what you described).

You made me go look again at commit acb600def211 ("net: remove skb
recycling"), a revert of which is still carried in some vendor kernels.
So there was a skb_is_recycleable() function with this check:

	if (skb_shared(skb))
		return false;

which means that yes, my argument is basically invalid, skb_get() in DSA
will protect against skb recycling.

I know Ansuel is using OpenWRT, so not a stranger to vendor kernels &
network drivers. My comment wasn't as much a hard NACK as it was a word
of caution. DSA allows pairing any switch driver to any host controller
driver. If somebody hits the jackpot (probably won't be me, won't be
Ansuel), it won't be exactly fun for them until they figure out what's
going on, with the symptoms being random data corruption during switch
register access. But after I revisited the exact situation, I don't have
an example that proves to be problematic.