[Qemu-devel] [PATCH 0/3] vfio-pci: support recovery of AER non fatal error

Cao jin posted 3 patches 7 years, 1 month ago
There is a newer version of this series
hw/pci/pcie_aer.c          |  28 +++++++
hw/vfio/pci.c              | 180 +++++++++++++++++++++++++++++++++++++++++++--
hw/vfio/pci.h              |   3 +
linux-headers/linux/vfio.h |   1 +
4 files changed, 207 insertions(+), 5 deletions(-)
Posted by Cao jin 7 years, 1 month ago
This is a nearly new design of the feature, so the version numbering restarts
from 0.

About the test:
The hardware is still unstable, as before. The test server is in another
country at site A, while my contact in that country is at site B, so it is
not convenient to get help (plugging in a cable or checking the hardware).
As a result, my NIC (which has 2 functions) still has only func1 connected to
the gateway. If anyone else who has the hardware could test the patches, that
would be a great help.


Basically, the unstable hardware shows two symptoms:
1. Start the vm; the hardware emits a fatal error by itself before I do
   anything, which causes the vm to stop.
2. Start the vm, assign an IP to func1, then ping the gateway; it shows
   "Destination Host Unreachable" after dozens or hundreds of successful
   pings, while guest dmesg shows nothing abnormal.  I think this symptom is
   *strong evidence* of unstable hardware; I speculate that the cable has a
   problem.

   On the other hand, I also saw a perfect result 2 times in my numerous
   tests, with only func1 assigned while func0 had no user. It could ping for
   several hours (more than 15000 pings) without any problem; during that
   period I injected non fatal errors into func0 & func1, and error recovery
   worked very well.

   So, most of the time, I must run the test quickly, before the hardware
   goes crazy, until I get the expected result.


Test:
scenario 1: assign func1 to vm while func0 has no user.
scenario 2: assign both functions to 1 vm, with the same topology as host.
scenario 3: assign both functions to 1 vm, under different buses.
scenario 4: assign each function to a separate vm.

The steps are: assign an IP to func1, ping the gateway, inject a non fatal
error into both functions, and see whether func1 can still ping after recovery.

Although we don't have a cable for func0, in tests like scenario 4, injecting
an error into func0 doesn't affect func1's recovery, so I think this shows
that one function's recovery doesn't affect the other's.


Extra info FYI:
1. During the test, I added some debug lines to vfio_err_notifier_handler to
   read the uncor status register when a fatal error occurred; it showed all
   F's every time.
2. I also tested the v10 patch & the corresponding kernel part, modified as
   the review comments suggested: revert the eventfd handling (don't signal
   uncor status), and let a guest link reset induce a host link reset. The
   test result shows: non fatal error recovery is good; fatal error recovery
   has the same result as what Alex found before (guest kernel crash),
   because the guest device driver's error_detected() accesses the MMIO
   registers and gets all F's.
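
To illustrate the "all F's" observation above, here is a minimal sketch (not
code from this series) of how an uncorrectable error status read could be
classified. It assumes a byte-array snapshot of the device's config space and
a known offset of the AER extended capability. The names cfg_read32,
aer_classify and enum aer_verdict are hypothetical; only the two register
offsets come from the PCIe AER capability layout. A status bit counts as
fatal when the matching bit in the Uncorrectable Error Severity register is
set, and a read of 0xFFFFFFFF is treated as "device not reachable" rather
than as a real status value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets inside the AER extended capability (PCIe spec layout). */
#define AER_UNCOR_STATUS 0x04   /* Uncorrectable Error Status */
#define AER_UNCOR_SEVER  0x0c   /* Uncorrectable Error Severity */

/* Hypothetical result type for this sketch; not a QEMU or kernel type. */
enum aer_verdict {
    AER_NONE,        /* no uncorrectable error bit is set */
    AER_NONFATAL,    /* every set status bit has severity 0 */
    AER_FATAL,       /* at least one set status bit has severity 1 */
    AER_UNREACHABLE  /* the status read returned all F's */
};

/* Read a little-endian 32-bit value from a config space snapshot. */
static uint32_t cfg_read32(const uint8_t *cfg, uint32_t off)
{
    return (uint32_t)cfg[off] |
           ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) |
           ((uint32_t)cfg[off + 3] << 24);
}

/* Classify the uncorrectable error state found at capability offset aer_off. */
static enum aer_verdict aer_classify(const uint8_t *cfg, uint32_t aer_off)
{
    uint32_t status, sever;

    assert(cfg != NULL);
    status = cfg_read32(cfg, aer_off + AER_UNCOR_STATUS);
    sever = cfg_read32(cfg, aer_off + AER_UNCOR_SEVER);

    if (status == UINT32_MAX) {
        /* All F's: the read most likely never reached the device. */
        return AER_UNREACHABLE;
    }
    if (status == 0) {
        return AER_NONE;
    }
    return (status & sever) ? AER_FATAL : AER_NONFATAL;
}
```

With such a check, a handler could avoid interpreting an all F's value as
"every error bit set" and treat the device as inaccessible instead, which
matches what the guest driver's error_detected() runs into in point 2.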


Cao jin (3):
  pcie aer: verify if AER functionality is available
  vfio pci: new function to init AER capability
  vfio-pci: process non fatal error of AER

 hw/pci/pcie_aer.c          |  28 +++++++
 hw/vfio/pci.c              | 180 +++++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/pci.h              |   3 +
 linux-headers/linux/vfio.h |   1 +
 4 files changed, 207 insertions(+), 5 deletions(-)

-- 
1.8.3.1




Re: [Qemu-devel] [PATCH 0/3] vfio-pci: support recovery of AER non fatal error
Posted by Cao jin 7 years, 1 month ago
ping

On 02/27/2017 03:30 PM, Cao jin wrote:

-- 
Sincerely,
Cao jin