[edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

Eric Dong posted 1 patch 5 years, 10 months ago
Failed in applying to current master (apply log)
UefiCpuPkg/Library/MpInitLib/MpLib.c | 11 +++++------
UefiCpuPkg/Library/MpInitLib/MpLib.h |  3 +--
2 files changed, 6 insertions(+), 8 deletions(-)
[edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Eric Dong 5 years, 10 months ago
Parameter StartCount duplicates with RunningCount. After this change, 
RunningCount means the running AP count.

V2 changes: Remove volatile for RunningCount.

Done Test:
1.PI SCT Test
2.Boot OS / S3

Cc: Ruiyu Ni <ruiyu.ni@intel.com>
Cc: Jeff Fan <vanjeff_919@hotmail.com>
Contributed-under: TianoCore Contribution Agreement 1.1
Signed-off-by: Eric Dong <eric.dong@intel.com>
---
 UefiCpuPkg/Library/MpInitLib/MpLib.c | 11 +++++------
 UefiCpuPkg/Library/MpInitLib/MpLib.h |  3 +--
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.c b/UefiCpuPkg/Library/MpInitLib/MpLib.c
index 3945771764..52c9679099 100644
--- a/UefiCpuPkg/Library/MpInitLib/MpLib.c
+++ b/UefiCpuPkg/Library/MpInitLib/MpLib.c
@@ -1400,7 +1400,7 @@ CheckAllAPs (
     // value of state after setting the it to CpuStateFinished, so BSP can safely make use of its value.
     //
     if (GetApState(CpuData) != CpuStateBusy) {
-      CpuMpData->RunningCount ++;
+      CpuMpData->RunningCount --;
       CpuMpData->CpuData[ProcessorNumber].Waiting = FALSE;
 
       //
@@ -1425,7 +1425,7 @@ CheckAllAPs (
   //
   // If all APs finish, return EFI_SUCCESS.
   //
-  if (CpuMpData->RunningCount == CpuMpData->StartCount) {
+  if (CpuMpData->RunningCount == 0) {
     return EFI_SUCCESS;
   }
 
@@ -1442,7 +1442,7 @@ CheckAllAPs (
     //
     if (CpuMpData->FailedCpuList != NULL) {
       *CpuMpData->FailedCpuList =
-         AllocatePool ((CpuMpData->StartCount - CpuMpData->FinishedCount + 1) * sizeof (UINTN));
+         AllocatePool ((CpuMpData->RunningCount + 1) * sizeof (UINTN));
       ASSERT (*CpuMpData->FailedCpuList != NULL);
     }
     ListIndex = 0;
@@ -2121,7 +2121,7 @@ StartupAllAPsWorker (
     return EFI_NOT_STARTED;
   }
 
-  CpuMpData->StartCount = 0;
+  CpuMpData->RunningCount = 0;
   for (ProcessorNumber = 0; ProcessorNumber < ProcessorCount; ProcessorNumber++) {
     CpuData = &CpuMpData->CpuData[ProcessorNumber];
     CpuData->Waiting = FALSE;
@@ -2131,7 +2131,7 @@ StartupAllAPsWorker (
         // Mark this processor as responsible for current calling.
         //
         CpuData->Waiting = TRUE;
-        CpuMpData->StartCount++;
+        CpuMpData->RunningCount++;
       }
     }
   }
@@ -2140,7 +2140,6 @@ StartupAllAPsWorker (
   CpuMpData->ProcArguments = ProcedureArgument;
   CpuMpData->SingleThread  = SingleThread;
   CpuMpData->FinishedCount = 0;
-  CpuMpData->RunningCount  = 0;
   CpuMpData->FailedCpuList = FailedCpuList;
   CpuMpData->ExpectedTime  = CalculateTimeout (
                                TimeoutInMicroseconds,
diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.h b/UefiCpuPkg/Library/MpInitLib/MpLib.h
index 90c09fb8fb..ad62acf766 100644
--- a/UefiCpuPkg/Library/MpInitLib/MpLib.h
+++ b/UefiCpuPkg/Library/MpInitLib/MpLib.h
@@ -210,9 +210,8 @@ struct _CPU_MP_DATA {
   UINTN                          BackupBuffer;
   UINTN                          BackupBufferSize;
 
-  volatile UINT32                StartCount;
   volatile UINT32                FinishedCount;
-  volatile UINT32                RunningCount;
+  UINT32                         RunningCount;
   BOOLEAN                        SingleThread;
   EFI_AP_PROCEDURE               Procedure;
   VOID                           *ProcArguments;
-- 
2.15.0.windows.1

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Laszlo Ersek 5 years, 10 months ago
Hi Eric,

On 06/29/18 05:20, Eric Dong wrote:
> Parameter StartCount duplicates with RunningCount. After this change,
> RunningCount means the running AP count.
>
> V2 changes: Remove volatile for RunningCount.
>
> Done Test:
> 1.PI SCT Test
> 2.Boot OS / S3
>
> Cc: Ruiyu Ni <ruiyu.ni@intel.com>
> Cc: Jeff Fan <vanjeff_919@hotmail.com>
> Contributed-under: TianoCore Contribution Agreement 1.1
> Signed-off-by: Eric Dong <eric.dong@intel.com>
> ---
>  UefiCpuPkg/Library/MpInitLib/MpLib.c | 11 +++++------
>  UefiCpuPkg/Library/MpInitLib/MpLib.h |  3 +--
>  2 files changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.c b/UefiCpuPkg/Library/MpInitLib/MpLib.c
> index 3945771764..52c9679099 100644
> --- a/UefiCpuPkg/Library/MpInitLib/MpLib.c
> +++ b/UefiCpuPkg/Library/MpInitLib/MpLib.c
> @@ -1400,7 +1400,7 @@ CheckAllAPs (
>      // value of state after setting the it to CpuStateFinished, so BSP can safely make use of its value.
>      //
>      if (GetApState(CpuData) != CpuStateBusy) {
> -      CpuMpData->RunningCount ++;
> +      CpuMpData->RunningCount --;
>        CpuMpData->CpuData[ProcessorNumber].Waiting = FALSE;
>
>        //
> @@ -1425,7 +1425,7 @@ CheckAllAPs (
>    //
>    // If all APs finish, return EFI_SUCCESS.
>    //
> -  if (CpuMpData->RunningCount == CpuMpData->StartCount) {
> +  if (CpuMpData->RunningCount == 0) {
>      return EFI_SUCCESS;
>    }
>
> @@ -1442,7 +1442,7 @@ CheckAllAPs (
>      //
>      if (CpuMpData->FailedCpuList != NULL) {
>        *CpuMpData->FailedCpuList =
> -         AllocatePool ((CpuMpData->StartCount - CpuMpData->FinishedCount + 1) * sizeof (UINTN));
> +         AllocatePool ((CpuMpData->RunningCount + 1) * sizeof (UINTN));
>        ASSERT (*CpuMpData->FailedCpuList != NULL);
>      }
>      ListIndex = 0;
> @@ -2121,7 +2121,7 @@ StartupAllAPsWorker (
>      return EFI_NOT_STARTED;
>    }
>
> -  CpuMpData->StartCount = 0;
> +  CpuMpData->RunningCount = 0;
>    for (ProcessorNumber = 0; ProcessorNumber < ProcessorCount; ProcessorNumber++) {
>      CpuData = &CpuMpData->CpuData[ProcessorNumber];
>      CpuData->Waiting = FALSE;
> @@ -2131,7 +2131,7 @@ StartupAllAPsWorker (
>          // Mark this processor as responsible for current calling.
>          //
>          CpuData->Waiting = TRUE;
> -        CpuMpData->StartCount++;
> +        CpuMpData->RunningCount++;
>        }
>      }
>    }
> @@ -2140,7 +2140,6 @@ StartupAllAPsWorker (
>    CpuMpData->ProcArguments = ProcedureArgument;
>    CpuMpData->SingleThread  = SingleThread;
>    CpuMpData->FinishedCount = 0;
> -  CpuMpData->RunningCount  = 0;
>    CpuMpData->FailedCpuList = FailedCpuList;
>    CpuMpData->ExpectedTime  = CalculateTimeout (
>                                 TimeoutInMicroseconds,
> diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.h b/UefiCpuPkg/Library/MpInitLib/MpLib.h
> index 90c09fb8fb..ad62acf766 100644
> --- a/UefiCpuPkg/Library/MpInitLib/MpLib.h
> +++ b/UefiCpuPkg/Library/MpInitLib/MpLib.h
> @@ -210,9 +210,8 @@ struct _CPU_MP_DATA {
>    UINTN                          BackupBuffer;
>    UINTN                          BackupBufferSize;
>
> -  volatile UINT32                StartCount;
>    volatile UINT32                FinishedCount;
> -  volatile UINT32                RunningCount;
> +  UINT32                         RunningCount;
>    BOOLEAN                        SingleThread;
>    EFI_AP_PROCEDURE               Procedure;
>    VOID                           *ProcArguments;
>

I got confused by the way you sent out this patch.

First I thought that you meant it separately (stand-alone). I tried to
test it, but it didn't apply. Also my intent was to ask you whether you
wanted to send a new version of

  [edk2] [Patch 1/2] UefiCpuPkg/MpInitLib: Not use disabled AP.


However, upon seeing that this patch wouldn't apply, I'm now thinking
that you would like to preserve

  [edk2] [Patch 1/2] UefiCpuPkg/MpInitLib: Not use disabled AP.

without any changes, and your intent is to only update

  [edk2] [Patch 2/2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

to version 2.


In such cases, please do not post an independent patch. Instead, pick
one of the following:

- Repost the entire series as v2, and mark in the cover letter that
  patch #1 is unchanged from the v1 posting.

- Alternatively, post the patch in response to the *original* v1 thread
  (using --in-reply-to= with git-send-email), and use the subject

  [edk2] [Patch v2 2/2] UefiCpuPkg/MpInitLib: Remove redundant
                        parameter.

Either of these approaches will let reviewers know that you intend the
two patches to go together (and in what order), and that only patch #2
has been updated.


So... Assuming this was indeed your intent, I applied the following two
patches locally, for testing:

  [edk2] [Patch 1_2] UefiCpuPkg_MpInitLib: Not use disabled AP.
  Message-Id: <20180628112920.5296-1-eric.dong@intel.com>
  http://mid.mail-archive.com/20180628112920.5296-1-eric.dong@intel.com

  [edk2] [Patch V2] UefiCpuPkg_MpInitLib: Remove redundant parameter.
  Message-Id: <20180629032047.6340-1-eric.dong@intel.com>
  http://mid.mail-archive.com/20180629032047.6340-1-eric.dong@intel.com
  (i.e., this patch)

I tested the following three scenarios with QEMU/KVM, using normal boot
and S3, and 8 virtual CPUs (1 socket, 4 cores, 2 threads -- same
topology as my physical laptop CPU):

(1) i440fx machine type, X64 build, without SMM
(2) q35 machine type, IA32 build, with SMM
(3) q35 machine type, IA32X64 build, with SMM

The guest OS was Fedora in every case.

The series keeps test case (1) functional.

The series breaks test cases (2) and (3) however; normal boot for each
(S3 is not possible to attempt). The symptom is that, for both cases (2)
and (3), the ExitBootServices() handlers are invoked, but the following
message is never logged:

    MpInitChangeApLoopCallback() done!

and the boot process gets stuck.

With more context, a before/after log file diff, for case (2):

>  VirtioScsiExitBoot: Context=0x7DF3F010
>  SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
> -MpInitChangeApLoopCallback() done!

With more context, a before/after log file diff, for case (3):

>  VirtioScsiExitBoot: Context=0x7DBD4398
>  VirtioRngExitBoot: Context=0x7DC47318
>  SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
> -MpInitChangeApLoopCallback() done!

I guess the patch series breaks something in
MpInitChangeApLoopCallback() when the SMM driver stack is included in
the build.

... After repeated testing, the boot succeeds *sometimes* (rarely). In
most cases, the boot fails as described above.

Thanks,
Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Dong, Eric 5 years, 9 months ago
Hi Laszlo,

I finally succeed to setup the OVMF platform which can verify the boot failure issue.  But on my platform, if I use image build with below command (I assume it is used to enable SMM), the system can't boot to OS (host OS is fedora 25 and guest OS is Ubuntu 18.04). It hang at OS boot phase after ExitBootService point (I can see the console log which should been printed at ExitBootService point, so I think hang should after this point). 
	build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE 

If I use below command to build the image, the system can boot to OS.  
	build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT

Does my OVMF environment still has problem?


When do the above test, I don't include my two patches. Then I include my patches and build the image with SMM enabled, I found I can't reproduce the issue you met. I can find the "MpInitChangeApLoopCallback done!" message in the console log.  Attached the console log.

Can you help to verify the OVMF image build from my side?  Also can you attach the image from you side to let me try? 

Thanks,
Eric

> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
> Laszlo Ersek
> Sent: Friday, June 29, 2018 8:15 PM
> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> Hi Eric,
> 
> On 06/29/18 05:20, Eric Dong wrote:
> > Parameter StartCount duplicates with RunningCount. After this change,
> > RunningCount means the running AP count.
> >
> > V2 changes: Remove volatile for RunningCount.
> >
> > Done Test:
> > 1.PI SCT Test
> > 2.Boot OS / S3
> >
> > Cc: Ruiyu Ni <ruiyu.ni@intel.com>
> > Cc: Jeff Fan <vanjeff_919@hotmail.com>
> > Contributed-under: TianoCore Contribution Agreement 1.1
> > Signed-off-by: Eric Dong <eric.dong@intel.com>
> > ---
> >  UefiCpuPkg/Library/MpInitLib/MpLib.c | 11 +++++------
> > UefiCpuPkg/Library/MpInitLib/MpLib.h |  3 +--
> >  2 files changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.c
> > b/UefiCpuPkg/Library/MpInitLib/MpLib.c
> > index 3945771764..52c9679099 100644
> > --- a/UefiCpuPkg/Library/MpInitLib/MpLib.c
> > +++ b/UefiCpuPkg/Library/MpInitLib/MpLib.c
> > @@ -1400,7 +1400,7 @@ CheckAllAPs (
> >      // value of state after setting the it to CpuStateFinished, so BSP can safely
> make use of its value.
> >      //
> >      if (GetApState(CpuData) != CpuStateBusy) {
> > -      CpuMpData->RunningCount ++;
> > +      CpuMpData->RunningCount --;
> >        CpuMpData->CpuData[ProcessorNumber].Waiting = FALSE;
> >
> >        //
> > @@ -1425,7 +1425,7 @@ CheckAllAPs (
> >    //
> >    // If all APs finish, return EFI_SUCCESS.
> >    //
> > -  if (CpuMpData->RunningCount == CpuMpData->StartCount) {
> > +  if (CpuMpData->RunningCount == 0) {
> >      return EFI_SUCCESS;
> >    }
> >
> > @@ -1442,7 +1442,7 @@ CheckAllAPs (
> >      //
> >      if (CpuMpData->FailedCpuList != NULL) {
> >        *CpuMpData->FailedCpuList =
> > -         AllocatePool ((CpuMpData->StartCount - CpuMpData->FinishedCount
> + 1) * sizeof (UINTN));
> > +         AllocatePool ((CpuMpData->RunningCount + 1) * sizeof
> > + (UINTN));
> >        ASSERT (*CpuMpData->FailedCpuList != NULL);
> >      }
> >      ListIndex = 0;
> > @@ -2121,7 +2121,7 @@ StartupAllAPsWorker (
> >      return EFI_NOT_STARTED;
> >    }
> >
> > -  CpuMpData->StartCount = 0;
> > +  CpuMpData->RunningCount = 0;
> >    for (ProcessorNumber = 0; ProcessorNumber < ProcessorCount;
> ProcessorNumber++) {
> >      CpuData = &CpuMpData->CpuData[ProcessorNumber];
> >      CpuData->Waiting = FALSE;
> > @@ -2131,7 +2131,7 @@ StartupAllAPsWorker (
> >          // Mark this processor as responsible for current calling.
> >          //
> >          CpuData->Waiting = TRUE;
> > -        CpuMpData->StartCount++;
> > +        CpuMpData->RunningCount++;
> >        }
> >      }
> >    }
> > @@ -2140,7 +2140,6 @@ StartupAllAPsWorker (
> >    CpuMpData->ProcArguments = ProcedureArgument;
> >    CpuMpData->SingleThread  = SingleThread;
> >    CpuMpData->FinishedCount = 0;
> > -  CpuMpData->RunningCount  = 0;
> >    CpuMpData->FailedCpuList = FailedCpuList;
> >    CpuMpData->ExpectedTime  = CalculateTimeout (
> >                                 TimeoutInMicroseconds, diff --git
> > a/UefiCpuPkg/Library/MpInitLib/MpLib.h
> > b/UefiCpuPkg/Library/MpInitLib/MpLib.h
> > index 90c09fb8fb..ad62acf766 100644
> > --- a/UefiCpuPkg/Library/MpInitLib/MpLib.h
> > +++ b/UefiCpuPkg/Library/MpInitLib/MpLib.h
> > @@ -210,9 +210,8 @@ struct _CPU_MP_DATA {
> >    UINTN                          BackupBuffer;
> >    UINTN                          BackupBufferSize;
> >
> > -  volatile UINT32                StartCount;
> >    volatile UINT32                FinishedCount;
> > -  volatile UINT32                RunningCount;
> > +  UINT32                         RunningCount;
> >    BOOLEAN                        SingleThread;
> >    EFI_AP_PROCEDURE               Procedure;
> >    VOID                           *ProcArguments;
> >
> 
> I got confused by the way you sent out this patch.
> 
> First I thought that you meant it separately (stand-alone). I tried to test it, but
> it didn't apply. Also my intent was to ask you whether you wanted to send a
> new version of
> 
>   [edk2] [Patch 1/2] UefiCpuPkg/MpInitLib: Not use disabled AP.
> 
> 
> However, upon seeing that this patch wouldn't apply, I'm now thinking that
> you would like to preserve
> 
>   [edk2] [Patch 1/2] UefiCpuPkg/MpInitLib: Not use disabled AP.
> 
> without any changes, and your intent is to only update
> 
>   [edk2] [Patch 2/2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
> 
> to version 2.
> 
> 
> In such cases, please do not post an independent patch. Instead, pick one of
> the following:
> 
> - Repost the entire series as v2, and mark in the cover letter that
>   patch #1 is unchanged from the v1 posting.
> 
> - Alternatively, post the patch in response to the *original* v1 thread
>   (using --in-reply-to= with git-send-email), and use the subject
> 
>   [edk2] [Patch v2 2/2] UefiCpuPkg/MpInitLib: Remove redundant
>                         parameter.
> 
> Either of these approaches will let reviewers know that you intend the two
> patches to go together (and in what order), and that only patch #2 has been
> updated.
> 
> 
> So... Assuming this was indeed your intent, I applied the following two patches
> locally, for testing:
> 
>   [edk2] [Patch 1_2] UefiCpuPkg_MpInitLib: Not use disabled AP.
>   Message-Id: <20180628112920.5296-1-eric.dong@intel.com>
>   http://mid.mail-archive.com/20180628112920.5296-1-eric.dong@intel.com
> 
>   [edk2] [Patch V2] UefiCpuPkg_MpInitLib: Remove redundant parameter.
>   Message-Id: <20180629032047.6340-1-eric.dong@intel.com>
>   http://mid.mail-archive.com/20180629032047.6340-1-eric.dong@intel.com
>   (i.e., this patch)
> 
> I tested the following three scenarios with QEMU/KVM, using normal boot
> and S3, and 8 virtual CPUs (1 socket, 4 cores, 2 threads -- same topology as
> my physical laptop CPU):
> 
> (1) i440fx machine type, X64 build, without SMM
> (2) q35 machine type, IA32 build, with SMM
> (3) q35 machine type, IA32X64 build, with SMM
> 
> The guest OS was Fedora in every case.
> 
> The series keeps test case (1) functional.
> 
> The series breaks test cases (2) and (3) however; normal boot for each
> (S3 is not possible to attempt). The symptom is that, for both cases (2) and (3),
> the ExitBootServices() handlers are invoked, but the following message is
> never logged:
> 
>     MpInitChangeApLoopCallback() done!
> 
> and the boot process gets stuck.
> 
> With more context, a before/after log file diff, for case (2):
> 
> >  VirtioScsiExitBoot: Context=0x7DF3F010
> >  SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
> > -MpInitChangeApLoopCallback() done!
> 
> With more context, a before/after log file diff, for case (3):
> 
> >  VirtioScsiExitBoot: Context=0x7DBD4398
> >  VirtioRngExitBoot: Context=0x7DC47318
> >  SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
> > -MpInitChangeApLoopCallback() done!
> 
> I guess the patch series breaks something in
> MpInitChangeApLoopCallback() when the SMM driver stack is included in the
> build.
> 
> ... After repeated testing, the boot succeeds *sometimes* (rarely). In most
> cases, the boot fails as described above.
> 
> Thanks,
> Laszlo
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Laszlo Ersek 5 years, 9 months ago
Hi Eric,

apologies about the delay.

On 07/18/18 14:59, Dong, Eric wrote:
> Hi Laszlo,
>
> I finally succeed to setup the OVMF platform which can verify the boot
> failure issue.  But on my platform, if I use image build with below
> command (I assume it is used to enable SMM), the system can't boot to
> OS (host OS is fedora 25 and guest OS is Ubuntu 18.04). It hang at OS
> boot phase after ExitBootService point (I can see the console log
> which should been printed at ExitBootService point, so I think hang
> should after this point).
> 	build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
>
> If I use below command to build the image, the system can boot to OS.
> 	build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT
>
> Does my OVMF environment still has problem?
>
>
> When do the above test, I don't include my two patches.

Yes, I think this host environment is still problematic. Namely, the
latest QEMU version shipped in Fedora 25 is QEMU-2.7:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=918114

and QEMU-2.7 does not have a feature that is important for SMM
stability. This feature is called "SMI broadcast".

In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
member function raises an SMI, by writing to IO port 0xB2
(ICH9_APM_CNT).

Originally, QEMU would raise the SMI synchronously only on the sole VCPU
that called Trigger(). Then, the edk2 SMM driver stack would have to
pull the other processors explicitly into SMM (via APIC accesses, if I
remember correctly). This was extremely slow (the processor first
raising the SMI would wait for a long time for the other processors to
show up in SMM, before it would decide to pull them in with APIC
writes). Also when we switched the edk2 SMM sync mode to "relaxed", the
results remained very unstable. We decided that edk2 supported the
"traditional" SMM sync mode much better, and so we implemented "SMI
broadcast" in QEMU, to satisfy that sync mode.

(My memories are a bit fuzzy at this point; you can read more in the
following RH Bugzilla entries:

  https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
  https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])

The idea of "SMI broadcast" is that, regardless of which VCPU triggers
the SMI, QEMU raises the SMI immediately on all VCPUs. This made a
*huge* difference for the performance and the stability of the edk2 SMM
driver stack, used in OVMF and on QEMU/KVM.

Now, in order to be able to use old OVMF on new QEMU and vice versa,
this feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
QEMU. (The feature is not enabled by default, and without "SMI
broadcast", the "relaxed" sync method is slightly less broken than the
"tradiational" method, so OVMF defaults to that. With the feature
enabled, the "traditional" mode is better -- that config is the absolute
best of all four possible combinations.)

More precisely, on the QEMU side, the feature is not tied to a QEMU
release, but to Q35 *machine type versions*. Therefore, in order to
benefit from the feature, you need all of the following:

- a recent enough OVMF,
- a recent enough QEMU release,
- a recent enough Q35 machine type, specified on the QEMU command line.

The particular minimum machine type is "pc-q35-2.9" (which is clearly
only provided by QEMU-2.9 and later). The machine type requirement is
automatically satisfied if you use QEMU-2.9+, and just request the "q35"
machine type. (Without an explicit machtype version number, the highest
one supported by the QEMU release will be picked.)

The lack of this feature in your environment is confirmed by your OVMF
log:

> NegotiateSmiFeatures: SMI feature negotiation unavailable

If the feature is available, you will see the following two messages
instead:

  NegotiateSmiFeatures: using SMI broadcast
  [...]
  AppendFwCfgBootScript: SMI feature negotiation boot script saved

(The second message only appears if you have S3 enabled -- at S3 resume,
the feature has to be re-enabled, so SmmControl2Dxe saves a boot script
fragment for that.)

Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9
is shipped:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=986762

... It's even better if you can upgrade to Fedora 27, as Fedora 27 is
the oldest Fedora release still supported at this point. The following
article describes the recommended upgrade method:

  https://fedoraproject.org/wiki/DNF_system_upgrade

> Then I include my patches and build the image with SMM enabled, I
> found I can't reproduce the issue you met. I can find the
> "MpInitChangeApLoopCallback done!" message in the console log.
> Attached the console log.

Yes, I can see "MpInitChangeApLoopCallback() done" in the log.

> Can you help to verify the OVMF image build from my side?

Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
reproduces the same issue that I reported, on my end. Out of 10
subsequent attempts, it only succeeded to boot the OS 3 times (attempts
#1, #8 and #10). In the failed cases, the log always ends like this:

  MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
  RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
  <HANG>

That is, one of the APs fails to show up. It always changes which one is
missing; for example, another failure:

  MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
  RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 7 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
  <HANG>

My laptop that I use for testing has 1 socket, 4 cores, and 2 threads.
This is the same VCPU configuration that I use for the guest (hence the
1 BSP + 7 AP config seen above). I got the idea that perhaps the host
was slightly over-subscribed (= more VCPU work than the physical
processors can serve in "near real time"), and so I changed the guest
config to 1 socket, 2 cores, and 2 threads (= 1 BSP + 3 APs).
Unfortunately, the issue reproduced in this config as well, at the 4th
try:

  MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
  RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
  RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
  <HANG>

Just to be sure, I tested a fresh build (without the patches); that
booted the OS fine (10 out of 10).

I think something in the code is sensitive to timing, or lacks some kind
of synchronization. One of the APs may sometimes be missed. I guess it's
possible that the SMI broadcast feature, when enabled, helps expose the
problem.

Thanks,
Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Dong, Eric 5 years, 9 months ago
Hi Laszlo,


> -----Original Message-----
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Friday, July 20, 2018 1:01 AM
> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> Hi Eric,
> 
> apologies about the delay.
> 
> On 07/18/18 14:59, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I finally succeed to setup the OVMF platform which can verify the boot
> > failure issue.  But on my platform, if I use image build with below
> > command (I assume it is used to enable SMM), the system can't boot to
> > OS (host OS is fedora 25 and guest OS is Ubuntu 18.04). It hang at OS
> > boot phase after ExitBootService point (I can see the console log
> > which should been printed at ExitBootService point, so I think hang
> > should after this point).
> > 	build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
> >
> > If I use below command to build the image, the system can boot to OS.
> > 	build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT
> >
> > Does my OVMF environment still has problem?
> >
> >
> > When do the above test, I don't include my two patches.
> 
> Yes, I think this host environment is still problematic. Namely, the latest
> QEMU version shipped in Fedora 25 is QEMU-2.7:
> 
>   https://koji.fedoraproject.org/koji/buildinfo?buildID=918114
> 
> and QEMU-2.7 does not have a feature that is important for SMM stability.
> This feature is called "SMI broadcast".
> 
> In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
> EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
> member function raises an SMI, by writing to IO port 0xB2 (ICH9_APM_CNT).
> 
> Originally, QEMU would raise the SMI synchronously only on the sole VCPU
> that called Trigger(). Then, the edk2 SMM driver stack would have to pull the
> other processors explicitly into SMM (via APIC accesses, if I remember
> correctly). This was extremely slow (the processor first raising the SMI would
> wait for a long time for the other processors to show up in SMM, before it
> would decide to pull them in with APIC writes). Also when we switched the
> edk2 SMM sync mode to "relaxed", the results remained very unstable. We
> decided that edk2 supported the "traditional" SMM sync mode much better,
> and so we implemented "SMI broadcast" in QEMU, to satisfy that sync mode.
> 
> (My memories are a bit fuzzy at this point; you can read more in the following
> RH Bugzilla entries:
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])
> 
> The idea of "SMI broadcast" is that, regardless of which VCPU triggers the
> SMI, QEMU raises the SMI immediately on all VCPUs. This made a
> *huge* difference for the performance and the stability of the edk2 SMM
> driver stack, used in OVMF and on QEMU/KVM.
> 
> Now, in order to be able to use old OVMF on new QEMU and vice versa, this
> feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
> QEMU. (The feature is not enabled by default, and without "SMI broadcast",
> the "relaxed" sync method is slightly less broken than the "tradiational"
> method, so OVMF defaults to that. With the feature enabled, the "traditional"
> mode is better -- that config is the absolute best of all four possible
> combinations.)
> 
> More precisely, on the QEMU side, the feature is not tied to a QEMU release,
> but to Q35 *machine type versions*. Therefore, in order to benefit from the
> feature, you need all of the following:
> 
> - a recent enough OVMF,
> - a recent enough QEMU release,
> - a recent enough Q35 machine type, specified on the QEMU command line.
> 
> The particular minimum machine type is "pc-q35-2.9" (which is clearly only
> provided by QEMU-2.9 and later). The machine type requirement is
> automatically satisfied if you use QEMU-2.9+, and just request the "q35"
> machine type. (Without an explicit machtype version number, the highest one
> supported by the QEMU release will be picked.)
> 
> The lack of this feature in your environment is confirmed by your OVMF
> log:
> 
> > NegotiateSmiFeatures: SMI feature negotiation unavailable
> 
> If the feature is available, you will see the following two messages
> instead:
> 
>   NegotiateSmiFeatures: using SMI broadcast
>   [...]
>   AppendFwCfgBootScript: SMI feature negotiation boot script saved
> 
> (The second message only appears if you have S3 enabled -- at S3 resume, the
> feature has to be re-enabled, so SmmControl2Dxe saves a boot script
> fragment for that.)
> 
> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9 is
> shipped:
> 
>   https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
> 
> ... It's even better if you can upgrade to Fedora 27, as Fedora 27 is the oldest
> Fedora release still supported at this point. The following article describes the
> recommended upgrade method:
> 
>   https://fedoraproject.org/wiki/DNF_system_upgrade
> 

I updated the system to fedora 28, but it failed to boot. :(  so I borrowed an exited fedora 27 DVD and installed it. With this OS, I can reproduce this issue now. I found this issue is an random issue, I booted 5 times and met the issue.  I'm checking the issue.

> > Then I include my patches and build the image with SMM enabled, I
> > found I can't reproduce the issue you met. I can find the
> > "MpInitChangeApLoopCallback done!" message in the console log.
> > Attached the console log.
> 
> Yes, I can see "MpInitChangeApLoopCallback() done" in the log.
> 
> > Can you help to verify the OVMF image build from my side?
> 
> Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
> reproduces the same issue that I reported, on my end. Out of 10 subsequent
> attempts, it only succeeded to boot the OS 3 times (attempts #1, #8 and #10).
> In the failed cases, the log always ends like this:
> 
>   MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
>   <HANG>
> 
> That is, one of the APs fails to show up. It always changes which one is missing;
> for example, another failure:
> 
>   MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 7 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
>   <HANG>
> 
> My laptop that I use for testing has 1 socket, 4 cores, and 2 threads.
> This is the same VCPU configuration that I use for the guest (hence the
> 1 BSP + 7 AP config seen above). I got the idea that perhaps the host was
> slightly over-subscribed (= more VCPU work than the physical processors can
> serve in "near real time"), and so I changed the guest config to 1 socket, 2
> cores, and 2 threads (= 1 BSP + 3 APs).
> Unfortunately, the issue reproduced in this config as well, at the 4th
> try:
> 
>   MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
>   <HANG>
> 
> Just to be sure, I tested a fresh build (without the patches); that booted the OS
> fine (10 out of 10).
> 
> I think something in the code is sensitive to timing, or lacks some kind of
> synchronization. One of the APs may sometimes be missed. I guess it's
> possible that the SMI broadcast feature, when enabled, helps expose the
> problem.
> 

Good message.  I'm investigating this issue and will be back when I root caused it.

> Thanks,
> Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Laszlo Ersek 5 years, 9 months ago
On 07/20/18 08:53, Dong, Eric wrote:
>> -----Original Message----- From: Laszlo Ersek
>> [mailto:lersek@redhat.com]

>> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU
>> 2.9 is shipped:
>> 
>> https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
>> 
>> ... It's even better if you can upgrade to Fedora 27, as Fedora 27
>> is the oldest Fedora release still supported at this point. The
>> following article describes the recommended upgrade method:
>> 
>> https://fedoraproject.org/wiki/DNF_system_upgrade
>> 
> 
> I updated the system to fedora 28, but it failed to boot. :(  so I
> borrowed an exited fedora 27 DVD and installed it. With this OS, I
> can reproduce this issue now. I found this issue is an random issue,
> I booted 5 times and met the issue.  I'm checking the issue.

Awesome!

(I'm not happy about the problem itself, of course, but I'm *very*
thankful that you took the time to install a Linux box, for testing with
KVM!!!)

Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Dong, Eric 5 years, 9 months ago
Hi Laszlo,

I have root cause this issue, the AP hangs in the procedure when PiSmmCpuDxeSmm driver start up trigged this issue.

When PiSmmCpuDxeSmm driver start up, it will call StartAllAps to set memory attribute.  In StartAllAps function, after call WakeUpAp to start Aps, it calls CheckAllAps to wait all Aps finished the task. In CheckAllAps function, it detect AP state to know whether the AP has finished its task. In old code, it check whether the AP state is CpuStateFinished to know whether AP has finished tasks. This state is only set by AP when it truly finished task. In new logic, CpuStateFinished been replace with CpuStateIdle. And CpuStateIdle is also the begin state of the AP. AP will change state from CpuStateIdle to CpuStateBusy when it start execute the procedure. And after it finished the procedure, it will change state back to CpuStateIdle.

So when the hang issue raised, AP state is not been changed to CpuStateBusy when BSP calls CheckAllAps to check whether the AP has finished its task. So the state for the AP still in CpuStateIdle, but BSP think AP has finished its task. In this case, BSP think all the Aps has finished their tasks and it continues boot. But some AP may wake up later and it failed to return from the procedure. In this case, the AP state keeps at CpuStateBusy. So later in ChangeApLoopCallback function, because this AP state still in CpuStateBusy, this AP will not trig the procedure. But BSP wait all APs to trig the procedure(BSP wait the Aps to reduce the mNumberToFinish value in procedure to continue boot) to continue the boot, so the hang occurred.

I think we should keep a middle state to let us know whether the AP truly finished its task. I will send  another serial patch for this issue. Please help to check the new patches.

Thanks,
Eric

> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
> Laszlo Ersek
> Sent: Saturday, July 21, 2018 12:30 AM
> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> On 07/20/18 08:53, Dong, Eric wrote:
> >> -----Original Message----- From: Laszlo Ersek
> >> [mailto:lersek@redhat.com]
> 
> >> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU
> >> 2.9 is shipped:
> >>
> >> https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
> >>
> >> ... It's even better if you can upgrade to Fedora 27, as Fedora 27 is
> >> the oldest Fedora release still supported at this point. The
> >> following article describes the recommended upgrade method:
> >>
> >> https://fedoraproject.org/wiki/DNF_system_upgrade
> >>
> >
> > I updated the system to fedora 28, but it failed to boot. :(  so I
> > borrowed an exited fedora 27 DVD and installed it. With this OS, I can
> > reproduce this issue now. I found this issue is an random issue, I
> > booted 5 times and met the issue.  I'm checking the issue.
> 
> Awesome!
> 
> (I'm not happy about the problem itself, of course, but I'm *very* thankful
> that you took the time to install a Linux box, for testing with
> KVM!!!)
> 
> Laszlo
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Laszlo Ersek 5 years, 9 months ago
On 07/25/18 05:50, Dong, Eric wrote:
> Hi Laszlo,
>
> I have root cause this issue, the AP hangs in the procedure when
> PiSmmCpuDxeSmm driver start up trigged this issue.
>
> When PiSmmCpuDxeSmm driver start up, it will call StartAllAps to set
> memory attribute.  In StartAllAps function, after call WakeUpAp to
> start Aps, it calls CheckAllAps to wait all Aps finished the task. In
> CheckAllAps function, it detect AP state to know whether the AP has
> finished its task. In old code, it check whether the AP state is
> CpuStateFinished to know whether AP has finished tasks. This state is
> only set by AP when it truly finished task. In new logic,
> CpuStateFinished been replace with CpuStateIdle. And CpuStateIdle is
> also the begin state of the AP. AP will change state from CpuStateIdle
> to CpuStateBusy when it start execute the procedure. And after it
> finished the procedure, it will change state back to CpuStateIdle.
>
> So when the hang issue raised, AP state is not been changed to
> CpuStateBusy when BSP calls CheckAllAps to check whether the AP has
> finished its task. So the state for the AP still in CpuStateIdle, but
> BSP think AP has finished its task. In this case, BSP think all the
> Aps has finished their tasks and it continues boot.

Awesome analysis!

So, this looks like an "inverse" variant of the classic "ABA problem":

https://en.wikipedia.org/wiki/ABA_problem

> But some AP may wake up later and it failed to return from the
> procedure.

Ah! So that explains another symptom I've since seen as well -- although
*very* rarely. Namely, if an AP wakes up *after* PiSmmCpuDxeSmm moves
on, thinking that all APs are finished, the AP can execute garbage in
"no man's land" -- and that crashes the guest. Basically, QEMU/KVM pause
the guest with "emulation failure", and QEMU dumps the VCPU register
state to the standard error on the host side. In particular, the
register state indicates that the crashed VCPU is *not* in SMM.

When I first encountered this symptom now, while playing some more with
your patches, it reminded me of earlier problems with MpInitLib. And now
your analysis makes perfect sense of this additional symptom!

> In this case, the AP state keeps at CpuStateBusy. So later in
> ChangeApLoopCallback function, because this AP state still in
> CpuStateBusy, this AP will not trig the procedure. But BSP wait all
> APs to trig the procedure(BSP wait the Aps to reduce the
> mNumberToFinish value in procedure to continue boot) to continue the
> boot, so the hang occurred.

This completes the explanation.

> I think we should keep a middle state to let us know whether the AP
> truly finished its task. I will send  another serial patch for this
> issue. Please help to check the new patches.

Yes, I'll test them too.

Thanks!
Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Dong, Eric 5 years, 9 months ago
Hi Laszlo,


> -----Original Message-----
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Wednesday, July 25, 2018 6:14 PM
> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> On 07/25/18 05:50, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I have root cause this issue, the AP hangs in the procedure when
> > PiSmmCpuDxeSmm driver start up trigged this issue.
> >
> > When PiSmmCpuDxeSmm driver start up, it will call StartAllAps to set
> > memory attribute.  In StartAllAps function, after call WakeUpAp to
> > start Aps, it calls CheckAllAps to wait all Aps finished the task. In
> > CheckAllAps function, it detect AP state to know whether the AP has
> > finished its task. In old code, it check whether the AP state is
> > CpuStateFinished to know whether AP has finished tasks. This state is
> > only set by AP when it truly finished task. In new logic,
> > CpuStateFinished been replace with CpuStateIdle. And CpuStateIdle is
> > also the begin state of the AP. AP will change state from CpuStateIdle
> > to CpuStateBusy when it start execute the procedure. And after it
> > finished the procedure, it will change state back to CpuStateIdle.
> >
> > So when the hang issue raised, AP state is not been changed to
> > CpuStateBusy when BSP calls CheckAllAps to check whether the AP has
> > finished its task. So the state for the AP still in CpuStateIdle, but
> > BSP think AP has finished its task. In this case, BSP think all the
> > Aps has finished their tasks and it continues boot.
> 
> Awesome analysis!
> 
> So, this looks like an "inverse" variant of the classic "ABA problem":
> 
> https://en.wikipedia.org/wiki/ABA_problem

Yes, it's an ABA problem. New patch keep the CpuStateReady to distinguish from the CpuStateIdle state.

> 
> > But some AP may wake up later and it failed to return from the
> > procedure.
> 
> Ah! So that explains another symptom I've since seen as well -- although
> *very* rarely. Namely, if an AP wakes up *after* PiSmmCpuDxeSmm moves
> on, thinking that all APs are finished, the AP can execute garbage in "no man's
> land" -- and that crashes the guest. Basically, QEMU/KVM pause the guest
> with "emulation failure", and QEMU dumps the VCPU register state to the
> standard error on the host side. In particular, the register state indicates that
> the crashed VCPU is *not* in SMM.

Where to get these QEMU dumps info?

I'm not clear who calls StartAllAPs function when I send this mail, I just based on the debug message found StartAllAps trigged right after PiSmmCpuDxeSmm driver. but I think this not blocks my patches for this issue. So I sent the patches and back to dig it more.

Now I found it's PiSmmIpl driver calls gDS->SetMemorySpaceAttributes in SmmIplDxeDispatchEventNotify function which finally calls the StartAllAps.
The Ap procedure is SetMtrrsFromBuffer function in CpuDxe driver.  In SetMtrrsFromBuffer function, it just call MtrrSetAllMtrrs to update the MTRR, and it use local variable to save the MTRR settings. 

When the hang issue raised, in the time the AP begin to run the procedure, the BSP has leave current function. So the saved MTRR setting in the local variable is not valid anymore. So the MtrrSetAllMtrrs function use an invalid Mtrr Setttings to set mtrrs and never exit the procedure. Do you think it's reasonable?

I found till the ExitBootService point, the AP still in busy state. I check the ApWakeupFunction function, found between the AP procedure and set AP state to Idle, no other possible code exist. So I think the AP should not return back from procedure. I try to add debug message to know where the AP is stopped at. But after I add debug message in MtrrSetAllMtrrs, the issue can't reproduce anymore. I tried about 30 times. Do you have other message to get to know where the AP is?

Thanks,
Eric

> 
> When I first encountered this symptom now, while playing some more with
> your patches, it reminded me of earlier problems with MpInitLib. And now
> your analysis makes perfect sense of this additional symptom!
> 
> > In this case, the AP state keeps at CpuStateBusy. So later in
> > ChangeApLoopCallback function, because this AP state still in
> > CpuStateBusy, this AP will not trig the procedure. But BSP wait all
> > APs to trig the procedure(BSP wait the Aps to reduce the
> > mNumberToFinish value in procedure to continue boot) to continue the
> > boot, so the hang occurred.
> 
> This completes the explanation.
> 
> > I think we should keep a middle state to let us know whether the AP
> > truly finished its task. I will send  another serial patch for this
> > issue. Please help to check the new patches.
> 
> Yes, I'll test them too.
> 
> Thanks!
> Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
Posted by Laszlo Ersek 5 years, 9 months ago
On 07/25/18 13:35, Dong, Eric wrote:

>> From: Laszlo Ersek [mailto:lersek@redhat.com]
>> Sent: Wednesday, July 25, 2018 6:14 PM
>> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
>> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>

>> On 07/25/18 05:50, Dong, Eric wrote:

>>> But some AP may wake up later and it failed to return from the
>>> procedure.
>>
>> Ah! So that explains another symptom I've since seen as well --
>> although *very* rarely. Namely, if an AP wakes up *after*
>> PiSmmCpuDxeSmm moves on, thinking that all APs are finished, the AP
>> can execute garbage in "no man's land" -- and that crashes the guest.
>> Basically, QEMU/KVM pause the guest with "emulation failure", and
>> QEMU dumps the VCPU register state to the standard error on the host
>> side. In particular, the register state indicates that the crashed
>> VCPU is *not* in SMM.
>
> Where to get these QEMU dumps info?

QEMU simply writes the register dump to stderr. If you use QEMU together
with libvirt, then QEMU's stderr is saved in:

  /var/log/libvirt/qemu/$DOMAIN.log

Here's an example:

  KVM internal error. Suberror: 1
  emulation failure
  RAX=0000000000000000 RBX=0000000000000005 RCX=000000007eab9828 RDX=0000000000000002
  RSI=000000000009f000 RDI=000000007eef2ae0 RBP=000000007eef2e40 RSP=000000007eef2a08
  R8 =0000000000000002 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
  R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
  RIP=000000007ea9cdb2 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
  SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
  TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
  GDT=     000000007ebed998 00000047
  IDT=     000000007e237280 00000fff
  CR0=c0010033 CR2=0000000000000000 CR3=000000007ec01000 CR4=00000668
  DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000500
  Code=89 d0 5f c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 90 <c3> cc cc cc cc cc cc cc cc cc cc cc cc cc 53 89 c8 89 d1 50 0f a2 4c 8b 54 24 38 4d 85 d2

(Anyway, this is just for general discussion now -- your v3 series works
perfectly for me.)

> I'm not clear who calls StartAllAPs function when I send this mail, I
> just based on the debug message found StartAllAps trigged right after
> PiSmmCpuDxeSmm driver. but I think this not blocks my patches for this
> issue. So I sent the patches and back to dig it more.

Sure, I think your analysis was spot-on, regardless of how you produced
it.

The register dumps / emulation failures that I mentioned in my previous
email in this thread are not a new issue with your v3 posting. The v3
series is fine. Instead, these failures were a *further* (but less
frequent) symptom with the *original* series, and I encountered them
separately from the boot hangs that I reported at first.

> Now I found it's PiSmmIpl driver calls gDS->SetMemorySpaceAttributes
> in SmmIplDxeDispatchEventNotify function which finally calls the
> StartAllAps.
> The Ap procedure is SetMtrrsFromBuffer function in CpuDxe driver.  In
> SetMtrrsFromBuffer function, it just call MtrrSetAllMtrrs to update
> the MTRR, and it use local variable to save the MTRR settings.
>
> When the hang issue raised, in the time the AP begin to run the
> procedure, the BSP has leave current function. So the saved MTRR
> setting in the local variable is not valid anymore. So the
> MtrrSetAllMtrrs function use an invalid Mtrr Setttings to set mtrrs
> and never exit the procedure. Do you think it's reasonable?

Absolutely. Basically, if an AP is allowed to continue running the user
procedure after the BSP in MpInitLib thinks that the AP is finished,
anything at all can happen -- the escapes control, it may access memory
that has been freed, and perhaps it might even execute code that has
been freed and overwritten. The actual symptoms can vary wildly.

> I found till the ExitBootService point, the AP still in busy state. I
> check the ApWakeupFunction function, found between the AP procedure
> and set AP state to Idle, no other possible code exist. So I think the
> AP should not return back from procedure. I try to add debug message
> to know where the AP is stopped at. But after I add debug message in
> MtrrSetAllMtrrs, the issue can't reproduce anymore. I tried about 30
> times. Do you have other message to get to know where the AP is?

No, I don't have any other ideas at hand to deduce where the AP was
stuck -- as far as I'm concerned, it had just wandered off into the
weeds :)

I think your idea to check the AP state at the time of the
ExitBootServices handler hang was great.

So, again, in this thread I didn't mean to report a new issue, I just
meant to report another -- less frequent -- symptom of the same "ABA"
issue.

Everything is fine with your v3 posting as much as I can tell (just
please update the commit messages before you push).

Thanks!
Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel