[edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Michael Kinney posted 1 patch 7 years, 6 months ago
[edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Michael Kinney 7 years, 6 months ago
https://bugzilla.tianocore.org/show_bug.cgi?id=507

Cc: Jaben Carsey <jaben.carsey@intel.com>
Cc: Yonghong Zhu <yonghong.zhu@intel.com>
Cc: Kevin W Shaw <kevin.w.shaw@intel.com>
Contributed-under: TianoCore Contribution Agreement 1.1
Signed-off-by: Michael Kinney <michael.d.kinney@intel.com>
---
 2_unicode_strings_file_format.md |  9 ++++++---
 README.md                        | 27 ++++++++++++++-------------
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/2_unicode_strings_file_format.md b/2_unicode_strings_file_format.md
index 0150c85..7a4a019 100644
--- a/2_unicode_strings_file_format.md
+++ b/2_unicode_strings_file_format.md
@@ -33,7 +33,8 @@
 
 EDK II Unicode files are used for mapping token names to localized strings that
 are identified by an RFC4646 language code. The format for storing EDK II
-Unicode files is UTF-16LE. The character content must be UCS-2.
+Unicode files on disk is UTF-8 (without a BOM character) or UTF-16LE (with a BOM
+character). The character content must be UCS-2.
 
 Strings ends are determined by the first of the following items found:
 
@@ -44,11 +45,13 @@ Strings ends are determined by the first of the following items found:
 
 Comments may appear anywhere within the string file.
 
-All the files must begin with a Unicode BOM character.
+All UTF-16LE files must begin with a Unicode BOM character.
+All UTF-8 files must not begin with a Unicode BOM character.
 
 **********
 **NOTE:** Please make sure you select an editor that supports UCS-2 characters
-that can be stored in a UTF-16LE file.
+that can be stored in either a UTF-8 (without a BOM character) or a UTF-16LE
+file (with a BOM character).
 **********
 
 ## 2.1 Common EBNF
diff --git a/README.md b/README.md
index 63842a1..015aef1 100644
--- a/README.md
+++ b/README.md
@@ -77,16 +77,17 @@ Copyright (c) 2016-2017, Intel Corporation. All rights reserved.
 
 ### Revision History
 
-| Revision          | Description                                                                              | Date            |
-| ----------------- | ---------------------------------------------------------------------------------------- | --------------- |
-| 1.0               | Initial Release.                                                                         | February 2014   |
-| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project.                    | August 2014     |
-|                   | Added content related to EDK II Meta-Data Unicode files.                                 |                 |
-|                   | Restructured document.                                                                   |                 |
-|                   | Removed security and C format GUID definitions, not required for HII or other UNI files. |                 |
-|                   | Removed invalid escape code sequences.                                                   |                 |
-| 1.2               | Added optional font formatting                                                           | September 2014  |
-| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`                                     | April 2015      |
-| 1.3               | Added: Syntax for non-ascii characters inside quoted strings.                            | March 2016      |
-|                   | Removed: Info on specific consumers (.INF & .DEC) removed.                               |                 |
-| 1.4               | Convert to GitBook format                                                                | March 2017      |
+| Revision          | Description                                                                                                            | Date            |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------------- | --------------- |
+| 1.0               | Initial Release.                                                                                                       | February 2014   |
+| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project.                                                  | August 2014     |
+|                   | Added content related to EDK II Meta-Data Unicode files.                                                               |                 |
+|                   | Restructured document.                                                                                                 |                 |
+|                   | Removed security and C format GUID definitions, not required for HII or other UNI files.                               |                 |
+|                   | Removed invalid escape code sequences.                                                                                 |                 |
+| 1.2               | Added optional font formatting                                                                                         | September 2014  |
+| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`                                                                   | April 2015      |
+| 1.3               | Added: Syntax for non-ascii characters inside quoted strings.                                                          | March 2016      |
+|                   | Removed: Info on specific consumers (.INF & .DEC) removed.                                                             |                 |
+| 1.4               | Convert to GitBook format                                                                                              | April 2017      |
+|                   | [#507](https://bugzilla.tianocore.org/show_bug.cgi?id=507) UNI Spec: Clarify that .uni files maybe UTF-8 without a BOM |                 |
-- 
2.6.3.windows.1

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
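
As a rough illustration of the rule the patch proposes (UTF-16LE with a BOM, or UTF-8 without one, with UCS-2 character content), a reader could choose the decoder like this. It is only a sketch and not the actual EDK II build tool code; the function name `decode_uni` and the error messages are hypothetical.

```python
import codecs


def decode_uni(data: bytes) -> str:
    """Decode raw .uni file bytes under the rule proposed in the patch.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        # UTF-16LE files must begin with a BOM; drop it, then decode.
        text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF8):
        # The proposed text says UTF-8 files must not begin with a BOM.
        raise ValueError("UTF-8 .uni file must not begin with a BOM")
    else:
        # No BOM: treat the file as UTF-8 (strict decoding raises
        # UnicodeDecodeError on byte sequences that are not valid UTF-8).
        text = data.decode("utf-8")

    # The character content must be UCS-2: nothing above U+FFFF.
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("character U+%X is outside UCS-2" % ord(ch))
    return text
```

The sample strings used to exercise this helper are arbitrary and are not taken from any real .uni file.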
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Tim Lewis 7 years, 6 months ago
Mike --

This breaks our existing build tools, which assume that a file without a BOM is UTF-16. 

Tim
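
To make the compatibility concern concrete, the sketch below shows what happens when bytes written as UTF-16LE without a BOM are decoded as UTF-8. It does not model any particular build tool; the helper name `read_without_bom_as_utf8` and the sample strings are hypothetical.

```python
def read_without_bom_as_utf8(data: bytes) -> str:
    """Decode BOM-less bytes as UTF-8 (hypothetical helper)."""
    return data.decode("utf-8")


# Bytes as a legacy tool might have written them: UTF-16LE, no BOM.
legacy_ascii = "Hi".encode("utf-16-le")    # b'H\x00i\x00'
legacy_accent = "\u00e9".encode("utf-16-le")  # b'\xe9\x00'

# ASCII-only content decodes without an error, but every character is
# followed by a NUL, so the text is silently wrong.
garbled = read_without_bom_as_utf8(legacy_ascii)

# Non-ASCII content is usually not valid UTF-8 and fails outright.
try:
    read_without_bom_as_utf8(legacy_accent)
    failed = False
except UnicodeDecodeError:
    failed = True
```

Either outcome means an existing BOM-less UTF-16 file would not be read as intended under a UTF-8 assumption.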

-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Michael Kinney
Sent: Tuesday, April 25, 2017 6:07 PM
To: edk2-devel@lists.01.org
Cc: Jaben Carsey <jaben.carsey@intel.com>; Kevin W Shaw <kevin.w.shaw@intel.com>
Subject: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Carsey, Jaben 7 years, 6 months ago
Tim, 

Doesn't that assumption/behavior violate the current spec?
"All the files must begin with a Unicode BOM character."

-Jaben

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 9:15 AM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-
> devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to
> be UTF-8 without a BOM
> Importance: High
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Tim Lewis 7 years, 6 months ago
The original UNI specifications (for example, the Multi-String .UNI File Format Specification, February 2014, Revision 1.0) did not require it, and the fact is that tools accept files without the BOM happily today.

I believe that requiring the BOM is a good step forward, but assuming UTF-8 when one is not present won't help the vast quantities of existing UNI files out there.

Tim

-----Original Message-----
From: Carsey, Jaben [mailto:jaben.carsey@intel.com] 
Sent: Wednesday, April 26, 2017 10:45 AM
To: Tim Lewis <tim.lewis@insyde.com>; Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
Cc: Shaw, Kevin W <kevin.w.shaw@intel.com>
Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Kinney, Michael D 7 years, 6 months ago
Tim,

The document change request under review here is against this 1.3 spec.

Here is the document history on this topic I have been able to find. 

The Multi-String .UNI File Format Specification Version 1.3, March 2016
https://github.com/tianocore/tianocore.github.io/wiki/EDK%20II%20Specifications
has the following 2 statements in CH 2:

* The format for storing EDK II Unicode files is UTF-16LE
* All the files must begin with a Unicode BOM character.

The Multi-String .UNI File Format Specification Version 1.2 Errata A, April 2015
https://github.com/tianocore/tianocore.github.io/wiki/EDK-II-Specifications-Archived
has the same 2 statements in CH 2:

* The format for storing EDK II Unicode files is UTF-16LE
* All the files must begin with a Unicode BOM character.


The Multi-String .UNI File Format Specification, Revision 1.0, February 2014
http://cran.org.uk/edk2/docs/specs/UNI_File_Spec_1_0.pdf
has the following statement in CH 2:

* All the files must begin with the binary character, 0xFEFF (big-endian).

I do not see any versions of the .UNI spec that do not require a BOM.

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 10:53 AM
> To: Carsey, Jaben <jaben.carsey@intel.com>; Kinney, Michael D
> <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
> Cc: Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be
> UTF-8 without a BOM
> 
> The original UNI specifications (for example, the Multi-String .UNI File Format
> Specification, February 2014, Revision 1.0) did not require it, and the fact is
> that tools accept files without the BOM happily today.
> 
> I believe that requiring the BOM is a good step forward, but assuming UTF-8 when
> one is not present won't help the vast quantities of existing UNI files out
> there.
> 
> Tim
> 
> -----Original Message-----
> From: Carsey, Jaben [mailto:jaben.carsey@intel.com]
> Sent: Wednesday, April 26, 2017 10:45 AM
> To: Tim Lewis <tim.lewis@insyde.com>; Kinney, Michael D
> <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
> Cc: Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be
> UTF-8 without a BOM
> 
> Tim,
> 
> Doesn't that assumption/behavior violate the current spec?
> "All the files must begin with a Unicode BOM character."
> 
> -Jaben
> 
> > -----Original Message-----
> > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > Sent: Wednesday, April 26, 2017 9:15 AM
> > To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-
> > devel@lists.01.org
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> > disk to be UTF-8 without a BOM
> > Importance: High
> >
> > Mike --
> >
> > This breaks our existing build tools, which assume that a file without
> > a BOM is UTF-16.
> >
> > Tim
> >
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Kinney, Michael D 7 years, 6 months ago
Hi Tim,

This is not a request for a new change.  Instead, the intent of this document
change is to update the document to reflect the implemented behavior of the
EDK II tools.  The EDK II tool updates to add UTF-8 file support were completed
with the patches listed below.  Notice that the main one for normal build
support was checked in almost 2 years ago. 

BaseTools - UniClassObject - 6/23/2015
* https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253fbd5119670f75c6
* https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17b7e80b91f816eda

BaseTools - ECC - 12/29/2015
* https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88afb3faa71ddde4d6

BaseTools - UPT - 4/25/2016
* https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e9368b6b697bbf327af

This was intended to be a 100% backwards compatible change.

All .uni files in the EDK II project in UTF-16LE format have always used a BOM.
Please check out UDK2015 or older UDKs and you will see that all .uni files
start with 0xff 0xfe.
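
For illustration, the detection rule described in the patch text can be sketched
in Python. This is a minimal sketch and not the BaseTools implementation; the
function names `detect_uni_encoding` and `read_uni` are hypothetical. It assumes
only what the proposed spec wording says: a UTF-16LE file starts with the BOM
bytes 0xff 0xfe, and a UTF-8 file must not start with a BOM.

```python
import codecs


def detect_uni_encoding(data: bytes) -> str:
    """Guess the on-disk encoding of a .uni file from its leading bytes."""
    if data.startswith(codecs.BOM_UTF16_LE):  # 0xff 0xfe
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF8):  # 0xef 0xbb 0xbf
        # The proposed wording forbids a BOM on UTF-8 files.
        raise ValueError("UTF-8 .uni files must not begin with a BOM")
    # Assumption: anything without a UTF-16LE BOM is treated as UTF-8.
    return "utf-8"


def read_uni(data: bytes) -> str:
    """Decode the raw bytes of a .uni file to text, dropping the BOM."""
    encoding = detect_uni_encoding(data)
    if encoding == "utf-16-le":
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    return data.decode("utf-8")
```

Under these assumptions, a file that starts with 0xff 0xfe is decoded exactly
as before, which is why the change is meant to be backwards compatible for
files that carry a BOM.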

Thanks,

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 9:15 AM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be
> UTF-8 without a BOM
> 
> Mike --
> 
> This breaks our existing build tools, which assume that a file without a BOM is
> UTF-16.
> 
> Tim
> 
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Tim Lewis 7 years, 6 months ago
Mike --

I understand that EDK2 decided to add BOM markers two years ago. Adding a BOM didn't change the default. The problem is (a) there are still hundreds of files extant in our codebase which were created prior to the 2015 changes and are still in use, and (b) this change is not backward compatible for these files.

Tim

-----Original Message-----
From: Kinney, Michael D [mailto:michael.d.kinney@intel.com] 
Sent: Wednesday, April 26, 2017 11:11 AM
To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D <michael.d.kinney@intel.com>
Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Hi Tim,

This is not a request for a new change.  Instead, the intent of this document change is to update the document to reflect the implemented behavior of the EDK II tools.  The EDK II tool updates to add UTF-8 file support were completed with the patches listed below.  Notice that the main one for normal build support was checked in almost 2 years ago. 

BaseTools - UniClassObject - 6/23/2015
* https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253fbd5119670f75c6
* https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17b7e80b91f816eda

BaseTools - ECC - 12/29/2015
* https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88afb3faa71ddde4d6

BaseTools - UPT - 4/25/2016
* https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e9368b6b697bbf327af

This was intended to be a 100% backwards compatible change.

All .uni files in the EDK II project in UTF-16LE format have always used a BOM.
Please check out UDK2015 or older UDKs and you will see that all .uni files start with 0xff 0xfe.

Thanks,

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 9:15 AM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on 
> disk to be
> UTF-8 without a BOM
> 
> Mike --
> 
> This breaks our existing build tools, which assume that a file without 
> a BOM is UTF-16.
> 
> Tim
> 
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Kinney, Michael D 7 years, 6 months ago
Tim,

If you look at the entire file history of the EDK II, you will see 
that the BOM has always been present in the UTF-16LE formatted files.

The build tools were updated in 2015 to *add* support for UTF-8 files.
The .uni files in the EDK II project were then converted from UTF-16LE
with a BOM to UTF-8 without a BOM.  This provided an easier developer
experience when using GIT to do email patch review of .uni files.
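
As an illustration of that conversion, here is a minimal Python sketch. It is
not the script that was used on the EDK II tree; the function name
`utf16le_bom_to_utf8` is hypothetical, and it assumes the input is a complete
UTF-16LE file that starts with the 0xff 0xfe BOM.

```python
import codecs


def utf16le_bom_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16LE-with-BOM file content to UTF-8 without a BOM."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("expected a UTF-16LE file starting with 0xff 0xfe")
    # Drop the 2-byte BOM, then decode the remaining UTF-16LE code units.
    text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # Python's "utf-8" codec never writes a BOM on encode.
    return text.encode("utf-8")
```

The output is plain ASCII-compatible text for ASCII-only .uni content, which is
what makes it readable in a git diff or an email patch.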

It is possible I am missing something here.  Can you please provide 
a pointer to the EDK II commit(s) where BOMs were added to UTF-16LE 
.uni files?

Thanks,

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 11:34 AM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be
> UTF-8 without a BOM
> 
> Mike --
> 
> I understand that EDK2 decided to add BOM markers two years ago. Adding a BOM
> didn't change the default. The problem is (a) there are still hundreds of files
> extant in our codebase which were created prior to the 2015 changes and still in
> use, and (b) this change is not backward compatible for these files.
> 
> Tim
> 
> -----Original Message-----
> From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> Sent: Wednesday, April 26, 2017 11:11 AM
> To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D
> <michael.d.kinney@intel.com>
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be
> UTF-8 without a BOM
> 
> Hi Tim,
> 
> This is not a request for a new change.  Instead, the intent of this document
> change is to update the document to reflect the implemented behavior of the EDK
> II tools.  The EDK II tool updates to add UTF-8 file support were completed with
> the patches listed below.  Notice that the main one for normal build support was
> checked in almost 2 years ago.
> 
> BaseTools - UniClassObject - 6/23/2015
> * https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253fbd5119670f75c6
> * https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17b7e80b91f816eda
> 
> BaseTools - ECC - 12/29/2015
> * https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88afb3faa71ddde4d6
> 
> BaseTools - UPT - 4/25/2016
> * https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e9368b6b697bbf327af
> 
> This was intended to be a 100% backwards compatible change.
> 
> All .uni files in the EDK II project in UTF-16LE format have always used a BOM.
> Please check out UDK2015 or older UDKs and you will see that all .uni files
> start with 0xff 0xfe.
> 
> Thanks,
> 
> Mike
> 
> > -----Original Message-----
> > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > Sent: Wednesday, April 26, 2017 9:15 AM
> > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > edk2-devel@lists.01.org
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> > disk to be
> > UTF-8 without a BOM
> >
> > Mike --
> >
> > This breaks our existing build tools, which assume that a file without
> > a BOM is UTF-16.
> >
> > Tim
> >
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Tim Lewis 7 years, 6 months ago
Mike --

This is not about files in the EDK II repository. This is about files created based on the spec, and created with other sets of tools. Go back to early 2015, to the Build spec (1.22, etc.), Appendix G, which is where the UNI stuff used to live.

The point is: files which worked before, and, at worst, generated a warning before, now are interpreted incorrectly even though they have correct data.

Making ASCII (or UTF-8) the default without a BOM is the breaking change.
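
To illustrate the failure mode, here is a minimal Python sketch. It is not
taken from any of the tools discussed; the function name `misread_as_utf8` is
hypothetical. It shows what happens when UTF-16LE data written without a BOM is
decoded by a reader that treats "no BOM" as UTF-8.

```python
def misread_as_utf8(text: str) -> str:
    """Encode text as BOM-less UTF-16LE, then decode it as UTF-8."""
    raw = text.encode("utf-16-le")  # valid UTF-16LE data, no BOM written
    return raw.decode("utf-8")      # what a "no BOM means UTF-8" reader does


original = '#langdef en-US "English"'
misread = misread_as_utf8(original)

# ASCII-range content decodes without an error, but every character is
# followed by a NUL, so the result no longer matches the original text.
print(misread == original)
print(len(original), len(misread))

# Non-ASCII content usually fails outright instead of decoding silently.
try:
    misread_as_utf8("é")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

The same bytes decode correctly when read as UTF-16LE, which is the sense in
which the data is correct but the interpretation is not.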

Tim 

-----Original Message-----
From: Kinney, Michael D [mailto:michael.d.kinney@intel.com] 
Sent: Wednesday, April 26, 2017 11:47 AM
To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D <michael.d.kinney@intel.com>
Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Tim,

If you look at the entire file history of the EDK II, you will see that the BOM has always been present in the UTF-16LE formatted files.

The build tools were updated in 2015 to *add* support for UTF-8 files.
The .uni files in the EDK II project were then converted from UTF-16LE with a BOM to UTF-8 without a BOM.  This provided an easier developer experience when using GIT to do email patch review of .uni files.

It is possible I am missing something here.  Can you please provide a pointer to the EDK II commit(s) where BOMs were added to UTF-16LE .uni files?

Thanks,

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 11:34 AM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on 
> disk to be
> UTF-8 without a BOM
> 
> Mike --
> 
> I understand that EDK2 decided to add BOM markers two years ago. 
> Adding a BOM didn't change the default. The problem is (a) there are 
> still hundreds of files extant in our codebase which were created 
> prior to the 2015 changes and still in use, and (b) this change is not backward compatible for these files.
> 
> Tim
> 
> -----Original Message-----
> From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> Sent: Wednesday, April 26, 2017 11:11 AM
> To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, 
> Michael D <michael.d.kinney@intel.com>
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on 
> disk to be
> UTF-8 without a BOM
> 
> Hi Tim,
> 
> This is not a request for a new change.  Instead, the intent of this 
> document change is to update the document to reflect the implemented 
> behavior of the EDK II tools.  The EDK II tool updates to add UTF-8 
> file support were completed with the patches listed below.  Notice 
> that the main one for normal build support was checked in almost 2 years ago.
> 
> BaseTools - UniClassObject - 6/23/2015
> *
> https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253fbd5
> 119670f75c6
> *
> https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17b7e8
> 0b91f816eda
> 
> BaseTools - ECC - 12/29/2015
> *
> https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88afb3f
> aa71ddde4d6
> 
> BaseTools - UPT - 4/25/2016
> *
> https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e9368b6b
> 697bbf327af
> 
> This was intended to be a 100% backwards compatible change.
> 
> All .uni files in the EDK II project in UTF-16LE format have always use a BOM.
> Please checkout UDK2015 or older UDKs and you will see all .uni files 
> start with 0xff 0xfe.
> 
> Thanks,
> 
> Mike
> 
> > -----Original Message-----
> > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > Sent: Wednesday, April 26, 2017 9:15 AM
> > To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> > edk2-devel@lists.01.org
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files 
> > on disk to be
> > UTF-8 without a BOM
> >
> > Mike --
> >
> > This breaks our existing build tools, which assume that a file 
> > without a BOM is UTF-16.
> >
> > Tim
> >
> > -----Original Message-----
> > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf 
> > Of Michael Kinney
> > Sent: Tuesday, April 25, 2017 6:07 PM
> > To: edk2-devel@lists.01.org
> > Cc: Jaben Carsey <jaben.carsey@intel.com>; Kevin W Shaw 
> > <kevin.w.shaw@intel.com>
> > Subject: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on 
> > disk to be UTF-
> > 8 without a BOM
> >
> > https://bugzilla.tianocore.org/show_bug.cgi?id=507
> >
> > Cc: Jaben Carsey <jaben.carsey@intel.com>
> > Cc: Yonghong Zhu <yonghong.zhu@intel.com>
> > Cc: Kevin W Shaw <kevin.w.shaw@intel.com>
> > Contributed-under: TianoCore Contribution Agreement 1.1
> > Signed-off-by: Michael Kinney <michael.d.kinney@intel.com>
> > ---
> >  2_unicode_strings_file_format.md |  9 ++++++---
> >  README.md                        | 27 ++++++++++++++-------------
> >  2 files changed, 20 insertions(+), 16 deletions(-)
> >
> > diff --git a/2_unicode_strings_file_format.md
> > b/2_unicode_strings_file_format.md
> > index 0150c85..7a4a019 100644
> > --- a/2_unicode_strings_file_format.md
> > +++ b/2_unicode_strings_file_format.md
> > @@ -33,7 +33,8 @@
> >
> >  EDK II Unicode files are used for mapping token names to localized 
> > strings that are identified by an RFC4646 language code. The format 
> > for storing EDK II - Unicode files is UTF-16LE. The character 
> > content must be
> UCS-2.
> > +Unicode files on disk is UTF-8 (without a BOM character) or 
> > +UTF-16LE (with a BOM character). The character content must be UCS-2.
> >
> >  Strings ends are determined by the first of the following items found:
> >
> > @@ -44,11 +45,13 @@ Strings ends are determined by the first of the 
> > following items found:
> >
> >  Comments may appear anywhere within the string file.
> >
> > -All the files must begin with a Unicode BOM character.
> > +All UTF-16LE files must begin with a Unicode BOM character.
> > +All UTF-8 files must not begin with a Unicode BOM character.
> >
> >  **********
> >  **NOTE:** Please make sure you select an editor that supports UCS-2 
> > characters - that can be stored in a UTF-16LE file.
> > +that can be stored in either a UTF-8 (without a BOM character) or a 
> > +UTF-16LE file (with a BOM character).
> >  **********
> >
> >  ## 2.1 Common EBNF
> > diff --git a/README.md b/README.md
> > index 63842a1..015aef1 100644
> > --- a/README.md
> > +++ b/README.md
> > @@ -77,16 +77,17 @@ Copyright (c) 2016-2017, Intel Corporation. All 
> > rights reserved.
> >
> >  ### Revision History
> >
> > -| Revision          | Description
> > | Date            |
> > -| ----------------- |
> > -| ----------------------------------------------------------
> > ------------------------------ | --------------- |
> > -| 1.0               | Initial Release.
> > | February 2014   |
> > -| 1.1               | Updated EBNF to follow syntax specified in EBNF by the
> > ANTLR project.                    | August 2014     |
> > -|                   | Added content related to EDK II Meta-Data Unicode files.
> > |                 |
> > -|                   | Restructured document.
> > |                 |
> > -|                   | Removed security and C format GUID 
> > -| definitions, not
> > required for HII or other UNI files. |                 |
> > -|                   | Removed invalid escape code sequences.
> > |                 |
> > -| 1.2               | Added optional font formatting
> > | September 2014  |
> > -| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`
> > | April 2015      |
> > -| 1.3               | Added: Syntax for non-ascii characters inside quoted
> > strings.                            | March 2016      |
> > -|                   | Removed: Info on specific consumers (.INF & 
> > -| .DEC)
> removed.
> > |                 |
> > -| 1.4               | Convert to GitBook format
> > | March 2017      |
> > +| Revision          | Description
> > | Date            |
> > +| ----------------- |
> > +| ----------------------------------------------------------
> > ------------------------------------------------------------ |
> > --------------- |
> > +| 1.0               | Initial Release.
> > | February 2014   |
> > +| 1.1               | Updated EBNF to follow syntax specified in EBNF by the
> > ANTLR project.                                                  | August 2014
> > |
> > +|                   | Added content related to EDK II Meta-Data Unicode files.
> > |                 |
> > +|                   | Restructured document.
> > |                 |
> > +|                   | Removed security and C format GUID 
> > +| definitions, not
> > required for HII or other UNI files.                               |
> > |
> > +|                   | Removed invalid escape code sequences.
> > |                 |
> > +| 1.2               | Added optional font formatting
> > | September 2014  |
> > +| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`
> > | April 2015      |
> > +| 1.3               | Added: Syntax for non-ascii characters inside quoted
> > strings.                                                          | March 2016
> > |
> > +|                   | Removed: Info on specific consumers (.INF & 
> > +| .DEC)
> removed.
> > |                 |
> > +| 1.4               | Convert to GitBook format
> > | April 2017      |
> > +|                   |
> > +| [#507](https://bugzilla.tianocore.org/show_bug.cgi?id=507)
> > UNI Spec: Clarify that .uni files maybe UTF-8 without a BOM |                 |
> > --
> > 2.6.3.windows.1
> >
> > _______________________________________________
> > edk2-devel mailing list
> > edk2-devel@lists.01.org
> > https://lists.01.org/mailman/listinfo/edk2-devel
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Kinney, Michael D 7 years, 6 months ago
Hi Tim,

The recommendation for UTF-8 usage is to not use a BOM, which is why
no BOM for UTF-8 was selected for EDK II.

The current task is to update docs to match the current tool behavior.
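
As a rough illustration of the rule the spec change documents (UTF-16LE with a BOM, otherwise UTF-8 without a BOM), here is a minimal Python sketch. It is not the BaseTools code; the names `detect_uni_encoding` and `decode_uni` are made up for this example.

```python
import codecs


def detect_uni_encoding(data: bytes) -> str:
    """Pick a codec from the first bytes of a .uni file (illustrative rule)."""
    if data.startswith(codecs.BOM_UTF16_LE):  # 0xFF 0xFE
        return "utf-16-le"
    # No UTF-16LE BOM: treat the content as UTF-8 without a BOM.
    return "utf-8"


def decode_uni(data: bytes) -> str:
    """Decode file content according to detect_uni_encoding()."""
    encoding = detect_uni_encoding(data)
    if encoding == "utf-16-le":
        data = data[len(codecs.BOM_UTF16_LE):]  # strip the BOM itself
    return data.decode(encoding)
```

Under this rule, UTF-16LE content that lacks a BOM is decoded as UTF-8. For ASCII-range text that decode does not fail; it yields the characters interleaved with NUL characters, which is the kind of misinterpretation raised in this thread.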

The EDK II repos on GitHub have .uni files in UTF-8 format without
a BOM to support easier patch review.

There are ways to use Git features to auto-convert .uni files
when pulling content from EDK II repos and pushing commits.
That may or may not help with the specific issue you are raising.

If you have ideas on a tool change request to EDK II that would 
provide compatibility with current EDK II tool behavior and support
UTF-16LE without a BOM, then let's work that through in a Bugzilla
feature request.  If we find a solution, we can update the docs and
tools again.
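
One possible starting point for such a request is a content heuristic that runs only when no BOM is present. The Python sketch below is an assumption for discussion, not an existing EDK II tool feature; the function name, the sample size, and the 0.5 threshold are all arbitrary choices.

```python
def looks_like_bomless_utf16le(data: bytes, sample_size: int = 256) -> bool:
    """Guess whether BOM-less bytes are UTF-16LE rather than UTF-8.

    Hypothetical heuristic: in UTF-16LE, every ASCII character is stored
    as (byte, 0x00), while UTF-8 text normally contains no NUL bytes.
    """
    sample = data[:sample_size]
    pairs = len(sample) // 2
    if pairs == 0:
        return False
    # Count 16-bit units whose high-order (second) byte is zero.
    zero_high = sum(1 for i in range(pairs) if sample[2 * i + 1] == 0)
    return zero_high / pairs > 0.5
```

A heuristic like this can misfire on files whose leading content is mostly non-Latin text, so it would need review before being relied on.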

Do you have any objections to updating the UNI Spec to match the 
current tool behavior?

Thanks,

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 11:54 AM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be
> UTF-8 without a BOM
> 
> Mike --
> 
> This is not about files in the EDK II repository. This is about files created
> based on the spec, and created with other sets of tools. Go back to early 2015,
> to the Build spec (1.22, etc.), Appendix G, which is where the UNI stuff used to
> live.
> 
> The point is: files which worked before, and, at worst, generated a warning
> before, now are interpreted incorrectly even though they have correct data.
> 
> Making ASCII (or UTF-8) the default without a BOM is the breaking change.
> 
> Tim
> 
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Tim Lewis 7 years, 6 months ago
Mike --

I would prefer to update the docs to match actual industry practice. EDK2 is not the universe. 

Insyde has been using UNI files since well before my time here (> 5 years). The fact that recent specifications or EDK2 tools (2 years) added BOM support does not remove the backward compatibility issue.

The Unicode specification's "not recommended" refers specifically to using the BOM for byte-order determination. The full sentence (from 2.6) is: "Use of a BOM is neither required nor recommended [for byte order determination] for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." Editorial comment mine. In this case, the BOM marker would appear as a UTF-8 signature. This would distinguish it from ASCII or any of the multi-byte encoding schemes used.
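
To make the signature idea concrete, here is a minimal Python sketch that classifies a file by its leading bytes. It illustrates the suggestion above only; it is neither the behavior of the EDK II tools nor what the proposed spec text allows (which forbids a BOM on UTF-8 files). The function name and return labels are hypothetical.

```python
import codecs


def classify_by_signature(data: bytes) -> str:
    """Classify content by a leading BOM/signature, if any (illustrative)."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF, UTF-8 signature
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    # No signature: the caller must apply some default for unmarked files.
    return "unmarked"
```

With an explicit signature on UTF-8 files, the "unmarked" case is the only one where a default has to be chosen.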

Tim 

-----Original Message-----
From: Kinney, Michael D [mailto:michael.d.kinney@intel.com] 
Sent: Wednesday, April 26, 2017 3:47 PM
To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D <michael.d.kinney@intel.com>
Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Hi Tim,

The recommendation for UTF-8 usage is to not use a BOM, which is why no BOM for UTF-8 was selected for EDK II.

The current task is to update docs to match the current tool behavior.

The EDK II repos on GitHub have .uni files in UTF-8 format without a BOM to support easier patch review.

There are ways to use Git features to auto-convert .uni files when pulling content from EDK II repos and pushing commits.  That may or may not help with the specific issue you are raising.

If you have ideas on a tool change request to EDK II that would provide compatibility with current EDK II tool behavior and support UTF-16LE without a BOM, then let's work that through in a Bugzilla feature request.  If we find a solution, we can update the docs and tools again.

Do you have any objections to updating the UNI Spec to match the current tool behavior?

Thanks,

Mike

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Kinney, Michael D 7 years, 6 months ago
Hi Tim,

For UTF-16 files on disk with no BOM, do you follow the
big-endian assumption as documented in the Unicode
Specification Section 3.10, D98?

http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf

Mike
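
To make the byte-order question concrete, here is a minimal Python sketch of how the same BOM-less bytes read under each assumption. The sample string is invented for illustration and is not taken from any real .uni file; the only claim relied on is the one in the question above, that a BOM-less UTF-16 stream with no higher-level protocol is read as big-endian.

```python
# Illustration only: the sample text is made up, not from a real .uni file.
sample = "#langdef en-US"

# Encode as UTF-16LE without a BOM, the layout a legacy BOM-less file would have.
raw = sample.encode("utf-16-le")

# Little-endian reading (for example, implied by the .uni extension) recovers the text.
as_le = raw.decode("utf-16-le")

# Big-endian reading (the default described in the question when there is no BOM
# and no higher-level protocol) swaps the two bytes of every code unit.
as_be = raw.decode("utf-16-be")

print(as_le == sample)      # True
print(as_be == sample)      # False
print(hex(ord(as_be[0])))   # 0x2300, the byte-swapped form of '#' (0x0023)
```

The bytes are identical in both cases; only the assumed byte order differs, which is why the answer to the question decides whether such files can be read correctly.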

Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Tim Lewis 7 years, 6 months ago
Mike --

No, the meta-data (in this case, the file extension .uni) was used by tools to determine the format of the file contents, as described in section 2.6 of the Unicode Standard. Little-endian UCS-2 was assumed.

"When a higher-level protocol supplies mechanisms for handling the endianness of integral data types, it is not necessary to use Unicode encoding schemes or the byte order mark. In those cases Unicode text is simply a sequence of integral data types."

Of course, the tools had to be updated to accommodate different build systems, and even alternate encodings. But this doesn't remove the previous behavior.
 
Tim
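
The two policies discussed in this thread can be sketched side by side in Python. This is a hypothetical helper, not the BaseTools implementation: the function name `decode_uni`, the `legacy_bomless_utf16le` flag, and the sample string are all assumptions made for illustration. With the flag off it follows the rule in the patch (UTF-16LE requires a BOM, otherwise the file is read as UTF-8); with the flag on it follows the older assumption that a .uni file without a BOM is little-endian UCS-2.

```python
import codecs


def decode_uni(data: bytes, legacy_bomless_utf16le: bool = False) -> str:
    """Hypothetical sketch of .uni decoding; not the actual EDK II tool code."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # 0xFF 0xFE prefix: UTF-16LE with a BOM, accepted under both policies.
        text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif legacy_bomless_utf16le:
        # Older assumption: the .uni extension alone implies little-endian UCS-2.
        text = data.decode("utf-16-le")
    else:
        # Rule in the patch: no BOM means the file is UTF-8.
        text = data.decode("utf-8")
    # The spec text requires the character content to be UCS-2.
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside UCS-2 range")
    return text


# Made-up sample content for illustration.
sample = '#string STR_HELLO #language en-US "Hello"\n'
with_bom = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
legacy = sample.encode("utf-16-le")  # UTF-16LE without a BOM

print(decode_uni(with_bom) == sample)                            # True
print(decode_uni(sample.encode("utf-8")) == sample)              # True
print(decode_uni(legacy, legacy_bomless_utf16le=True) == sample) # True

# Under the patch rule the legacy bytes still decode without error, because
# ASCII interleaved with NUL bytes is valid UTF-8, but the result is wrong.
misread = decode_uni(legacy)
print(misread == sample)    # False
print("\x00" in misread)    # True
```

The last case is the compatibility concern raised earlier in the thread: a BOM-less UTF-16LE file with ASCII content is not rejected under the new rule, it is silently read as different text.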



Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Tim Lewis 7 years, 6 months ago
Mike --

After an internal review, we have found that there are fewer files than previously thought affected by this change. 

So we have no objections to updating the UNI Spec to match the current EDK2 tool behavior.

Thanks,

Tim
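
For anyone repeating that kind of review on their own tree, a minimal Python sketch of a scan follows. It relies on a heuristic that is an assumption of this note, not something taken from the thread or from the EDK II tools: a .uni file that lacks the 0xFF 0xFE BOM yet contains NUL bytes is probably BOM-less UTF-16 and would be misread under the updated rule. The names `has_nul_without_le_bom` and `find_affected` are hypothetical.

```python
from pathlib import Path

UTF16LE_BOM = b"\xff\xfe"


def has_nul_without_le_bom(data: bytes) -> bool:
    """Heuristic: True when there is no UTF-16LE BOM but NUL bytes appear.

    UTF-8 text normally contains no NUL bytes, whereas UTF-16 text made of
    ASCII or Latin-1 characters has one NUL byte per character.
    """
    return not data.startswith(UTF16LE_BOM) and b"\x00" in data


def find_affected(root: str) -> list[Path]:
    """Return the .uni files under root that the heuristic flags."""
    return sorted(
        path
        for path in Path(root).rglob("*.uni")
        if path.is_file() and has_nul_without_le_bom(path.read_bytes())
    )
```

Calling `find_affected(".")` from the top of a source tree lists candidate files; each hit still needs a manual look, since the check is only a heuristic.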

-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Tim Lewis
Sent: Wednesday, April 26, 2017 5:27 PM
To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
Subject: Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Mike --

No, the meta-data (in this case, file extension .uni) was used by tools to determine the format of the file contents, as described in section 2.6. Little-endian, UCS-2 was assumed.

"When a higher-level protocol supplies mechanisms for handling the endianness of integral data types, it is not necessary to use Unicode encoding schemes or the byte order mark. In those cases Unicode text is simply a sequence of integral data types."

Of course, the tools had to be updated to accommodate different build systems, and even alternate encodings. But this doesn't remove the previous behavior.
 
Tim



-----Original Message-----
From: Kinney, Michael D [mailto:michael.d.kinney@intel.com] 
Sent: Wednesday, April 26, 2017 5:02 PM
To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D <michael.d.kinney@intel.com>
Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Hi Tim,

For UTF-16 files on disk with no BOM, do you follow the big-endian assumption as documented in the Unicode Specification Section 3.10, D98?

http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf

Mike
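
To make the byte-order question concrete, here is a small Python illustration (an editorial sketch, not part of the original exchange). The same BOM-less bytes yield different text depending on which endianness the reader assumes. The sample string "Hi" is arbitrary.

```python
# UTF-16LE bytes for "Hi" with no BOM in front: 48 00 69 00
raw = "Hi".encode("utf-16-le")

# A reader that assumes little-endian recovers the intended text.
as_le = raw.decode("utf-16-le")

# A reader that assumes big-endian pairs the bytes the other way round
# and gets two unrelated code points, U+4800 and U+6900.
as_be = raw.decode("utf-16-be")

print(repr(as_le))
print([f"U+{ord(ch):04X}" for ch in as_be])
```

Both decodes succeed without error, which is why a wrong endianness assumption tends to show up as garbage strings rather than as a build failure.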

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 4:13 PM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on 
> disk to be
> UTF-8 without a BOM
> 
> Mike --
> 
> I would prefer to update the docs to match actual industry practice. 
> EDK2 is not the universe.
> 
> Insyde has been using UNI files well before my time here (> 5 years). 
> The fact that recent specifications or EDK2 tools (2 years) added BOM 
> support does not remove the backward compatibility issue.
> 
> The Unicode specification usage of "not recommended" is referring 
> specifically to its usage for byte-order. The full sentence (from 2.6) 
> is: "Use of a BOM is neither required nor recommended [for byte order 
> determination] for UTF-8, but may be encountered in contexts where 
> UTF-8 data is converted from other encoding forms that use a BOM or 
> where the BOM is used as a UTF-8 signature" Editorial comment mine. 
> In this case, the BOM marker would appear as a UTF-8 signature.
> This would distinguish it from ASCII or any of the multi-byte encoding 
> schemes used.
> 
> Tim
> 
> -----Original Message-----
> From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> Sent: Wednesday, April 26, 2017 3:47 PM
> To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, 
> Michael D <michael.d.kinney@intel.com>
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on 
> disk to be
> UTF-8 without a BOM
> 
> Hi Tim,
> 
> The recommendation for UTF-8 usage is to not use a BOM, which is why 
> no BOM for UTF-8 was selected for EDK II.
> 
> The current task is to update docs to match the current tool behavior.
> 
> The EDK II repos on GitHub have .uni files in UTF-8 format without a 
> BOM to support easier patch review.
> 
> There are ways to use GIT features to auto-convert .uni files when 
> pulling content from EDK II repos and pushing commits.
> That may or may not help with the specific issue you are raising.
> 
> If you have ideas on a tool change request to EDK II that would 
> provide compatibility with current EDK II tool behavior and support 
> UTF-16LE without a BOM, then let's work that through in a Bugzilla 
> feature request.  If we find a solution, we can update the docs and tools again.
> 
> Do you have any objections to updating the UNI Spec to match the 
> current tool behavior?
> 
> Thanks,
> 
> Mike
> 
> > -----Original Message-----
> > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > Sent: Wednesday, April 26, 2017 11:54 AM
> > To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> > edk2-devel@lists.01.org
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files 
> > on disk to be
> > UTF-8 without a BOM
> >
> > Mike --
> >
> > This is not about files in the EDK II repository. This is about 
> > files created based on the spec, and created with other sets of 
> > tools. Go back to early 2015, to the Build spec (1.22, etc.), 
> > Appendix G, which is where the UNI stuff used to live.
> >
> > The point is: files which worked before, and, at worst, generated a 
> > warning before, now are interpreted incorrectly even though they 
> > have correct data.
> >
> > Making ASCII (or UTF-8) the default without a BOM is the breaking change.
> >
> > Tim
> >
> > -----Original Message-----
> > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > Sent: Wednesday, April 26, 2017 11:47 AM
> > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; 
> > Kinney, Michael D <michael.d.kinney@intel.com>
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files 
> > on disk to be
> > UTF-8 without a BOM
> >
> > Tim,
> >
> > If you look at the entire file history of the EDK II, you will see 
> > that the BOM has always been present in the UTF-16LE formatted files.
> >
> > The build tools were updated in 2015 to *add* support for UTF-8 files.
> > The .uni files in the EDK II project were then converted from 
> > UTF-16LE with a BOM to UTF-8 without a BOM.  This provided an easier 
> > developer experience when using GIT to do email patch review of .uni files.
> >
> > It is possible I am missing something here.  Can you please provide 
> > a pointer to the EDK II commit(s) where BOMs were added to UTF-16LE .uni files?
> >
> > Thanks,
> >
> > Mike
> >
> > > -----Original Message-----
> > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > Sent: Wednesday, April 26, 2017 11:34 AM
> > > To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> > > edk2-devel@lists.01.org
> > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > > <kevin.w.shaw@intel.com>
> > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files 
> > > on disk to be
> > > UTF-8 without a BOM
> > >
> > > Mike --
> > >
> > > I understand that EDK2 decided to add BOM markers two years ago.
> > > Adding a BOM didn't change the default. The problem is (a) there 
> > > are still hundreds of files extant in our codebase which were 
> > > created prior to the 2015 changes and still in use, and (b) this 
> > > change is not backward compatible for these files.
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > > Sent: Wednesday, April 26, 2017 11:11 AM
> > > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; 
> > > Kinney, Michael D <michael.d.kinney@intel.com>
> > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > > <kevin.w.shaw@intel.com>
> > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files 
> > > on disk to be
> > > UTF-8 without a BOM
> > >
> > > Hi Tim,
> > >
> > > This is not a request for a new change.  Instead, the intent of 
> > > this document change is to update the document to reflect the 
> > > implemented behavior of the EDK II tools.  The EDK II tool updates 
> > > to add UTF-8 file support were completed with the patches listed 
> > > below.  Notice that the main one for normal build support was checked in almost 2 years ago.
> > >
> > > BaseTools - UniClassObject - 6/23/2015
> > > * https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253fbd5119670f75c6
> > > * https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17b7e80b91f816eda
> > >
> > > BaseTools - ECC - 12/29/2015
> > > * https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88afb3faa71ddde4d6
> > >
> > > BaseTools - UPT - 4/25/2016
> > > * https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e9368b6b697bbf327af
> > >
> > > This was intended to be a 100% backwards compatible change.
> > >
> > > All .uni files in the EDK II project in UTF-16LE format have 
> > > always used a BOM.
> > > Please check out UDK2015 or older UDKs and you will see all .uni 
> > > files start with 0xff 0xfe.
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > > > -----Original Message-----
> > > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > > Sent: Wednesday, April 26, 2017 9:15 AM
> > > > To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> > > > edk2-devel@lists.01.org
> > > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > > > <kevin.w.shaw@intel.com>
> > > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni 
> > > > files on disk to be
> > > > UTF-8 without a BOM
> > > >
> > > > Mike --
> > > >
> > > > This breaks our existing build tools, which assume that a file 
> > > > without a BOM is UTF-16.
> > > >
> > > > Tim
> > > >
> > > > -----Original Message-----
> > > > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On 
> > > > Behalf Of Michael Kinney
> > > > Sent: Tuesday, April 25, 2017 6:07 PM
> > > > To: edk2-devel@lists.01.org
> > > > Cc: Jaben Carsey <jaben.carsey@intel.com>; Kevin W Shaw 
> > > > <kevin.w.shaw@intel.com>
> > > > Subject: [edk2] [edk2-UniSpecification PATCH] Allow .uni files 
> > > > on disk to be UTF-8 without a BOM
> > > >
> > > > https://bugzilla.tianocore.org/show_bug.cgi?id=507
> > > >
> > > > Cc: Jaben Carsey <jaben.carsey@intel.com>
> > > > Cc: Yonghong Zhu <yonghong.zhu@intel.com>
> > > > Cc: Kevin W Shaw <kevin.w.shaw@intel.com>
> > > > Contributed-under: TianoCore Contribution Agreement 1.1
> > > > Signed-off-by: Michael Kinney <michael.d.kinney@intel.com>
> > > > ---
> > > >  2_unicode_strings_file_format.md |  9 ++++++---
> > > >  README.md                        | 27 ++++++++++++++-------------
> > > >  2 files changed, 20 insertions(+), 16 deletions(-)
> > > >
> > > > diff --git a/2_unicode_strings_file_format.md b/2_unicode_strings_file_format.md
> > > > index 0150c85..7a4a019 100644
> > > > --- a/2_unicode_strings_file_format.md
> > > > +++ b/2_unicode_strings_file_format.md
> > > > @@ -33,7 +33,8 @@
> > > >
> > > >  EDK II Unicode files are used for mapping token names to localized strings that
> > > >  are identified by an RFC4646 language code. The format for storing EDK II
> > > > -Unicode files is UTF-16LE. The character content must be UCS-2.
> > > > +Unicode files on disk is UTF-8 (without a BOM character) or UTF-16LE (with a BOM
> > > > +character). The character content must be UCS-2.
> > > >
> > > >  Strings ends are determined by the first of the following items found:
> > > >
> > > > @@ -44,11 +45,13 @@ Strings ends are determined by the first of the following items found:
> > > >
> > > >  Comments may appear anywhere within the string file.
> > > >
> > > > -All the files must begin with a Unicode BOM character.
> > > > +All UTF-16LE files must begin with a Unicode BOM character.
> > > > +All UTF-8 files must not begin with a Unicode BOM character.
> > > >
> > > >  **********
> > > >  **NOTE:** Please make sure you select an editor that supports UCS-2 characters
> > > > -that can be stored in a UTF-16LE file.
> > > > +that can be stored in either a UTF-8 (without a BOM character) 
> > > > +or a UTF-16LE file (with a BOM character).
> > > >  **********
> > > >
> > > >  ## 2.1 Common EBNF
> > > > diff --git a/README.md b/README.md
> > > > index 63842a1..015aef1 100644
> > > > --- a/README.md
> > > > +++ b/README.md
> > > > @@ -77,16 +77,17 @@ Copyright (c) 2016-2017, Intel Corporation. All rights reserved.
> > > >
> > > >  ### Revision History
> > > >
> > > > -| Revision          | Description | Date            |
> > > > -| ----------------- | ---------------------------------------------------------------------------------------- | --------------- |
> > > > -| 1.0               | Initial Release. | February 2014   |
> > > > -| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project. | August 2014     |
> > > > -|                   | Added content related to EDK II Meta-Data Unicode files. |                 |
> > > > -|                   | Restructured document. |                 |
> > > > -|                   | Removed security and C format GUID definitions, not required for HII or other UNI files. |                 |
> > > > -|                   | Removed invalid escape code sequences. |                 |
> > > > -| 1.2               | Added optional font formatting | September 2014  |
> > > > -| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME` | April 2015      |
> > > > -| 1.3               | Added: Syntax for non-ascii characters inside quoted strings. | March 2016      |
> > > > -|                   | Removed: Info on specific consumers (.INF & .DEC) removed. |                 |
> > > > -| 1.4               | Convert to GitBook format | March 2017      |
> > > > +| Revision          | Description | Date            |
> > > > +| ----------------- | ---------------------------------------------------------------------------------------------------------------------- | --------------- |
> > > > +| 1.0               | Initial Release. | February 2014   |
> > > > +| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project. | August 2014     |
> > > > +|                   | Added content related to EDK II Meta-Data Unicode files. |                 |
> > > > +|                   | Restructured document. |                 |
> > > > +|                   | Removed security and C format GUID definitions, not required for HII or other UNI files. |                 |
> > > > +|                   | Removed invalid escape code sequences. |                 |
> > > > +| 1.2               | Added optional font formatting | September 2014  |
> > > > +| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME` | April 2015      |
> > > > +| 1.3               | Added: Syntax for non-ascii characters inside quoted strings. | March 2016      |
> > > > +|                   | Removed: Info on specific consumers (.INF & .DEC) removed. |                 |
> > > > +| 1.4               | Convert to GitBook format | April 2017      |
> > > > +|                   | [#507](https://bugzilla.tianocore.org/show_bug.cgi?id=507) UNI Spec: Clarify that .uni files maybe UTF-8 without a BOM |                 |
> > > > --
> > > > 2.6.3.windows.1
> > > >
> > > > _______________________________________________
> > > > edk2-devel mailing list
> > > > edk2-devel@lists.01.org
> > > > https://lists.01.org/mailman/listinfo/edk2-devel
Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Posted by Kinney, Michael D 7 years, 6 months ago
Tim,

Thanks for the additional review on this topic.

I will push the UNI spec update.

Mike
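
The rule the updated spec text describes can be sketched in Python as below. This is an illustration only, not the BaseTools implementation. `decode_uni` is a hypothetical name, and rejecting a UTF-8 BOM outright is an assumption based on the "must not begin with a Unicode BOM character" wording in the patch.

```python
UTF16LE_BOM = b"\xff\xfe"
UTF8_BOM = b"\xef\xbb\xbf"


def decode_uni(data: bytes) -> str:
    """Decode .uni file bytes: UTF-16LE with a BOM, or UTF-8 without one."""
    if data.startswith(UTF16LE_BOM):
        # BOM present: treat the remainder as UTF-16LE.
        text = data[len(UTF16LE_BOM):].decode("utf-16-le")
    elif data.startswith(UTF8_BOM):
        # Assumption: a UTF-8 signature is treated as an error.
        raise ValueError("UTF-8 .uni files must not begin with a BOM")
    else:
        # No BOM: the content must be valid UTF-8.
        text = data.decode("utf-8")

    # The character content must be UCS-2, so nothing above U+FFFF.
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} is outside UCS-2")
    return text
```

Under this rule a BOM-less UTF-16LE file takes the UTF-8 branch, where it either fails to decode or decodes to text interleaved with NUL characters. That is the compatibility concern raised earlier in the thread.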

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Friday, April 28, 2017 9:48 AM
> To: Tim Lewis <tim.lewis@insyde.com>; Kinney, Michael D <michael.d.kinney@intel.com>;
> edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8
> without a BOM
> 
> Mike --
> 
> After an internal review, we have found that there are fewer files than previously
> thought affected by this change.
> 
> So we have no objections to updating the UNI Spec to match the current EDK2 tool
> behavior?
> 
> Thanks,
> 
> Tim
> 
> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Tim Lewis
> Sent: Wednesday, April 26, 2017 5:27 PM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8
> without a BOM
> 
> Mike --
> 
> No, the meta-data (in this case, file extension .uni) was used by tools to determine
> the format of the file contents, as described in section 2.6. Little-endian, UCS-2 was
> assumed.
> 
> "When a higher-level protocol supplies mechanisms for handling the endianness of
> integral data types, it is not necessary to use Unicode encoding schemes or the byte
> order mark. In those cases Unicode text is simply a sequence of integral data types."
> 
> Of course, the tools had to be updated to accommodate different build systems, and
> even alternate encodings. But this doesn't remove the previous behavior.
> 
> Tim
> 
> 
> 
> -----Original Message-----
> From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> Sent: Wednesday, April 26, 2017 5:02 PM
> To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D
> <michael.d.kinney@intel.com>
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8
> without a BOM
> 
> Hi Tim,
> 
> For UTF-16 files on disk with no BOM, do you follow the big-endian assumption as
> documented in the Unicode Specification Section 3.10, D98?
> 
> http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf
> 
> Mike
> 
> > -----Original Message-----
> > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > Sent: Wednesday, April 26, 2017 4:13 PM
> > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > edk2-devel@lists.01.org
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> > disk to be
> > UTF-8 without a BOM
> >
> > Mike --
> >
> > I would prefer to update the docs to match actual industry practice.
> > EDK2 is not the universe.
> >
> > Insyde has been using UNI files well before my time here (> 5 years).
> > The fact that recent specifications or EDK2 tools (2 years) added BOM
> > support it does not remove the backward compatibility issue.
> >
> > The Unicode specification usage of "not recommended" is referring
> > specifically to its usage for byte-order. The full sentence (from 2.6)
> > is: "Use of a BOM is neither required nor recommended [for byte order
> > determination] for UTF-8, but may be encountered in contexts where
> > UTF-8 data is converted from other encoding forms that use a BOM or
> > where the BOM is used as a UTF-8 signature" Editorial comment mine. In this case,
> the BOM marker would appear as a UTF-8 signature.
> > This would distinguish it from ASCII or any of the multi-byte encoding
> > schemes used.
> >
> > Tim
> >
> > -----Original Message-----
> > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > Sent: Wednesday, April 26, 2017 3:47 PM
> > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney,
> > Michael D <michael.d.kinney@intel.com>
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> > disk to be
> > UTF-8 without a BOM
> >
> > Hi Tim,
> >
> > The recommendation for UTF-8 usage is to not use a BOM, which is why
> > no BOM for
> > UTF-8 was selected for EDK II.
> >
> > The current task is to update docs to match the current tool behavior.
> >
> > The EDK II repos on GitHub have .uni files in UTF-8 format without a
> > BOM to support easier patch review.
> >
> > There are ways to use GIT features to auto-convert .uni files when
> > pulling content from EDK II repos and pushing commits.
> > That may or may not help with the specific issue you are raising.
> >
> > If you have ideas on a tool change request to EDK II that would
> > provide compatibility with current EDK II tool behavior and support
> > UTF-16LE without a BOM, then let's work that through in a Bugzilla
> > feature request.  If we find a solution, we can update the docs and tools again.
> >
> > Do you have any objections to updating the UNI Spec to match the
> > current tool behavior?
> >
> > Thanks,
> >
> > Mike
> >
> > > -----Original Message-----
> > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > Sent: Wednesday, April 26, 2017 11:54 AM
> > > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > > edk2-devel@lists.01.org
> > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > <kevin.w.shaw@intel.com>
> > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > on disk to be
> > > UTF-8 without a BOM
> > >
> > > Mike --
> > >
> > > This is not about files in the EDK II repository. This is about
> > > files created based on the spec, and created with other sets of
> > > tools. Go back to early 2015, to the Build spec (1.22, etc.),
> > > Appendix G, which is where the UNI stuff used to live.
> > >
> > > The point is: files which worked before, and, at worst, generated a
> > > warning before, now are interpreted incorrectly even though they
> > > have correct
> > data.
> > >
> > > Making ASCII (or UTF-8) the default without a BOM is the breaking change.
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > > Sent: Wednesday, April 26, 2017 11:47 AM
> > > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org;
> > > Kinney, Michael D <michael.d.kinney@intel.com>
> > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > <kevin.w.shaw@intel.com>
> > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > on disk to be
> > > UTF-8 without a BOM
> > >
> > > Tim,
> > >
> > > If you look at the entire file history of the EDK II, you will see
> > > that the BOM has always been present in the UTF-16LE formatted files.
> > >
> > > The build tools were updated in 2015 to *add* support for UTF-8 file.
> > > The .uni files in the EDK II project were then converted from
> > > UTF-16LE with a BOM to UTF-8 without a BOM.  This provided an easier
> > > developer experience when using GIT to do email patch review of .uni files.
> > >
> > > It is possible I am missing something here.  Can you please provide
> > > a pointer to the EDK II commit(s) where BOMs were added to UTF-16LE .uni files.
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > > > -----Original Message-----
> > > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > > Sent: Wednesday, April 26, 2017 11:34 AM
> > > > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > > > edk2-devel@lists.01.org
> > > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > > <kevin.w.shaw@intel.com>
> > > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > > on disk to be
> > > > UTF-8 without a BOM
> > > >
> > > > Mike --
> > > >
> > > > I understand that EDK2 has decided to add BOM markers two years ago.
> > > > Adding a BOM didn't change the default. The problem is (a) there
> > > > are still hundreds of files extant in our codebase which were
> > > > created prior to the 2015 changes and still in use, and (b) this
> > > > change is not backward
> > > compatible for these files.
> > > >
> > > > Tim
> > > >
> > > > -----Original Message-----
> > > > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > > > Sent: Wednesday, April 26, 2017 11:11 AM
> > > > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org;
> > > > Kinney, Michael D <michael.d.kinney@intel.com>
> > > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > > <kevin.w.shaw@intel.com>
> > > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > > on disk to be
> > > > UTF-8 without a BOM
> > > >
> > > > Hi Tim,
> > > >
> > > > This is not a request for a new change.  Instead, the intent of
> > > > this document change is to update the document to reflect the
> > > > implemented behavior of the EDK II tools.  The EDK II tool updates
> > > > to add UTF-8 file support were completed with the patches listed
> > > > below.  Notice that the main one for normal build support was checked in almost
> 2 years ago.
> > > >
> > > > BaseTools - UniClassObject - 6/23/2015
> > > > *
> > > > https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253
> > > > fb
> > > > d5
> > > > 119670f75c6
> > > > *
> > > > https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17
> > > > b7
> > > > e8
> > > > 0b91f816eda
> > > >
> > > > BaseTools - ECC - 12/29/2015
> > > > *
> > > > https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88a
> > > > fb
> > > > 3f
> > > > aa71ddde4d6
> > > >
> > > > BaseTools - UPT - 4/25/2016
> > > > *
> > > > https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e936
> > > > 8b
> > > > 6b
> > > > 697bbf327af
> > > >
> > > > This was intended to be a 100% backwards compatible change.
> > > >
> > > > All .uni files in the EDK II project in UTF-16LE format have
> > > > always use a
> > BOM.
> > > > Please checkout UDK2015 or older UDKs and you will see all .uni
> > > > files start with 0xff 0xfe.
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > > > > -----Original Message-----
> > > > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > > > Sent: Wednesday, April 26, 2017 9:15 AM
> > > > > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > > > > edk2-devel@lists.01.org
> > > > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > > > <kevin.w.shaw@intel.com>
> > > > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni
> > > > > files on disk to be
> > > > > UTF-8 without a BOM
> > > > >
> > > > > Mike --
> > > > >
> > > > > This breaks our existing build tools, which assume that a file
> > > > > without a BOM is UTF-16.
> > > > >
> > > > > Tim
> > > > >
> > > > > -----Original Message-----
> > > > > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On
> > > > > Behalf Of Michael Kinney
> > > > > Sent: Tuesday, April 25, 2017 6:07 PM
> > > > > To: edk2-devel@lists.01.org
> > > > > Cc: Jaben Carsey <jaben.carsey@intel.com>; Kevin W Shaw
> > > > > <kevin.w.shaw@intel.com>
> > > > > Subject: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > > > on disk to be UTF-
> > > > > 8 without a BOM
> > > > >
> > > > > https://bugzilla.tianocore.org/show_bug.cgi?id=507
> > > > >
> > > > > Cc: Jaben Carsey <jaben.carsey@intel.com>
> > > > > Cc: Yonghong Zhu <yonghong.zhu@intel.com>
> > > > > Cc: Kevin W Shaw <kevin.w.shaw@intel.com>
> > > > > Contributed-under: TianoCore Contribution Agreement 1.1
> > > > > Signed-off-by: Michael Kinney <michael.d.kinney@intel.com>
> > > > > ---
> > > > >  2_unicode_strings_file_format.md |  9 ++++++---
> > > > >  README.md                        | 27 ++++++++++++++-------------
> > > > >  2 files changed, 20 insertions(+), 16 deletions(-)
> > > > >
> > > > > diff --git a/2_unicode_strings_file_format.md
> > > > > b/2_unicode_strings_file_format.md
> > > > > index 0150c85..7a4a019 100644
> > > > > --- a/2_unicode_strings_file_format.md
> > > > > +++ b/2_unicode_strings_file_format.md
> > > > > @@ -33,7 +33,8 @@
> > > > >
> > > > >  EDK II Unicode files are used for mapping token names to
> > > > > localized strings that are identified by an RFC4646 language code.
> > > > > The format for storing EDK II - Unicode files is UTF-16LE. The
> > > > > character content must be
> > > > UCS-2.
> > > > > +Unicode files on disk is UTF-8 (without a BOM character) or
> > > > > +UTF-16LE (with a BOM character). The character content must be UCS-2.
> > > > >
> > > > >  Strings ends are determined by the first of the following items found:
> > > > >
> > > > > @@ -44,11 +45,13 @@ Strings ends are determined by the first of
> > > > > the following items found:
> > > > >
> > > > >  Comments may appear anywhere within the string file.
> > > > >
> > > > > -All the files must begin with a Unicode BOM character.
> > > > > +All UTF-16LE files must begin with a Unicode BOM character.
> > > > > +All UTF-8 files must not begin with a Unicode BOM character.
> > > > >
> > > > >  **********
> > > > >  **NOTE:** Please make sure you select an editor that supports
> > > > > UCS-2 characters - that can be stored in a UTF-16LE file.
> > > > > +that can be stored in either a UTF-8 (without a BOM character)
> > > > > +or a UTF-16LE file (with a BOM character).
> > > > >  **********
> > > > >
> > > > >  ## 2.1 Common EBNF
> > > > > diff --git a/README.md b/README.md index 63842a1..015aef1 100644
> > > > > --- a/README.md
> > > > > +++ b/README.md
> > > > > @@ -77,16 +77,17 @@ Copyright (c) 2016-2017, Intel Corporation.
> > > > > All rights reserved.
> > > > >
> > > > >  ### Revision History
> > > > >
> > > > > -| Revision          | Description                                                                              | Date            |
> > > > > -| ----------------- | ---------------------------------------------------------------------------------------- | --------------- |
> > > > > -| 1.0               | Initial Release.                                                                         | February 2014   |
> > > > > -| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project.                    | August 2014     |
> > > > > -|                   | Added content related to EDK II Meta-Data Unicode files.                                 |                 |
> > > > > -|                   | Restructured document.                                                                   |                 |
> > > > > -|                   | Removed security and C format GUID definitions, not required for HII or other UNI files. |                 |
> > > > > -|                   | Removed invalid escape code sequences.                                                   |                 |
> > > > > -| 1.2               | Added optional font formatting                                                           | September 2014  |
> > > > > -| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`                                     | April 2015      |
> > > > > -| 1.3               | Added: Syntax for non-ascii characters inside quoted strings.                            | March 2016      |
> > > > > -|                   | Removed: Info on specific consumers (.INF & .DEC) removed.                               |                 |
> > > > > -| 1.4               | Convert to GitBook format                                                                | March 2017      |
> > > > > +| Revision          | Description                                                                                                            | Date            |
> > > > > +| ----------------- | ---------------------------------------------------------------------------------------------------------------------- | --------------- |
> > > > > +| 1.0               | Initial Release.                                                                                                       | February 2014   |
> > > > > +| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project.                                                  | August 2014     |
> > > > > +|                   | Added content related to EDK II Meta-Data Unicode files.                                                               |                 |
> > > > > +|                   | Restructured document.                                                                                                 |                 |
> > > > > +|                   | Removed security and C format GUID definitions, not required for HII or other UNI files.                               |                 |
> > > > > +|                   | Removed invalid escape code sequences.                                                                                 |                 |
> > > > > +| 1.2               | Added optional font formatting                                                                                         | September 2014  |
> > > > > +| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`                                                                   | April 2015      |
> > > > > +| 1.3               | Added: Syntax for non-ascii characters inside quoted strings.                                                          | March 2016      |
> > > > > +|                   | Removed: Info on specific consumers (.INF & .DEC) removed.                                                             |                 |
> > > > > +| 1.4               | Convert to GitBook format                                                                                              | April 2017      |
> > > > > +|                   | [#507](https://bugzilla.tianocore.org/show_bug.cgi?id=507) UNI Spec: Clarify that .uni files maybe UTF-8 without a BOM |                 |
> > > > > --
> > > > > 2.6.3.windows.1
> > > > >
> > > > > _______________________________________________
> > > > > edk2-devel mailing list
> > > > > edk2-devel@lists.01.org
> > > > > https://lists.01.org/mailman/listinfo/edk2-devel