Re: [edk2-devel] [PATCH] BaseTools: fix decoding issue in file operation

fengyunhua posted 1 patch 3 years, 5 months ago
Failed in applying to current master (apply log)
BaseTools/Source/Python/Common/LongFilePathSupport.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Re: [edk2-devel] [PATCH] BaseTools: fix decoding issue in file operation
Posted by fengyunhua 3 years, 5 months ago
Tested-by: Yunhua Feng <fengyunhua@byosoft.com.cn>


-----Original Message-----
From: bounce+27952+66316+5049190+8953120@groups.io <bounce+27952+66316+5049190+8953120@groups.io> On Behalf Of Wang, Jian J
Sent: October 16, 2020 15:41
To: devel@edk2.groups.io
Cc: Bob Feng <bob.c.feng@intel.com>; Liming Gao <gaoliming@byosoft.com.cn>; Yuwei Chen <yuwei.chen@intel.com>
Subject: [edk2-devel] [PATCH] BaseTools: fix decoding issue in file operation

The build tool reports a failure on file read, for example when calling
trim to clean preprocessed source files, if the tool is running on an OS
with a non-Western code page and the source file contains non-ASCII
characters.

Even utf-8 has problems when it encounters some characters encoded in
cp1252 (such as 0x92, 0x96, 0xa0, etc.).

Currently, the safest way to read a file in Python code is to use
'latin-1' (iso-8859-1), because it maps every byte in the range 00-FF
and therefore cannot cause encoding/decoding errors. It behaves almost
the same as reading the file in binary mode.
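
As a minimal illustration of the point above (Python 3 only; the byte values are the ones quoted in this message, and the variable names are purely illustrative), the following sketch shows cp1252-style bytes failing under strict utf-8 decoding while latin-1 round-trips them unchanged:

```python
# Bytes that are common in cp1252 text but are not valid UTF-8 on their own
# (values taken from the examples in the message above).
raw = bytes([0x92, 0x96, 0xa0])

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 maps every byte 0x00-0xFF to the code point with the same value,
# so decoding never fails and encoding restores the original bytes.
text = raw.decode('latin-1')
round_trip = text.encode('latin-1')

print(utf8_ok)            # False
print(round_trip == raw)  # True
```

The "almost" in "almost the same as binary mode" matters: a file opened in text mode still goes through newline translation by default, which binary mode does not do.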



cp1252 is similar to latin-1, but it doesn't support encoding '\x80'
to '\x9f' and doesn't support decoding the following bytes:

  '\x81', '\x8d', '\x8f', '\x90', '\x9d'

So if there are utf-8/16 encoded characters in the file, reading it
will sometimes fail.



Refer to the following links for details:

  https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
  https://en.wikipedia.org/wiki/Windows-1252
  https://kb.iu.edu/d/aepu
  https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html


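One can use the following Python 3 code to verify this.

# Check that every code point below 0x100 can be encoded as latin-1.
for i in range(0x100):
    try:
        chr(i).encode('latin-1')
    except UnicodeEncodeError:
        print("    %s cannot encode %02x" % ('latin-1', i))

# Check that every single byte value can be decoded as latin-1.
for i in range(0x100):
    try:
        b = bytes([i])
        b.decode('latin-1')
    except UnicodeDecodeError:
        print("    %s cannot decode %02x" % ('latin-1', i))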
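
The same check can be run against other codecs to compare them with latin-1. This is a sketch for Python 3 only; the helper names `undecodable_bytes` and `unencodable_chars` are hypothetical and not part of BaseTools. With CPython's bundled codecs, cp1252 is expected to report the five undecodable bytes listed earlier, while latin-1 reports none.

```python
def undecodable_bytes(codec):
    """Return the byte values that `codec` cannot decode in strict mode."""
    bad = []
    for i in range(0x100):
        try:
            bytes([i]).decode(codec)
        except UnicodeDecodeError:
            bad.append(i)
    return bad


def unencodable_chars(codec):
    """Return the code points below 0x100 that `codec` cannot encode."""
    bad = []
    for i in range(0x100):
        try:
            chr(i).encode(codec)
        except UnicodeEncodeError:
            bad.append(i)
    return bad


for codec in ('latin-1', 'cp1252'):
    print(codec, 'cannot decode:', [hex(i) for i in undecodable_bytes(codec)])
    print(codec, 'cannot encode:', [hex(i) for i in unencodable_chars(codec)])
```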

This patch adds code to enforce 'latin-1' as the encoding argument
of open() in function OpenLongFilePath() when the open mode is a
text mode. This can solve the file decoding issue completely.


Possibly related BZs:

    https://bugzilla.tianocore.org/show_bug.cgi?id=1434
    https://bugzilla.tianocore.org/show_bug.cgi?id=1637
    https://bugzilla.tianocore.org/show_bug.cgi?id=2578
    https://bugzilla.tianocore.org/show_bug.cgi?id=2709
    https://bugzilla.tianocore.org/show_bug.cgi?id=2829


Cc: Bob Feng <bob.c.feng@intel.com>
Cc: Liming Gao <gaoliming@byosoft.com.cn>
Cc: Yuwei Chen <yuwei.chen@intel.com>
Signed-off-by: Jian J Wang <jian.j.wang@intel.com>
---
 BaseTools/Source/Python/Common/LongFilePathSupport.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/BaseTools/Source/Python/Common/LongFilePathSupport.py b/BaseTools/Source/Python/Common/LongFilePathSupport.py
index 38c4396544..c8dce077f2 100644
--- a/BaseTools/Source/Python/Common/LongFilePathSupport.py
+++ b/BaseTools/Source/Python/Common/LongFilePathSupport.py
@@ -30,7 +30,8 @@ def LongFilePath(FileName):
 # wrap open to support opening a long file path
 #
 def OpenLongFilePath(FileName, Mode='r', Buffer= -1):
-    return open(LongFilePath(FileName), Mode, Buffer)
+    Encoding = None if 'b' in Mode else 'latin-1'
+    return open(LongFilePath(FileName), Mode, Buffer, Encoding)
 
 def CodecOpenLongFilePath(Filename, Mode='rb', Encoding=None, Errors='strict', Buffering=1):
     return codecs.open(LongFilePath(Filename), Mode, Encoding, Errors, Buffering)

-- 
2.24.0.windows.2



-=-=-=-=-=-=
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#66316): https://edk2.groups.io/g/devel/message/66316
Mute This Topic: https://groups.io/mt/77546105/5049190
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [fengyunhua@byosoft.com.cn]
-=-=-=-=-=-=








Re: [edk2-devel] [PATCH] BaseTools: fix decoding issue in file operation
Posted by Bob Feng 3 years, 5 months ago
This patch is incompatible with Python 2.

https://docs.python.org/2.7/library/functions.html#open
open(name[, mode[, buffering]])

In Python 2, open() has no encoding argument.


Thanks,
Bob
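
One possible way to address the Python 2 concern, offered only as a sketch and not as the fix adopted by the project: io.open() accepts an encoding argument on both Python 2.6+ and Python 3. The function name open_long_file_path is hypothetical, and the LongFilePath() call from the patch is omitted so that the snippet stands alone. Note that on Python 2, io.open() in text mode returns unicode objects rather than str, which callers would have to tolerate.

```python
import io
import os
import tempfile


def open_long_file_path(file_name, mode='r', buffering=-1):
    # Binary modes must not be given an encoding; text modes use latin-1,
    # mirroring the logic of the patch under discussion.
    encoding = None if 'b' in mode else 'latin-1'
    return io.open(file_name, mode, buffering, encoding)


# Usage: a file containing bytes that strict utf-8 and cp1252 cannot decode.
fd, path = tempfile.mkstemp()
os.close(fd)
with open_long_file_path(path, 'wb') as f:
    f.write(b'\x81\x92 text')
with open_long_file_path(path, 'r') as f:
    content = f.read()
os.remove(path)

print(len(content))  # 7
```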

-----Original Message-----
From: fengyunhua <fengyunhua@byosoft.com.cn> 
Sent: Monday, October 19, 2020 4:55 PM
To: devel@edk2.groups.io; Wang, Jian J <jian.j.wang@intel.com>
Cc: Feng, Bob C <bob.c.feng@intel.com>; 'Liming Gao' <gaoliming@byosoft.com.cn>; Chen, Christine <yuwei.chen@intel.com>
Subject: Re: [edk2-devel] [PATCH] BaseTools: fix decoding issue in file operation

Tested-by: Yunhua Feng <fengyunhua@byosoft.com.cn>

