charsets: fix handling of 0x00 bytes at the end of GSM encoded strings

When a GSM-7 encoded string is packed, the process of packing the septets into bytes may end up with one last byte holding the last bit of the last septet. When this situation happens, the last byte will end up with the 7 remaining bits set to 0.

When this packed string is unpacked, the logic to unpack will unpack those last 7 bits as an additional septet, with the value 0x00.

The 0x00 value encoded in GSM-7 is the '@' character, EXCEPT when this character is found at the end of the string, in which case the value should be considered as NUL and trigger the end of string already.

So, fix the conversion logic between GSM-7 and UTF-8, so that whenever we find the 0x00 character at the end of the string, we ignore it instead of adding a bogus '@' trailing character.

This commit fixes the "/MM/charsets/gsm7/default-chars" unit test after having it updated to perform the full conversion cycle:
  UTF-8 -> packed GSM7 -> UTF-8
author: Aleksander Morgado <aleksander@aleksander.es> 2020-01-24 09:57:54 +0100
committer: Aleksander Morgado <aleksander@aleksander.es> 2020-01-24 09:57:54 +0100
commit: 2406a5519fb3004f424a880d534121aa6469ad76 (patch)
tree: 51ca3e7543354ad31a3e998e9e78d27cd5fc3fc6 /src
parent: 6c80577bec3ad6b565682dc1836881350efc4ec3 (diff)
1 files changed, 22 insertions, 0 deletions
diff --git a/src/mm-charsets.c b/src/mm-charsets.c
index 023dcf82..31c4a85e 100644
--- a/src/mm-charsets.c
+++ b/src/mm-charsets.c
@@ -424,6 +424,28 @@ mm_charset_gsm_unpacked_to_utf8 (const guint8 *gsm, guint32 len)
         guint8 uchars[4];
         guint8 ulen;
 
+        /*
+         * 	0x00 is NULL (when followed only by 0x00 up to the
+         * 	end of (fixed byte length) message, possibly also up to
+         * 	FORM FEED.  But 0x00 is also the code for COMMERCIAL AT
+         * 	when some other character (CARRIAGE RETURN if nothing else)
+         * 	comes after the 0x00.
+         *  http://unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT
+         *
+         * So, if we find a '@' (0x00) and all the next chars after that
+         * are also 0x00, we can consider the string finished already.
+         */
+        if (gsm[i] == 0x00) {
+            gsize j;
+
+            for (j = i + 1; j < len; j++) {
+                if (gsm[j] != 0x00)
+                    break;
+            }
+            if (j == len)
+                break;
+        }
+
         if (gsm[i] == GSM_ESCAPE_CHAR) {
             /* Extended alphabet, decode next char */
             ulen = gsm_ext_char_to_utf8 (gsm[i+1], uchars);
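
The trailing-septet problem described in the commit message can be reproduced outside of ModemManager with a small standalone sketch. The helpers below (pack_septets, unpack_septets, trim_trailing_nul) are hypothetical names written for illustration only; they are not the functions in src/mm-charsets.c. The sketch assumes the usual LSB-first GSM-7 packing and a naive unpacker that extracts every complete group of 7 bits. The trimming step is a simplified equivalent of the loop added by this commit: stopping at the first 0x00 that is followed only by 0x00 gives the same result as dropping all trailing 0x00 septets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack 7-bit septets into bytes, least significant bit first.
 * 'packed' must have room for (n_septets * 7 + 7) / 8 bytes.
 * Returns the number of bytes written. */
static size_t
pack_septets (const uint8_t *septets, size_t n_septets, uint8_t *packed)
{
    size_t n_bytes = (n_septets * 7 + 7) / 8;
    size_t i;

    assert (septets != NULL && packed != NULL);

    memset (packed, 0, n_bytes);
    for (i = 0; i < n_septets; i++) {
        size_t   bit = i * 7;
        uint16_t v   = (uint16_t) ((septets[i] & 0x7F) << (bit % 8));

        packed[bit / 8] |= (uint8_t) (v & 0xFF);
        /* Bits that overflow the current byte go into the next one */
        if (v >> 8)
            packed[bit / 8 + 1] |= (uint8_t) (v >> 8);
    }
    return n_bytes;
}

/* Naive unpacker: extracts every complete 7-bit group found in the
 * packed buffer, including one made up only of padding bits.
 * 'septets' must have room for (n_bytes * 8) / 7 entries.
 * Returns the number of septets written. */
static size_t
unpack_septets (const uint8_t *packed, size_t n_bytes, uint8_t *septets)
{
    size_t n_septets = (n_bytes * 8) / 7;
    size_t i;

    for (i = 0; i < n_septets; i++) {
        size_t   bit = i * 7;
        uint16_t v   = packed[bit / 8];

        if (bit / 8 + 1 < n_bytes)
            v |= (uint16_t) (packed[bit / 8 + 1] << 8);
        septets[i] = (uint8_t) ((v >> (bit % 8)) & 0x7F);
    }
    return n_septets;
}

/* Return the length of the unpacked string once trailing 0x00 septets
 * are treated as end-of-string instead of '@'. */
static size_t
trim_trailing_nul (const uint8_t *gsm, size_t len)
{
    while (len > 0 && gsm[len - 1] == 0x00)
        len--;
    return len;
}
```

With 7 input septets (49 bits), pack_septets produces 7 bytes, and unpack_septets then reports 8 septets, the last one being the 0x00 built from padding bits. Passing the result through trim_trailing_nul brings the length back to 7. A 0x00 in the middle of the string is left alone and still decodes as '@'.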