author    | Leo Ufimtsev | 2018-06-13 21:54:20 +0000
committer | Leo Ufimtsev | 2018-06-14 16:51:43 +0000
commit    | 1823ab237d69270276b2e681b68b46bf881f6abf (patch)
tree      | 336fbe79972786c379789cd1dee6aac1c7909021 /bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse
parent    | 1df1be7c4bb508960a7308b328f53bdbc9cb769f (diff)
Bug 535392 [Webkit2] Browser.getText() returns wrong decoding when
setText() contains UTF (code point > 127) characters
Problem:
- Webkit sometimes returns ASCII encoding and sometimes UTF-16LE.
- ASCII is returned when all characters are ASCII; UTF-16LE is returned
  if at least one character is non-ASCII.
- At the binary level, it is not possible to tell the encodings apart
  (see the sketch below).
- Webkit2 currently (v2.20) doesn't have an API to get the encoding of the data.
- A byte order mark is not provided with the return value.
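For illustration, a minimal JDK-only sketch of that ambiguity (a hypothetical
example, not code from this patch): the same byte sequence is structurally
valid in both encodings, so only a guess can pick one.

    import java.nio.charset.StandardCharsets;

    public class EncodingAmbiguity {
        public static void main(String[] args) {
            // "ab" encoded as UTF-16LE: every second byte is a null byte.
            byte[] bytes = "ab".getBytes(StandardCharsets.UTF_16LE); // {0x61, 0x00, 0x62, 0x00}

            // The same bytes also decode cleanly as UTF-8 (a null byte is a legal UTF-8 code unit),
            // so nothing at the binary level says which reading was intended.
            System.out.println(new String(bytes, StandardCharsets.UTF_8));    // "a\0b\0"
            System.out.println(new String(bytes, StandardCharsets.UTF_16LE)); // "ab"
        }
    }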
Solution:
- I wrote an in-house SWT heuristic to try to figure out the encoding
  (usage sketched below). This works well for strings of 2+ characters
  and OK for most single characters.
- Hopefully this is a temporary solution until the Webkit2gtk developers
  give us access to an encoding API. (I will request this in 535392).
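A hypothetical call site showing how the heuristic is meant to be used (only
Converter.byteToStringViaHeuristic comes from this patch; the surrounding
class and inputs are assumptions for illustration):

    import java.nio.charset.StandardCharsets;

    import org.eclipse.swt.internal.Converter;

    public class HeuristicUsage {
        public static void main(String[] args) {
            // Pure ASCII bytes: valid UTF-8 with no null bytes, so the UTF-8/ASCII reading is kept.
            byte[] ascii = "hello".getBytes(StandardCharsets.US_ASCII);
            System.out.println(Converter.byteToStringViaHeuristic(ascii)); // hello

            // Non-ASCII text as Webkit returns it (UTF-16LE): the bytes are not valid UTF-8,
            // so the fallback decoders kick in and UTF-16LE wins.
            byte[] utf16 = "héllo".getBytes(StandardCharsets.UTF_16LE);
            System.out.println(Converter.byteToStringViaHeuristic(utf16)); // héllo
        }
    }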
Testing:
- Added a test suite for the converter (an example case is sketched below).
- Tested via snippet.
- Tested via the provided SWT bot project that sets/gets the text of the 'find'
  man page.
- All SWT JUnit tests work fine.
- A child Eclipse seems to work fine.
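A sketch of the kind of round-trip the converter tests exercise (the class and
test names here are assumptions; the real suite is
org.eclipse.swt.tests.gtk.Test_GtkConverter and its test bodies may differ):

    import static org.junit.Assert.assertEquals;

    import java.nio.charset.StandardCharsets;

    import org.eclipse.swt.internal.Converter;
    import org.junit.Test;

    public class ConverterHeuristicSketch {
        @Test
        public void utf16leCyrillicRoundTrip() {
            String original = "Привет мир";                              // non-ASCII text
            byte[] bytes = original.getBytes(StandardCharsets.UTF_16LE); // what Webkit hands back
            // The null bytes in the UTF-16LE encoding tip the heuristic towards UTF-16LE.
            assertEquals(original, Converter.byteToStringViaHeuristic(bytes));
        }
    }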
Considerations & potential issues with this patch:
- This patch contains a lot of double/triple-byte UTF-8 characters;
  if any of our build tools have trouble with special UTF-8 characters,
  they may have issues with it.
Bug: https://bugs.eclipse.org/bugs/show_bug.cgi?id=535392
Change-Id: I9be1f679676dfacf6415d0ae702cea41aa30cf7b
Signed-off-by: Leo Ufimtsev <lufimtse@redhat.com>
Diffstat (limited to 'bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse')
-rw-r--r-- | bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java | 217 |
1 file changed, 206 insertions, 11 deletions
diff --git a/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java b/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java
index d0ca0677c6..1e5c67ef7e 100644
--- a/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java
+++ b/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java
@@ -11,33 +11,105 @@ package org.eclipse.swt.internal;
 
+import java.io.*;
+import java.nio.*;
+import java.nio.charset.*;
+
 import org.eclipse.swt.internal.gtk.*;
 
 /**
+ * About this class:
+ * #################
  * This class implements the conversions between unicode characters
- * and the <em>platform supported</em> representation for characters.
- * <p>
+ * and the platform supported representation for characters.
+ *
  * Note that, unicode characters which can not be found in the platform
  * encoding will be converted to an arbitrary platform specific character.
  *
- * Note:
- * Regular JNI String conversion usually uses a modified UTF-8, see:
- * https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
- * And in JNI, normally (env*)->GetStringUTFChars(..) is used to convert a javaString into a C string. See:
- * http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars
+ * This class is tested via: org.eclipse.swt.tests.gtk.Test_GtkTextEncoding
+ *
+ * About JNI & string conversion:
+ * ##############################
+ * - Regular JNI String conversion usually uses a modified UTF-8, see: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
+ * - And in JNI, normally (env*)->GetStringUTFChars(..) is used to convert a javaString into a C string.
+ *   See: http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars
+ *
  * However, the modified UTF-8 only works well with C system functions as it doesn't contain embedded nulls
  * and is null terminated.
+ *
  * But because the modified UTF-8 only supports up to 3 bytes (and not up to 4 as regular UTF-8), characters
  * that require 4 bytes (e.g. emojis) are not translated properly from Java to C.
- * To work around this issue, we convert the Java string to a byte array on the Java side manually and then
- * pass it to C. See:
- * http://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
+ *
+ * To work around this issue, we convert the Java string to a byte array on the Java side manually and then pass it to C.
+ * See: http://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
  *
  * Note:
  * Java uses UTF-16 Wide characters internally to represent a string.
  * C uses UTF-8 Multibyte characters (null terminated) to represent a string.
  *
- * </p>
+ * About encoding on Linux/Gtk & its relevance to SWT:
+ * ###################################################
+ *
+ * UTF-* = variable length encoding.
+ *
+ * UTF-8  = minimum is 8 bits, max is 6 bytes, but rarely goes beyond 4 bytes. Gtk & most of the web use this.
+ * UTF-16 = minimum is 16 bits. Java's strings are stored this way.
+ * UTF-16 can be:
+ *    Big Endian    : 65 = 00000000 01000001   # Human friendly, reads left to right.
+ *    Little Endian : 65 = 01000001 00000000   # Intel x86 and also AMD64 / x86-64 series of processors use little-endian [1],
+ *                                             # i.e. we in SWT often have to deal with UTF-16 LE.
+ * Some terminology:
+ *  - "Code point" is the numerical value of a unicode character.
+ *  - All of UTF-* have the same letter to code-point mapping,
+ *    but UTF-8/16/32 have different "back-ends".
+ *
+ * Illustration:
+ *   (char) = (code point) = (back end).
+ *   A      = 65           = 01000001           UTF-8
+ *                         = 00000000 01000001  UTF-16 BE
+ *                         = 01000001 00000000  UTF-16 LE
+ *
+ * - Byte Order Marks (BOM) are a few bytes at the start of a *file* indicating which endianness is used.
+ *   Problem: Gtk/webkit often don't give us BOMs.
+ *   (further reading *3)
+ *
+ * - We can reliably encode a character to a backend (A -> UTF-8/16), but the other way round is
+ *   guess work, since byte order marks are often missing and UTF-16 bytes can technically be valid UTF-8.
+ *   (see Converter.heuristic for details).
+ *   We could improve our heuristic by using something like http://jchardet.sourceforge.net/.
+ *
+ * - Glib has some conversion functions:
+ *     g_utf16_to_utf8
+ *     g_utf8_to_utf16
+ *
+ * - So does Java: (e.g. null terminated UTF-8)
+ *     ("myString" + '\0').getBytes(StandardCharsets.UTF_8)
+ *
+ * - I suggest using the Java functions where possible to avoid memory leaks.
+ *   (Yes, they happen and are a big pain in the ass to find: https://bugs.eclipse.org/bugs/show_bug.cgi?id=533995)
+ *
+ *
+ * Learning about encoding:
+ * ########################
+ * I suggest the following 3 videos to understand ASCII/UTF-8/UTF-16[LE|BE]/UTF-32 encoding:
+ *   Overview: https://www.youtube.com/watch?v=MijmeoH9LT4
+ *   Details:
+ *     Part-1: https://www.youtube.com/watch?v=B1Sf1IhA0j4
+ *     Part-2: https://www.youtube.com/watch?v=-oYfv794R9s
+ *     Part-3: https://www.youtube.com/watch?v=vLBtrd9Ar28
+ *
+ * Also read all of this:
+ *   http://kunststube.net/encoding/
+ * and this:
+ *   https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
+ *
+ * And lastly, a good UTF-8 reference: https://en.wikipedia.org/wiki/UTF-8#Description
+ *
+ * You should now be a master of encoding. I wish you luck on your journey.
+ *
+ * [1] https://en.wikipedia.org/wiki/Endianness
+ * [2] https://en.wikipedia.org/wiki/Byte_order_mark
+ * [3] BOMs: http://unicode.org/faq/utf_bom.html#BOM
  */
 public final class Converter {
     public static final byte [] NullByteArray = new byte [1];
@@ -182,4 +254,127 @@ public static char mbcsToWcs (char ch) {
     return result [0];
 }
 
+/**
+ * Given a byte array with unknown encoding, try to decode it via a (relatively simple) heuristic.
+ * This is useful when the OS/library does not provide us with the encoding.<br>
+ *
+ * The current implementation only supports standard Java charsets but can be extended as needed.
+ * This method could be improved by using http://jchardet.sourceforge.net/ <br>
+ *
+ * Run time is O(a * n), where a is a constant that varies depending on the size of the input n (roughly 1-20).
+ *
+ * @param bytes raw bytes from the OS.
+ * @return String decoded using the most likely encoding.
+ */
+public static String byteToStringViaHeuristic(byte [] bytes) {
+    /*
+     * Technical notes:
+     * - Given a sequence of bytes, UTF-8 and UTF-16 cannot be determined deterministically [1].
+     * - However, UTF-16 has a lot of null bytes when code points are mostly in the 0-255 range (using only the 2nd byte);
+     *   a byte sequence with many null bytes is likely UTF-16.
+     * - Valid UTF-8 technically can contain null bytes, but it's rare.
+     *
+     * Sometimes it can get confused if it receives two non-null bytes, e.g. Ё = (UTF-16 [01,04]).
+     * It can either mean a valid set of UTF-8 characters or a single UTF-16 character.
+     * This issue typically only occurs for very short sequences (1-5 characters) of very special characters.
+     * Improving the heuristic for such corner cases is complicated. We'd have to implement a mechanism
+     * that would be aware of character frequencies and assign a score to the probability of each mapping.
+     *
+     * [1] https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file
+     */
+    // Base cases
+    if ((bytes.length == 0) ||
+        (bytes.length == 1 && bytes[0] == 0)) {
+        return "";
+    }
+
+    // Test if it's valid UTF-8.
+    // Note, ASCII is a subset of UTF-8.
+    try {
+        CharsetDecoder charDecoder = StandardCharsets.UTF_8.newDecoder();
+        charDecoder.onMalformedInput(CodingErrorAction.REPORT);
+        charDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);
+        String text = charDecoder.decode(ByteBuffer.wrap(bytes)).toString();
+
+        // No exception thrown means that we have a valid UTF-8 "bit string". However, a valid UTF-8 bit string doesn't mean it's the correct decoding.
+        // We have to assert correctness via an educated guess.
+        boolean probablyUTF8 = true;
+
+        {
+            // Problem 1: It might be UTF-16, since at the binary level UTF-16 can be valid UTF-8 (null is a valid UTF-8 character).
+            // Solution: Count nulls to try to guess if it's UTF-16.
+            // Verified via:
+            //   org.eclipse.swt.tests.gtk.Test_GtkConverter.test_HeuristicUTF16_letters()
+            //   org.eclipse.swt.tests.gtk.Test_GtkConverter.test_HeuristicUTF16_letter()
+            double nullBytePercentageForUtf16 = 0.01; // if more than this % of null bytes, then it's probably UTF-16.
+            int nullCount = 0;
+            for (byte b : bytes) {
+                if (b == 0)
+                    nullCount++;
+            }
+            double nullPercentage = (double) nullCount / (double) bytes.length;
+            if (nullPercentage > nullBytePercentageForUtf16) {
+                probablyUTF8 = false;
+            }
+        }
+
+        // Problem 2: A valid UTF-8 bit string can map to invalid code points (i.e. undefined unicode).
+        // Solution 2: Verify that every character is a valid code point.
+        if (probablyUTF8) {
+            char [] chars = text.toCharArray();
+
+            for (int i = 0; i < chars.length; i++) {
+                int codePoint = Character.codePointAt(chars, i);
+                if (!Character.isValidCodePoint(codePoint)) {
+                    probablyUTF8 = false;
+                    break;
+                }
+            }
+        }
+
+        // Problem 3: Short 2-byte sequences are very ambiguous.
+        // E.g. the Unicode Hyphen U+2010 '‐' (which, btw, is different from the ASCII U+002D '-' Hyphen-Minus)
+        // can be misunderstood as 0x10 (Data Link Escape) & 0x20 (Space).
+        // Solution: Unless we have two valid alphabetic characters, it's probably a single UTF-16 character.
+        // However, this leads to the problem that single non-alphabetic unicode characters are not recognized correctly.
+        // The code below is left in, in case recognizing alphabetic characters is of higher priority than exotic unicode ones.
+//        if (probablyUTF8) {
+//            if (bytes.length == 2) {
+//                char [] chars = text.toCharArray();
+//                for (int i = 0; i < chars.length; i++) {
+//                    int codePoint = Character.codePointAt(chars, i);
+//                    if (!Character.isAlphabetic(codePoint)) {
+//                        probablyUTF8 = false;
+//                        break;
+//                    }
+//                }
+//            }
+//        }
+
+        if (!probablyUTF8) {
+            return new String (bytes, StandardCharsets.UTF_16LE);
+        } else {
+            return text;
+        }
+    } catch (CharacterCodingException e) {
+    }
+
+    // Invalid UTF-8. Try other character sets.
+    Charset [] commonWebCharSets = new Charset[] {StandardCharsets.UTF_16LE, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16};
+    for (Charset setToTry : commonWebCharSets) {
+        try {
+            CharsetDecoder charDecoder = setToTry.newDecoder();
+            charDecoder.onMalformedInput(CodingErrorAction.REPORT);
+            charDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);
+            return charDecoder.decode(ByteBuffer.wrap(bytes)).toString();
+        } catch (CharacterCodingException e) {}
+    }
+
+    // Could not determine encoding.
+    // Return an error string with a stack trace to help users determine which function led to a failed decoding.
+    StringWriter sw = new StringWriter();
+    new Throwable("").printStackTrace(new PrintWriter(sw));
+    return "SWT: Failed to decode byte buffer. Encoding is not ASCII/UTF-8/UTF-16[LE|BE|BOM]/ISO_8859_1. Stack trace:\n" + sw.toString();
+}
+
 }