Skip to main content
aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorLeo Ufimtsev2018-06-13 21:54:20 +0000
committerLeo Ufimtsev2018-06-14 16:51:43 +0000
commit1823ab237d69270276b2e681b68b46bf881f6abf (patch)
tree336fbe79972786c379789cd1dee6aac1c7909021 /bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse
parent1df1be7c4bb508960a7308b328f53bdbc9cb769f (diff)
downloadeclipse.platform.swt-1823ab237d69270276b2e681b68b46bf881f6abf.tar.gz
eclipse.platform.swt-1823ab237d69270276b2e681b68b46bf881f6abf.tar.xz
eclipse.platform.swt-1823ab237d69270276b2e681b68b46bf881f6abf.zip
Bug 535392 [Webkit2] Browser.getText() returns wrong decoding when
setText() contains utf (code point >127) characters Problem: - Webkit sometimes returns ASCII encoding and sometimes UTF-16LE - ASCII is returned when all characters are ascii, UFT-16LE is returned if at least one character is non-ascii. - At the binary level, it is not possible to tell encodings apart - webkit2 currently (v2.20) doesn't have api to get encoding of data. - Byte order mark is not provided with return value. Solution: - I wrote a in-house-made SWT heuristic to try and figure out encoding. This works well for strings that are 2+ characters and OK for most single characters. - Hopefully this is a temporary solution until Webkit2gtk developers give us access to encoding api. (I will request in 535392). Testing: - Added testing suite for converter. - Tested via snippet. - Tested via given SWT bot project that sets/gets text of 'find' man page. - All SWT jUnit tests work fine. - Child eclipse seems to work fine. Considerations & potential issues with this patch: - This patch contains a lot of double/triple byte UTF characters, if some of our build tools have issues with special UTF characters, then they may have issues. Bug: https://bugs.eclipse.org/bugs/show_bug.cgi?id=535392 Change-Id: I9be1f679676dfacf6415d0ae702cea41aa30cf7b Signed-off-by: Leo Ufimtsev <lufimtse@redhat.com>
Diffstat (limited to 'bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse')
-rw-r--r--bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java217
1 files changed, 206 insertions, 11 deletions
diff --git a/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java b/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java
index d0ca0677c6..1e5c67ef7e 100644
--- a/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java
+++ b/bundles/org.eclipse.swt/Eclipse SWT/gtk/org/eclipse/swt/internal/Converter.java
@@ -11,33 +11,105 @@
package org.eclipse.swt.internal;
+import java.io.*;
+import java.nio.*;
+import java.nio.charset.*;
+
import org.eclipse.swt.internal.gtk.*;
/**
+ * About this class:
+ * #################
* This class implements the conversions between unicode characters
- * and the <em>platform supported</em> representation for characters.
- * <p>
+ * and the platform supported representation for characters.
+ *
* Note that, unicode characters which can not be found in the platform
* encoding will be converted to an arbitrary platform specific character.
*
- * Note:
- * Regular JNI String conversion usually uses a modified UTF-8, see:
- * https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
- * And in JNI, normally (env*)->GetStringUTFChars(..) is used to convert a javaString into a C string. See:
- * http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars
+ * This class is tested via: org.eclipse.swt.tests.gtk.Test_GtkTextEncoding
+ *
+ * About JNI & string conversion:
+ * #############################
+ * - Regular JNI String conversion usually uses a modified UTF-8, see: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
+ * - And in JNI, normally (env*)->GetStringUTFChars(..) is used to convert a javaString into a C string.
+ * See: http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars
+ *
* However, the modified UTF-8 only works well with C system functions as it doesn't contain embedded nulls
* and is null terminated.
+ *
* But because the modified UTF-8 only supports up to 3 bytes (and not up to 4 as regular UTF-8), characters
* that require 4 bytes (e.g emojos) are not translated properly from Java to C.
- * To work around this issue, we convert the Java string to a byte array on the Java side manually and then
- * pass it to C. See:
- * http://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
+ *
+ * To work around this issue, we convert the Java string to a byte array on the Java side manually and then pass it to C.
+ * See: http://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
*
* Note:
* Java uses UTF-16 Wide characters internally to represent a string.
* C uses UTF-8 Multibyte characters (null terminated) to represent a string.
*
- * </p>
+ * About encoding on Linux/Gtk & it's relevance to SWT:
+ * ####################################################
+ *
+ * UTF-* = variable length encoding.
+ *
+ * UTF-8 = minimum is 8 bits, max is 6 bytes, but rarely goes beyond 4 bytes. Gtk & most of web uses this.
+ * UTF-16 = minimum is 16 bits. Java's string are stored this way.
+ * UTF-16 can be
+ * Big Endian : 65 = 00000000 01000001 # Human friendly, reads left to right.
+ * Little Endian : 65 = 01000001 00000000 # Intel x86 and also AMD64 / x86-64 series of processors use the little-endian [1]
+ * # i.e, we in SWT often have to deal with UTF-16 LE
+ * Some terminology:
+ * - "Code point" is the numerical value of unicode character.
+ * - All of UTF-* have the same letter to code-point mapping,
+ * but UTF-8/16/32 have different "back-ends".
+ *
+ * Illustration:
+ * (char) = (code point) = (back end).
+ * A = 65 = 01000001 UTF-8
+ * = 00000000 01000001 UTF-16 BE
+ * = 01000001 00000000 UTF-16 LE
+ *
+ * - Byte Order Marks (BOM) are a few bytes at the start of a *file* indicating which endianess is used.
+ * Problem: Gtk/webkit often don't give us BOM's.
+ * (further reading *3)
+ *
+ * - We can reliably encode character to a backend (A -> UTF-8/16), but the other way round is
+ * guess work since byte order marks are often missing and UTF-16 bits are technically valid UTF-8.
+ * (see Converter.heuristic for details).
+ * We could improve our heuristic by using something like http://jchardet.sourceforge.net/.
+ *
+ * - Glib has some conversion functions:
+ * g_utf16_to_utf8
+ * g_utf8_to_utf16
+ *
+ * - So does java: (e.g null terminated UTF-8)
+ * ("myString" + '\0').getBytes(StandardCharsets.UTF-8)
+ *
+ * - I suggest using Java functions where possible to avoid memory leaks.
+ * (Yes, they happen and are big-pain-in-the-ass to find https://bugs.eclipse.org/bugs/show_bug.cgi?id=533995)
+ *
+ *
+ * Learning about encoding:
+ * #########################
+ * I suggest the following 3 videos to understand ASCII/UTF-8/UTF-16[LE|BE]/UTF-32 encoding:
+ * Overview: https://www.youtube.com/watch?v=MijmeoH9LT4
+ * Details:
+ * Part-1: https://www.youtube.com/watch?v=B1Sf1IhA0j4
+ * Part-2: https://www.youtube.com/watch?v=-oYfv794R9s
+ * Part-3: https://www.youtube.com/watch?v=vLBtrd9Ar28
+ *
+ * Also read all of this:
+ * http://kunststube.net/encoding/
+ * and this:
+ * https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
+ *
+ * And lastly, good utf-8 reference: https://en.wikipedia.org/wiki/UTF-8#Description
+ *
+ * You should now be a master of encoding. I wish you luck on your journey.
+ *
+ * [1] https://en.wikipedia.org/wiki/Endianness
+ * [2] https://en.wikipedia.org/wiki/Byte_order_mark
+ * [3] BOM's: http://unicode.org/faq/utf_bom.html#BOM
*/
public final class Converter {
public static final byte [] NullByteArray = new byte [1];
@@ -182,4 +254,127 @@ public static char mbcsToWcs (char ch) {
return result [0];
}
+/**
+ * Given a byte array with unknown encoding, try to decode it via (relatively simple) heuristic.
+ * This is useful when we're not provided the encoding by OS/library.<br>
+ *
+ * Current implementation only supports standard java charsets but can be extended as needed.
+ * This method could be improved by using http://jchardet.sourceforge.net/ <br>
+ *
+ * Run time is O(a * n) where a is a constant that varies depending on the size of input n, but roughly 1-20)
+ *
+ * @param bytes raw bits from the OS.
+ * @return String based on the most pop
+ */
+public static String byteToStringViaHeuristic(byte [] bytes) {
+ /*
+ * Technical notes:
+ * - Given a sequence of bytes, UTF-8 and UTF-16 cannot determined deterministically (1*).
+ * - However, UTF-16 has a lot of null bytes when code points are mostly in the 0-255 range (using only 2nd byte),
+ * a byte sequence with many null bytes is likely UTF-16.
+ * - Valid UTF-8 technically can contain null bytes, but it's rare.
+ *
+ * Some times it can get confused if it receives two non-null bytes. e.g Ё = (UTF-16 [01,04])
+ * It can either mean a valid set of UTF-8 characters or a single UTF-16 character.
+ * This issue typically only occurs for very short sequences 1-5 characters of very special characters).
+ * Improving the heuristic for such corner cases is complicated. We'd have to implement a mechanism
+ * that would be aware of character frequencies and assign a score to the probability of each mapping.
+ *
+ * [1] https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file
+ */
+ // Base cases
+ if ((bytes.length == 0) ||
+ (bytes.length == 1 && bytes[0] == 0)) {
+ return "";
+ }
+
+ // Test if it's valid UTF-8.
+ // Note, ASCII is a subset of UTF-8.
+ try {
+ CharsetDecoder charDecoder = StandardCharsets.UTF_8.newDecoder();
+ charDecoder.onMalformedInput(CodingErrorAction.REPORT);
+ charDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);
+ String text = charDecoder.decode(ByteBuffer.wrap(bytes)).toString();
+
+ // No exception thrown means that we have valid UTF-8 "bit string". However, valid UTF-8 bit string doesn't mean it's the corect decoding.
+ // We have assert correctness via an educated guess
+ boolean probablyUTF8 = true;
+
+ {
+ // Problem 1: It might be UTF-16 since at the binary level UTF-16 can be valid UTF-8. (null is a valid utf-8 character).
+ // Solution: Count nulls to try to guess if it's UTF-16.
+ // Verified via
+ // org.eclipse.swt.tests.gtk.Test_GtkConverter.test_HeuristicUTF16_letters()
+ // org.eclipse.swt.tests.gtk.Test_GtkConverter.test_HeuristicUTF16_letter()
+ double nullBytePercentageForUtf16 = 0.01; // if more than this % null bytes, then it's probably utf-16.
+ int nullCount = 0;
+ for (byte b : bytes) {
+ if (b == 0)
+ nullCount++;
+ }
+ double nullPercentage = (double) nullCount / (double) bytes.length;
+ if (nullPercentage > nullBytePercentageForUtf16) {
+ probablyUTF8 = false;
+ }
+ }
+
+ // Problem 2: Valid UTF-8 bit string can map to invalid code points (i.e undefined unicode)
+ // Solution 2: verify that every character is a valid code point.
+ if (probablyUTF8) {
+ char [] chars = text.toCharArray();
+
+ for (int i = 0; i < chars.length; i++) {
+ int codePoint = Character.codePointAt(chars, i);
+ if (!Character.isValidCodePoint(codePoint)) {
+ probablyUTF8 = false;
+ break;
+ }
+ }
+ }
+
+ // Problem 3: Short 2-byte sequences are very ambiguous.
+ // E.g Unicode Hyphen U+2010 '‐' ( which btw different from the ascii U+002D '-' Hyphen-Minus)
+ // can be miss-understood as 16 (Synchronous Idle) & 32 (Space).
+ // Solution: Unless we have two valid alphabet characters, it's probably a single utf-16 character.
+ // However, this leads to the problem that single non-alphabetic unicode characters are not recognized correctly.
+ // Below code is left in case recognizing alphabetic characters is of higher priority than exotic unicode once.
+// if (probablyUTF8) {
+// if (bytes.length == 2) {
+// char [] chars = text.toCharArray();
+// for (int i = 0; i < chars.length; i++) {
+// int codePoint = Character.codePointAt(chars, i);
+// if (!Character.isAlphabetic(codePoint)) {
+// probablyUTF8 = false;
+// break;
+// }
+// }
+// }
+// }
+
+ if (!probablyUTF8) {
+ return new String (bytes, StandardCharsets.UTF_16LE);
+ } else {
+ return text;
+ }
+ } catch (CharacterCodingException e) {
+ }
+
+ // Invalid UTF-8. Try other character sets.
+ Charset [] commonWebCharSets = new Charset[] {StandardCharsets.UTF_16LE, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16};
+ for (Charset setToTry : commonWebCharSets) {
+ try {
+ CharsetDecoder charDecoder = setToTry.newDecoder();
+ charDecoder.onMalformedInput(CodingErrorAction.REPORT);
+ charDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);
+ return charDecoder.decode(ByteBuffer.wrap(bytes)).toString();
+ } catch (CharacterCodingException e) {}
+ }
+
+ // Could not determine encoding.
+ // Return error string with stack trace to help users determine which function lead to a failed decoding.
+ StringWriter sw = new StringWriter();
+ new Throwable("").printStackTrace(new PrintWriter(sw));
+ return "SWT: Failed to decode byte buffer. Encoding is not ASCII/UTF-8/UTF-16[LE|BE|BOM]/ISO_8859_1. Stack trace:\n" + sw.toString();
+}
+
}

Back to the top