02.13.2003 at 10:47AM PST, ID: 20512969
character encoding (unicode to utf-8) conversion problem

Tags: java, encoding

I have run into a problem that I can't seem to find a solution to.

My users are copying and pasting from MS Word. My DB is Oracle with its encoding set to "UTF-8".

Using Oracle's thin driver it automatically converts to the DB's default character set.

When Java encodes the Unicode string to UTF-8 and runs into an unknown character (typically one in the high-ASCII range), it substitutes '?' or some other weird character.

How do I prevent this?
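For reference, here is a minimal sketch of the substitution behaviour described above. It assumes the pasted Word character is a curly apostrophe (U+2019); the class name SubstitutionDemo and the hex helper are hypothetical. It only shows what String.getBytes produces for three charsets, not what the Oracle driver does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: prints the bytes produced for one "smart quote".
class SubstitutionDemo {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumption: U+2019 stands in for a character pasted from Word.
        String curly = "\u2019";

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println("UTF-8:      " + hex(curly.getBytes(StandardCharsets.UTF_8)));

        // Charsets that lack the character fall back to their replacement byte, '?' (0x3F).
        System.out.println("US-ASCII:   " + hex(curly.getBytes(StandardCharsets.US_ASCII)));
        System.out.println("ISO-8859-1: " + hex(curly.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

In this sketch the UTF-8 line shows three bytes while the other two show a single 3F, which suggests the '?' tends to appear wherever a narrower charset is involved in the chain rather than in the UTF-8 step itself.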

I tried different encodings using a simple driver like:
class UnicodeConversionTest
{
    public static void main(String[] args)
    {
        try {
            String str = new String("`test3`");
            String utfStr = new String(str.getBytes("UTF-8"), "UTF-8");
            System.out.println("Converted:" + str + " to:" + utfStr);
        } catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
}

But that didn't work.  Then I tried a more elaborate conversion:
import sun.io.ByteToCharConverter;

public class UnicodeTest {

    public static void main(String[] args) {
        try {
            // Decode bytes as US-ASCII, substituting ' ' for anything that cannot be mapped.
            ByteToCharConverter fromUnicode = ByteToCharConverter.getConverter("US-ASCII");
            char[] subChars = { ' ' };
            fromUnicode.setSubstitutionMode(true);
            fromUnicode.setSubstitutionChars(subChars);
            String originalStr = new String("test3");
            // getBytes() with no argument uses the platform default encoding.
            char[] convertedChars = fromUnicode.convertAll(originalStr.getBytes());
            String convertedStr = new String(convertedChars);
            //String convertedStr = new String(originalStr.getBytes("US-ASCII"), "US-ASCII");
            System.out.println("String:" + originalStr + " converted to:" + convertedStr);
        } catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
}

I also tried a variation of the second code snippet that inserts into the DB, just to see the results, and it was a no-go.

I don't want '?' replacing the unknown chars. I would rather strip them or replace them with ' ', but I haven't been able to get that to work (using the second bit of code).
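One way the strip-or-replace behaviour could be sketched is with the public java.nio.charset API instead of the internal sun.io converters. This is only an illustration under assumptions: the class name LossyEncodeDemo and the method encodeLossy are hypothetical, and US-ASCII is used as the target charset purely to force unmappable characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes text to a target charset, either dropping
// unmappable characters or replacing each of them with a space.
class LossyEncodeDemo {

    static String encodeLossy(String input, Charset target, boolean strip) {
        CodingErrorAction action = strip ? CodingErrorAction.IGNORE : CodingErrorAction.REPLACE;
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        if (!strip) {
            // Use ' ' instead of the charset's default replacement ('?' for US-ASCII).
            encoder.replaceWith(new byte[] { (byte) ' ' });
        }
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(input));
            // Decode again so the result can be inspected as a String.
            return target.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE/REPLACE, but the API declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: U+2019 stands in for a character pasted from Word.
        String pasted = "it\u2019s";
        System.out.println("[" + encodeLossy(pasted, StandardCharsets.US_ASCII, false) + "]");
        System.out.println("[" + encodeLossy(pasted, StandardCharsets.US_ASCII, true) + "]");
    }
}
```

With the sample input, the first call yields "it s" and the second "its". Whether cleaning the string before it reaches the driver is the right fix depends on where the '?' is actually introduced, which this sketch does not determine.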

Any ideas on what I am doing wrong?

Thanx,
CJ
Answered By: orangehead911
Expert Since: 01/23/2003
Accepted Solutions: 680
orangehead911 has been an Expert for 5 years 11 months, during which he has posted 3227 comments and answered 680 questions. orangehead911 is just one of 1203 experts in the Java Programming Language Zone. 8 experts collaborated on this answer, which was graded an "A" by the asker.
 
 