BorlandTalk.com Borland discussion newsgroups
Warrick Wilson Guest
|
Posted: Wed Feb 14, 2007 3:40 am Post subject: Handling Unicode data written as \uXXXX? |
|
|
I'm maintaining a program that currently loads XML via MSXML. It consumes
XML files that contain a list of filenames. To support international
markets, the program that creates this list of filenames (written in Java)
was modified to handle "Unicode". It now outputs filenames in a form like
this (heavily edited; I hope I kept the relevant portions):
<?xml version="1.0" encoding="UTF-8"?>
<root>
<fname>\u00d6\u00fa\u00d1\u00a7\u00d6\u00aa.JPG</fname>
</root>
However, when my Delphi program (Delphi 7 Pro currently) goes to load this,
I'm getting a filename of '\u00d6\u00fa\u00d1\u00a7\u00d6\u00aa.JPG', where
the expectation was "it's reading Unicode so it will convert the escaped
strings". That's not happening. Not surprisingly, the file with the long
name with slashes in it doesn't exist, and now my program is dead in the
water.
Can someone point me to a tutorial on how to handle these sorts of
situations? Can I somehow recover from this issue? I can move to a different
XML parser/handler, if needed. I tried a quick experiment with TXMLDocument
and got the same results.
Thanks. |
Remy Lebeau (TeamB) Guest
|
Posted: Wed Feb 14, 2007 3:48 am Post subject: Re: Handling Unicode data written as \uXXXX? |
|
|
"Warrick Wilson" <warrickw (AT) mercuryonline (DOT) com> wrote in message
news:45d2305d$1 (AT) newsgroups (DOT) borland.com...
| Quote: | when my Delphi program (Delphi 7 Pro currently) goes to load this,
I'm getting a filename of
'\u00d6\u00fa\u00d1\u00a7\u00d6\u00aa.JPG',
where the expectation was "it's reading Unicode so it will convert
the
escaped strings". That's not happening.
|
How are you reading the XML to begin with?
Gambit |
Remy Lebeau (TeamB) Guest
|
Posted: Wed Feb 14, 2007 3:52 am Post subject: Re: Handling Unicode data written as \uXXXX? |
|
|
"Warrick Wilson" <warrickw (AT) mercuryonline (DOT) com> wrote in message
news:45d2305d$1 (AT) newsgroups (DOT) borland.com...
| Quote: | However, when my Delphi program (Delphi 7 Pro currently) goes to
load this,
I'm getting a filename of
'\u00d6\u00fa\u00d1\u00a7\u00d6\u00aa.JPG', where
the expectation was "it's reading Unicode so it will convert the
escaped
strings". That's not happening.
|
The escaping is not occurring at the XML level to begin with, as XML
does not encode Unicode data that way (its own escape mechanism is the
numeric character reference, such as &#xD6;), so the XML itself is not
the problem. The Java program is escaping the data before putting it
into the XML, so you will have to manually decode the data after
pulling it from the XML. Otherwise, the Java program should be updated
to output Unicode data in a standard XML manner: in this case,
encoding the data as UTF-8 and putting it into the XML as-is. Your app
can then decode the UTF-8 afterwards (the VCL has a function for
that).
Gambit |
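For reference, a minimal sketch of what the second option might look like on the Java side: let an XML serializer write the real characters and declare the encoding, instead of escaping the names by hand. The class and method names (`FileListWriter`, `writeFileList`) are hypothetical, the element names are taken from the sample in the first post, and nothing in this thread says how the actual Java program builds its output.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not the original program's code.
final class FileListWriter {

    // Builds <root><fname>...</fname></root> and serializes it as UTF-8 bytes.
    // The names are stored as ordinary Java strings; no manual escaping is done.
    static byte[] writeFileList(String... fileNames) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        for (String name : fileNames) {
            Element fname = doc.createElement("fname");
            fname.setTextContent(name);
            root.appendChild(fname);
        }

        // Ask the serializer for UTF-8 output; it also writes the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Two CJK characters followed by an extension (sample input only).
        byte[] xml = writeFileList("\u4e2d\u6587.JPG");
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

A file written this way should load in MSXML with the node text already holding the real characters as a WideString, so no extra decoding step would be needed on the Delphi side.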
Warrick Wilson Guest
|
Posted: Wed Feb 14, 2007 4:21 am Post subject: Re: Handling Unicode data written as \uXXXX? |
|
|
"Remy Lebeau (TeamB)" <no.spam (AT) no (DOT) spam.com> wrote in message
news:45d2325d$1 (AT) newsgroups (DOT) borland.com...
| Quote: |
"Warrick Wilson" <warrickw (AT) mercuryonline (DOT) com> wrote in message
news:45d2305d$1 (AT) newsgroups (DOT) borland.com...
when my Delphi program (Delphi 7 Pro currently) goes to load this,
I'm getting a filename of
'\u00d6\u00fa\u00d1\u00a7\u00d6\u00aa.JPG',
where the expectation was "it's reading Unicode so it will convert
the
escaped strings". That's not happening.
How are you reading the XML to begin with?
|
I've yanked the relevant pieces from the current XML handling, which has
worked fine for several years (prior to the internationalization push). I've
removed all the error handling, etc... This is the basic couple of lines
that grab the file we get sent.
uses
  MSXML_TLB;

var
  DomDoc: TDOMDocument;

  // later, inside the loading routine:
  DomDoc := TDOMDocument.Create(nil);
  DomDoc.load(SrcXMLFile);
There's some use of the file later on, as you'd expect. As you mentioned in
your other post, the new file encoding is a result of whatever's happening
in the source program. I'll have to go talk to that developer and find out
what's getting used there as the encoding/decoding library or routine that's
producing the output I showed earlier.
Thanks for the fast feedback. |
Remy Lebeau (TeamB) Guest
|
Posted: Wed Feb 14, 2007 4:42 am Post subject: Re: Handling Unicode data written as \uXXXX? |
|
|
"Warrick Wilson" <warrickw (AT) mercuryonline (DOT) com> wrote in message
news:45d239fa$1 (AT) newsgroups (DOT) borland.com...
| Quote: | DomDoc := TDomDocument.Create ( nil );
DomDoc.load ( SrcXMLFile );
|
That is showing how you are loading the raw XML data, but you did not
show how you are actually pulling values out of it afterwards.
Gambit |
Warrick Wilson Guest
|
Posted: Wed Feb 14, 2007 6:12 am Post subject: Re: Handling Unicode data written as \uXXXX? |
|
|
"Remy Lebeau (TeamB)" <no.spam (AT) no (DOT) spam.com> wrote in message
news:45d23ee0$1 (AT) newsgroups (DOT) borland.com...
| Quote: |
"Warrick Wilson" <warrickw (AT) mercuryonline (DOT) com> wrote in message
news:45d239fa$1 (AT) newsgroups (DOT) borland.com...
DomDoc := TDomDocument.Create ( nil );
DomDoc.load ( SrcXMLFile );
That is showing how you are loading the raw XML data, but you did not
show how you are actually pulling values out of it afterwards.
|
The node that has the filename is represented as an IXMLDOMNode during a
parse of the list, and the Text part is written to a TTntStringList as a
WideString.
I'd post source code, but we have rules here about doing that, and this
stuff is spread across a few routines that handle nodes of different types
and having extra nodes for further descriptions. When I trace through the
code, the text portion of the node in question looks like
'\u0123\u0321.jpg'. It's basically doing what I told it to do, not what I
wanted (or expected) it to do.
A while back (again, a couple of years) the Delphi portion of the system was
overhauled and we added the Tntware components and changed to use
WideStrings, etc. We actually had a different front end - written in Delphi
then - that could pass Chinese filenames to this section fine. The new
wrinkle is the recent change in the Java front end we're using now (which
until it got "internationalized" was working fine, and works fine now as
long as the filename doesn't have Chinese characters).
I'm actually going to delve into the Java side of this either tonight or
tomorrow and see exactly what's getting done on that side. If it's something
"magical", then I may be able to unmagic it, or I may have to find some
alternate representation. I'd been told that this approach was the
"standard" way of doing things, and that XML parsers would handle it no
problem. I was told that by the guy who'd implemented the Java side. |
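If the escapes do have to be undone by the consumer, the decoding step itself is small. The sketch below is written in Java only to show the logic (the same loop would port to a Delphi function returning a WideString). `UnicodeEscapes` and `decode` are hypothetical names, and the sketch assumes the producer emits exactly a backslash, a `u` and four hex digits per UTF-16 code unit, with no convention for escaping a literal backslash. That convention is an assumption; the thread does not say what the Java side really does.

```java
// Hypothetical helper illustrating the manual decoding step.
final class UnicodeEscapes {

    // Replaces every backslash-u escape (backslash, 'u', four hex digits)
    // with the UTF-16 code unit it names. Anything else is copied unchanged,
    // including incomplete or malformed escapes.
    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length() && isHex4(s, i + 2)) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'from' are all hex digits.
    private static boolean isHex4(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The node text exactly as the XML parser hands it over.
        String fromXml = "\\u00d6\\u00fa\\u00d1\\u00a7\\u00d6\\u00aa.JPG";
        System.out.println(decode(fromXml));
    }
}
```

Note that the sample from the first post decodes to six characters in the U+00A7 to U+00FA range (Latin-1 letters and symbols), not to Chinese ideographs. That may just be a side effect of the heavy editing of the sample, but it would be worth checking what the Java side is feeding into its escaping routine.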
Remy Lebeau (TeamB) Guest
|
Posted: Wed Feb 14, 2007 7:55 am Post subject: Re: Handling Unicode data written as \uXXXX? |
|
|
"Warrick Wilson" <warrickw (AT) mercuryonline (DOT) com> wrote in message
news:45d253f6$1 (AT) newsgroups (DOT) borland.com...
| Quote: | I'd been told that this approach was the "standard" way of
doing things, and that XML parsers would handle it no problem
|
You were told wrong. XML parsers DO NOT automatically parse the
contents of string values. They are only responsible for
encoding/decoding the string data in relation to the XML's encoding
scheme (in this case, "UTF-8"). The content of the overall XML must
match the declared encoding. In this situation, the Java code is
manually translating (not encoding) its Unicode data into a format
(namely, plain ASCII) that happens to be compatible with UTF-8,
instead of actually encoding the Unicode data as true UTF-8. When
your code reads back the XML values later on, it receives the
ASCII text that the Java code manually generated. There is nothing
XML parsers can do about that.
| Quote: | I was told that by the guy who'd implemented the Java side.
|
Then he either doesn't know what he is doing to begin with, or he is
using an XML library that is faulty.
Gambit |
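The point above is easy to check with any conforming parser. The sketch below uses the JDK's built-in DOM parser as a stand-in for MSXML (an assumption made only so the example is self-contained; the behaviour shown follows from the XML specification, not from a particular parser). `ParserEscapeDemo`, `wrap` and `firstFname` are hypothetical names. It feeds the parser the same one-character filename three ways: as literal backslash-u text, as an XML numeric character reference, and as the character itself encoded in UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; the JDK parser stands in for MSXML here.
final class ParserEscapeDemo {

    // Wraps the given raw XML content in the document shape from the first post.
    static String wrap(String fnameContent) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<root><fname>" + fnameContent + "</fname></root>";
    }

    // Parses the document from UTF-8 bytes and returns the first <fname> text.
    static String firstFname(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getElementsByTagName("fname").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Literal backslash-u text: handed back untouched, 10 characters.
        System.out.println(firstFname(wrap("\\u00d6.JPG")).length());
        // XML numeric character reference: resolved by the parser, 5 characters.
        System.out.println(firstFname(wrap("&#xD6;.JPG")).length());
        // The character itself, stored as UTF-8 bytes: also 5 characters.
        System.out.println(firstFname(wrap("\u00d6.JPG")).length());
    }
}
```

Only the last two forms are something an XML parser is expected to understand; the first is just ten ASCII characters as far as XML is concerned.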