BorlandTalk.com Forum Index BorlandTalk.com
Borland discussion newsgroups
 

Re: Function for checking binary content of a file FAILS...

 
Forum: C++ Builder (Language C++)
Oren Halvani
Guest





Posted: Thu Aug 11, 2005 1:38 am    Post subject: Re: Function for checking binary content of a file FAILS...



"Remy Lebeau (TeamB)" <no.spam (AT) no (DOT) spam.com> wrote in
message news:42faa7bf$1 (AT) newsgroups (DOT) borland.com...

Quote:
If you are testing for only single-byte ASCII characters then you would
never be able to check for values > 255. Those characters have to be
encoded in either MBCS or Unicode since they do not fit in a single byte.

how should i check for multibyte chars...? i've never touched MBCS :-(

Quote:
That line does not check anything at all. That line simply moves the
current reading position to the end of the file, nothing else.

thanks for the info, i didn't write this function - so i didn't know
what FileSeek(F, 0, 2) actually does...

Quote:
As it should be, because Hebrew requires Unicode, not ASCII. Hebrew
characters are in the 0x0590-0x05FF range, many characters of which
contain bytes that will fail your ASCII test.

OK, so i guess arabic, thai, korean and other eastern languages are also
multibyte..correct?
Remy, how can i check files containing those MB chars...? does the VCL have a
special datatype for it...? i will not be able to use:

typedef Set<BYTE, 0, 255> TByteSet;

because it's single-byte, correct...?
so what else is possible, if i cannot use the BYTE datatype ??

Quote:
TByteSet OkChars = TByteSet() << 9 << 10 << 13 << 26;

Why are you including character 26? That is not a textual character.

look above, i didn't code it...

Quote:
Please post such files to the .attachments group instead, so that they can
be preserved in their original format. If you inline the text into a
message, the content of the text is affected by the encoding of the
message.


Gambit

OK, i've posted several files to the .attachments group, the title is:
"For Remy/Gambit ---- Non ASCII files..." can you have a look at them...?

Remy, which editor would you prefer to use to
view the original format..? i've tried Notepad, EditPlus, UltraEdit, but
instead of the binary chars i've seen only "???" and other strange symbols..



Oren



Remy Lebeau (TeamB)
Guest





Posted: Thu Aug 11, 2005 3:14 am    Post subject: Re: Function for checking binary content of a file FAILS...




"Oren Halvani" <NoSpam (AT) fdtd (DOT) com> wrote


Quote:
how should i check for multibyte chars...?

You have to know the encoding scheme of the file ahead of time so that you
can decode its contents appropriately. Characters in one encoding scheme
have different meanings than characters in a different encoding scheme.

Unless the file in question specifies its own encoding at the beginning of
the content, or there is some other means of knowing what encoding
is being used (email and HTTP, for example, have headers that appear prior
to the actual data which describe what the data is), there is no standard
way to programmatically determine a file's encoding scheme. XML, for example,
has an "encoding" attribute that can be specified before the actual XML data
is processed. Likewise, HTML has ways of specifying its encoding
scheme as well. In the absence of such declarations, default encoding
schemes are assumed. Some other formats may have equivalent semantics.
Most formats do not, however.
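
To illustrate the XML case, here is a minimal sketch in standard C++11 (not VCL) that pulls the declared encoding out of the first bytes of a file. The name XmlDeclaredEncoding is made up for this example, and the sketch assumes the declaration itself is stored in an ASCII-compatible encoding; a real XML parser handles many more cases.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a VCL or library function): returns the value of
// the encoding pseudo-attribute in an XML declaration, or "" if none is found.
// Assumption: the declaration is readable as ASCII-compatible bytes.
std::string XmlDeclaredEncoding(const std::string& head)
{
    // The declaration has to be the very first thing in the document.
    if (head.compare(0, 5, "<?xml") != 0)
        return "";

    const std::size_t declEnd = head.find("?>");
    if (declEnd == std::string::npos)
        return "";

    const std::string key = "encoding";
    std::size_t pos = head.find(key);
    if (pos == std::string::npos || pos > declEnd)
        return "";
    pos += key.size();

    // Skips blanks, but never runs past the end of the declaration.
    auto skipSpaces = [&](std::size_t p) {
        while (p < declEnd && (head[p] == ' ' || head[p] == '\t' ||
                               head[p] == '\r' || head[p] == '\n'))
            ++p;
        return p;
    };

    pos = skipSpaces(pos);
    if (pos >= declEnd || head[pos] != '=')
        return "";
    pos = skipSpaces(pos + 1);
    if (pos >= declEnd)
        return "";

    // The value may be wrapped in single or double quotes.
    const char quote = head[pos];
    if (quote != '"' && quote != '\'')
        return "";
    const std::size_t close = head.find(quote, pos + 1);
    if (close == std::string::npos || close > declEnd)
        return "";

    return head.substr(pos + 1, close - pos - 1);
}
```

An empty result only means that nothing was declared; it says nothing about which encoding the file really uses.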

Typically, programmers will take a small sampling of the file data from the
beginning of the file and then analyze the bytes to see if they appear to
follow any known patterns that the program can then apply to the rest of the
file. Some encoding schemes actually expect this and have been designed to
accommodate this approach. Many do not, though. And there are a lot of
encoding schemes available for textual data, which makes the job of
determining the encoding scheme much harder for you if there is no readily
available identifier that accompanies the data. In which case, about the
only thing you can do is find a third-party library that already does all of
the hard work for you. You provide it with a buffer of data and it returns
to you its best guess as to which encoding was used for the data.
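
The Unicode byte-order mark is one example of a pattern designed to be sniffed this way. A minimal sketch in standard C++11, assuming the first few bytes of the file have already been read into a buffer; SniffBom is a made-up name, not a VCL or library function:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: names the Unicode encoding signalled by a byte-order
// mark at the start of `data`, or returns "unknown" when no mark is present.
std::string SniffBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    // UTF-32 is tested before UTF-16 because FF FE 00 00 also starts with FF FE.
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32LE";
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32BE";
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";

    // No mark: the file may still be text in any encoding, or binary.
    return "unknown";
}
```

Many text files carry no mark at all, so "unknown" is the common answer and the guessing described above still applies. A UTF-16LE file whose first character after the mark is U+0000 would be misreported as UTF-32LE by this sketch.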

The issue becomes worse when encoding schemes that use single-byte
indexes into lookup tables are involved. These schemes map
values 0-255 into subset areas of the larger Unicode character set. Each
language maps into a different area of Unicode, such that a given byte value in
language ABC can represent a completely different Unicode character than
the same byte value in language CDE. In this situation, knowing the language of the
content ahead of time is essential in order to know which areas of the
Unicode character set are involved.
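
A small sketch of that ambiguity, using three common Windows code pages. Only the slices of each table that are simple offsets are covered; ToCodePoint and the CodePage enum are made up for this example, and real code would use complete tables from the OS or a library. In these three code pages the ASCII half (0-127) is shared, and it is the upper half that differs.

```cpp
#include <cassert>

// Hypothetical selector for the three code pages used in this illustration.
enum CodePage { Windows1251, Windows1252, Windows1255 };

// Hypothetical helper: maps one byte to a Unicode code point under the given
// code page. Returns 0 for byte values this partial sketch does not cover.
unsigned int ToCodePoint(unsigned char b, CodePage cp)
{
    if (b < 0x80)
        return b;  // ASCII range is identical in all three code pages

    switch (cp) {
    case Windows1251:  // Cyrillic letters: 0xC0..0xFF -> U+0410..U+044F
        if (b >= 0xC0)
            return 0x0410u + (b - 0xC0u);
        break;
    case Windows1252:  // Western European: 0xA0..0xFF -> U+00A0..U+00FF
        if (b >= 0xA0)
            return b;
        break;
    case Windows1255:  // Hebrew letters: 0xE0..0xFA -> U+05D0..U+05EA
        if (b >= 0xE0 && b <= 0xFA)
            return 0x05D0u + (b - 0xE0u);
        break;
    }
    return 0;  // not covered by this sketch
}
```

The single byte 0xE0 comes out as Cyrillic U+0430, Latin U+00E0 or Hebrew U+05D0 depending on which table is applied, and nothing in the byte itself says which one was meant.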

Some systems, such as Apache servers, have even gone as far as including the
file's encoding scheme into the filename, for lack of any other place to put
it.

Worst-case scenario, you write your code to recognize several different
encoding schemes, and then just let the user decide which one to use when
processing a file. If the content fails to decode properly, then tell the
user to try a different scheme. Chances are, the user will usually have a
better idea of what is in a file than your program does.

Quote:
thanks for the info, i didn't write this function - so i didn't
know what FileSeek(F, 0, 2) actually does...

You would if you had read the VCL documentation for it.
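
For reference, the third argument of FileSeek is the origin, and 2 means "relative to the end of the file", so FileSeek(F, 0, 2) lands on the end and returns that position, which is the file size. A minimal standard C++ counterpart, assuming an already-open input stream; StreamSizeBySeeking is a made-up name for this sketch:

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper: reports the length of a stream by seeking to its end,
// then restores the previous reading position so later reads are unaffected.
std::streamoff StreamSizeBySeeking(std::istream& in)
{
    assert(in.good());

    const std::istream::pos_type saved = in.tellg();
    in.seekg(0, std::ios::end);  // same idea as FileSeek(F, 0, 2)
    const std::istream::pos_type end = in.tellg();
    in.seekg(saved);             // go back to where reading left off

    return static_cast<std::streamoff>(end);
}
```

Seeking only moves the reading position. It never inspects the content, which is why that line cannot check anything about the file.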

Quote:
OK, so i guess arabic, thai, korean and other eastern languages
are also multibyte..correct?

Yes, they all require Unicode.

Quote:
Remy, how can i check files containing those MB chars...?

Your current approach will not work for multi-byte characters.

Quote:
so what else is possible, if i cannot use BYTE datatype ??

You need to approach the issue from a completely different perspective.
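
One possible shape of such a different perspective, assuming the files are expected to be UTF-8 (that assumption is not stated in this thread): instead of testing each byte against a set of allowed values, check that the bytes form well-formed multi-byte sequences. LooksLikeUtf8 is a made-up name for this standard C++ sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true if the buffer is well-formed UTF-8.
bool LooksLikeUtf8(const unsigned char* s, std::size_t n)
{
    assert(s != nullptr || n == 0);

    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = s[i];
        std::size_t extra;   // continuation bytes that must follow the lead
        unsigned int cp;     // code point being assembled
        unsigned int minCp;  // smallest code point legal for this length

        if (lead < 0x80)                { extra = 0; cp = lead;         minCp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1Fu; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0Fu; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07u; minCp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra)
            return false;   // sequence is cut off by the end of the buffer

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;  // continuation bytes look like 10xxxxxx
            cp = (cp << 6) | (c & 0x3Fu);
        }

        if (cp < minCp)
            return false;   // overlong form of a smaller code point
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;   // outside Unicode, or a UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

Hebrew alef (U+05D0) is stored in UTF-8 as the two bytes D7 90, which pass this check even though both would fail a 7-bit ASCII test. Pure ASCII and an empty buffer pass as well, so a "true" result is a hint that the data could be UTF-8 text, not proof that it is text.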

Quote:
OK, i've posted several files to the .attachments group, the title is:
"For Remy/Gambit ---- Non ASCII files..." can you have a look
at them...?

None of those files contain multi-byte characters at all. If they were
supposed to, then you did not post the original data as-is.

Quote:
i've tried Notepad, EditPlus, UltraEdit, but instead of the binary
chars i've seen only "???" and other strange symbols..

That is because you are trying to view multi-byte content in viewers that do
not support it, or you are using an OS that does not recognize Unicode so
the viewers have to dumb themselves down accordingly. You get ? characters
when Unicode characters are converted to single-byte ASCII characters that
cannot represent the original Unicode characters (for obvious reasons - they
are too big in value).
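
The effect can be reproduced in a few lines of standard C++11. This is only a sketch of the substitution, not the code any of those editors actually run, and NarrowToAscii is a made-up name:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: narrows Unicode code points to 7-bit ASCII and
// substitutes '?' for every code point that does not fit.
std::string NarrowToAscii(const std::vector<unsigned int>& codePoints)
{
    std::string out;
    out.reserve(codePoints.size());
    for (std::size_t i = 0; i < codePoints.size(); ++i) {
        const unsigned int cp = codePoints[i];
        // Anything above 127 has no ASCII equivalent, so it is replaced.
        out += (cp < 0x80) ? static_cast<char>(cp) : '?';
    }
    return out;
}
```

Feeding it 'H', Hebrew alef (U+05D0) and 'i' gives "H?i". The original character is gone after the conversion, so the "?" cannot be turned back into alef later.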


Gambit



Oren Halvani
Guest





Posted: Thu Aug 11, 2005 3:23 am    Post subject: Re: Function for checking binary content of a file FAILS...



thank you very, very much for the detailed information,
i see there is no easy way to solve what i thought was easy :-(

unicode-programming seems to be very complicated, i will need
to study more before implementing multilanguage-support
in my programs...

anyway thanks a lot Remy..



Oren



Andrue Cope [TeamB]
Guest





Posted: Thu Aug 11, 2005 8:16 am    Post subject: Re: Function for checking binary content of a file FAILS...

Oren Halvani wrote:

Quote:
Remy, how can i check files containing those MB chars...?

Oh dear.

I'm not Remy but I do work alongside someone that wrote our text
indexing engine. That engine is designed to recognise 'text like
characteristics' in any datastream and extract and/or index them.

I can tell you now that it isn't easy. You have to implement a lot of
weighting and decision-based logic. Checking the ratio of vowels to
consonants to punctuation. Deciding if a punctuation mark is genuine or just
noise.
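
A toy version of that kind of ratio check, in standard C++. This is not the engine described above (nothing about its internals is given here). The names TallyBytes and LooksLikeEnglishText are made up, the thresholds are invented purely for illustration, and the sketch only understands English text in ASCII.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Hypothetical tally of the byte classes the heuristic looks at.
struct ByteTally {
    std::size_t vowels, consonants, punctuation, whitespace, digits, other;
};

// Counts how many bytes of the buffer fall into each class.
ByteTally TallyBytes(const unsigned char* s, std::size_t n)
{
    ByteTally t = {0, 0, 0, 0, 0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char c = s[i];
        if (c >= 0x80) { ++t.other; continue; }  // not ASCII
        if (std::isalpha(c)) {
            switch (std::tolower(c)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                ++t.vowels; break;
            default:
                ++t.consonants; break;
            }
        }
        else if (std::isspace(c)) ++t.whitespace;
        else if (std::isdigit(c)) ++t.digits;
        else if (std::ispunct(c)) ++t.punctuation;
        else ++t.other;                          // control characters
    }
    return t;
}

// Applies made-up thresholds to the tally; every number here is an assumption.
bool LooksLikeEnglishText(const unsigned char* s, std::size_t n)
{
    if (n == 0)
        return false;
    const ByteTally t = TallyBytes(s, n);
    const std::size_t letters = t.vowels + t.consonants;

    if (t.other * 20 > n) return false;             // over 5% control/high bytes
    if (letters * 2 < n) return false;              // under half are letters
    if (t.punctuation * 4 > letters) return false;  // punctuation-heavy noise
    // Vowels should make up roughly 20% to 60% of the letters.
    return t.vowels * 5 >= letters && t.vowels * 5 <= letters * 3;
}
```

Even this toy rejects legitimate text such as source code, tables of numbers or anything non-English, which gives a feel for how much weighting and tuning a real engine needs.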

--
Andrue Cope [TeamB]
[Bicester, Uk]
http://info.borland.com/newsgroups/guide.html

Oren Halvani
Guest





Posted: Thu Aug 11, 2005 9:32 am    Post subject: Re: Function for checking binary content of a file FAILS...

"Andrue Cope [TeamB]" <no.spam (AT) not (DOT) a.valid.address> wrote in message
news:42fb0941$1 (AT) newsgroups (DOT) borland.com...

Quote:
Oh dear.

I'm not Remy but I do work alongside someone that wrote our text
indexing engine. That engine is designed to recognise 'text like
characteristics' in any datastream and extract and/or index them.

I can tell you now that it isn't easy. You have to implement a lot of
weighting and decision-based logic. Checking the ratio of vowels to
consonants to punctuation. Deciding if a punctuation mark is genuine or just
noise.

--
Andrue Cope [TeamB]

thanks Andrue,

now i realize what a huge task it is to solve this binary-file problem..
thanks for the info..



Oren
