Unzip 6.0 is missing option -O
csa
July 15, 2010, 9:40pm
Baby Member
Posts: 7
Quoted from EG
So this does not use the -O option, but should automatically detect the current codepage and display the characters appropriately.  Hmm.  Sounds good.  It would have to bow out and let Unicode do its thing when that is enabled.  There should also be an option to disable it.  It also should be listed in the unzip -v list when present.

Autodetection will work by default for non-Unicode names. I can also add support for -O to override the autodetected encoding with a specified one. It is possible to disable the library through the RusXMMS configuration, but I can also add another switch to disable it from the unzip command line; just tell me the switch letter.
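
The override idea can be illustrated outside of UnZip itself. What follows is only a sketch in Python, not the RusXMMS patch or UnZip code. It assumes the member name was first decoded as CP437 (which is what Python's zipfile module does when an entry's UTF-8 flag is not set), and the helper name redecode_legacy_name is made up for this illustration. The caller supplies the encoding, much as a -O option would.

```python
def redecode_legacy_name(name: str, encoding: str) -> str:
    """Re-interpret a member name that was decoded as CP437 by default.

    Hypothetical helper: 'encoding' plays the role of a user-supplied
    override such as the proposed -O option.
    """
    raw = name.encode("cp437")   # recover the original bytes from the archive
    return raw.decode(encoding)  # decode them with the requested code page


# Simulate a name stored in CP866 (Cyrillic DOS) but read back as CP437.
stored = "привет.txt".encode("cp866")
misread = stored.decode("cp437")
fixed = redecode_legacy_name(misread, "cp866")
```

This only makes sense for entries whose UTF-8 flag is not set; names already stored as UTF-8 should be left alone.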

Quoted from EG
How does one get and install this library on Windows as well as Linux?  Is it available for Mac OS X (which it should as that might just use a Unix version).  (Given all the other things going on, figure I'd let you do the looking.)

The library is included in some of the major Linux distributions, and there are builds available for many others. I have tested builds from source on OS X, FreeBSD, and OpenSolaris with no problems. At the moment there are no Windows builds, but I'll provide them in 1-2 weeks if you accept the idea of trying the RusXMMS patch.
Quoted from EG
In the long run, it seems everyone should be migrating to zipping tools
that include UTF-8 and unzipping tools that can read it.  Then all this
is not needed.

Unfortunately, I can't agree that it would not be needed. There are a lot of zip files with non-Unicode names, and they will circulate forever. I don't even see any way to prevent Windows users from producing more non-Unicode zip files with the variety of tools they are using. We have exactly the same situation with MP3 files. ID3v2 with Unicode support is already 10 years old, but there are still a lot of broken MP3s with non-Unicode encodings around.
Quoted from EG
If that sounds acceptable, we might be able to move forward on this,
assuming no one else in the Info-ZIP development group has issues with
this.

For me this sounds fine.
Al Dunsmuir
July 15, 2010, 9:57pm
Info-ZIP Team
Posts: 94
Ed,

Your solution for handling the file names sounds capable and workable.

Platforms such as Linux have dynamically linked system runtime libraries which are LGPL with explicit
authorization for any program to use.  Statically binding LGPL code may introduce problems with making
precompiled zip/unzip available.  The original author(s) can dual licence with the Info-ZIP licence, and that is
what is really required before incorporating the code directly in the zip/unzip source base.

Al
EG
July 15, 2010, 11:43pm
Info-ZIP Team
Posts: 463
Quoted from Al Dunsmuir
The z/OS Language Environment (C runtime) already has extensive and fully functional iconv support.  It has been the recommended method of doing character translation for many years.  Recent z/OS releases have added native support for Unicode, and iconv is part of that solution.

If we need a translation library, I prefer the iconv library as that is becoming mainstreamed as part of many OS and so is likely to be maintained long into the future.  But the autodetection capability of RusXMMS is tempting, as long as it's available on all the platforms it's needed on.

Quoted from Al Dunsmuir
Character translation is one of the core issues when adapting ASCII-centric tools such as zip and unzip to an EBCDIC-centric platform.  It is also important for translating between code pages within ASCII or EBCDIC where "extended" characters have different mappings.  For EBCDIC, these variant characters include such basic characters as '@', '$', '[' and ']'.  This is why we need to be able to control translation on each zip or unzip invocation on the mainframe.  This is for both file names and file contents.

If used correctly, Unicode should take care of these needs.  Depending on wide character support on z/OS, you may need to use a library like iconv to translate to and from UTF-8.
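
The point about variant characters can be seen by comparing two EBCDIC code pages. This is a small Python sketch using Python's built-in codecs rather than iconv; cp037 and cp500 are simply two EBCDIC code pages that Python ships, and the helper name ebcdic_bytes is made up here.

```python
def ebcdic_bytes(text: str, codepages=("cp037", "cp500")) -> dict:
    """Encode the same text in several EBCDIC code pages for comparison."""
    return {cp: text.encode(cp) for cp in codepages}


# The square brackets land on different byte values in the two code pages.
encoded = ebcdic_bytes("[]")
for cp, data in encoded.items():
    print(cp, data.hex())

# Translating through Unicode: EBCDIC (cp037) bytes -> str -> UTF-8 bytes.
ebcdic_name = "A[1]".encode("cp037")
utf8_name = ebcdic_name.decode("cp037").encode("utf-8")
```

Decoding with the wrong code page would not raise an error here; it would silently produce different characters, which is why the code page has to be known or selectable.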

Quoted from Al Dunsmuir
For text data, it makes sense to have an optional per-file attribute that identifies the character set of the file data. I suspect that a separate attribute identifying the line termination - LF, CR LF, NEL, or the previously discussed per-record length attribute is required.

With Unicode, a per entry code page attribute should not be needed, at least for migrating archives across platforms.  If z/OS needs the code page for other zipping and unzipping functions, then it could be stored in a new z/OS extra field.  But once the file is extracted from the archive and restored on the OS, the extra field information goes away unless it is used to restore something else on the OS.  You need to be more specific if code page information is needed.  To restore the file name, it seems the current platform code page (or a way to restore Unicode directly) and the UTF-8 file name are all that is needed.  Note that file names from other platforms might not follow any special conventions, so relying on a $ meaning something is dangerous.  You're better off putting such information into an extra field.

As for line termination, the assumption is each platform has a defined line termination.  Once that is known, there seems no need to pass it around.  Note that the platform the entry was created on is stored already.

The record types issue needs to be handled separately for this platform.  I made some suggestions in the other thread, but you guys need to decide how you want to proceed.

Quoted from Al Dunsmuir
Since the name for the file within the zip archive can be either plain ASCII or UTF-8, it appears that yet another per-file attribute is mandated.

There's already a flag for that.
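
For reference, the flag in question is bit 11 (0x0800) of the general purpose bit flag stored with each entry, which marks the name and comment as UTF-8. One way to look at it, sketched in Python with the standard zipfile module (the helper name utf8_flags and the sample file names are illustrative only):

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x0800  # general purpose bit 11: name/comment are UTF-8


def utf8_flags(archive) -> dict:
    """Map each member name to True if its UTF-8 name flag is set."""
    with zipfile.ZipFile(archive) as zf:
        return {zi.filename: bool(zi.flag_bits & UTF8_NAME_FLAG)
                for zi in zf.infolist()}


# Build a tiny archive in memory; zipfile sets the flag only when the
# name cannot be stored as plain ASCII.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", "ascii name")
    zf.writestr("файл.txt", "non-ascii name")
buf.seek(0)
flags = utf8_flags(buf)
```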

Yeah, it does get complicated keeping track of all this stuff.
EG
July 16, 2010, 12:04am
Info-ZIP Team
Posts: 463
Quoted from csa
Autodetection will work by default for non-Unicode names. I can also add support for -O to override the autodetected encoding with a specified one. It is possible to disable the library through the RusXMMS configuration, but I can also add another switch to disable it from the unzip command line; just tell me the switch letter.

The command line code has been updated in UnZip 6.1a to handle long options, so what you got is somewhat obsolete.

Don't worry about that.  We can do the option stuff rather quickly.  Just focus on adding the capability.

Quoted from csa
The library is included in some of the major Linux distributions, and there are builds available for many others. I have tested builds from source on OS X, FreeBSD, and OpenSolaris with no problems. At the moment there are no Windows builds, but I'll provide them in 1-2 weeks if you accept the idea of trying the RusXMMS patch.

Where is the source code available?  I'm wondering how hard this would be for someone to port to a new platform like z/OS?

Quoted from csa
Unfortunately, I can't agree that it would not be needed. There are a lot of zip files with non-Unicode names, and they will circulate forever.

True, but eventually most of those can be converted over if the tools all support UTF-8.

Quoted from csa
I don't even see any way to prevent Windows users from producing more non-Unicode zip files with the variety of tools they are using. We have exactly the same situation with MP3 files. ID3v2 with Unicode support is already 10 years old, but there are still a lot of broken MP3s with non-Unicode encodings around.

Yeah, like there's no telling when Windows Explorer would support it, or even when it will support the not-so-new large files standard.

Still, we need to assume this is only temporary and everyone will eventually move to UTF-8 aware tools as it's the best approach (at least I think so).
csa
July 16, 2010, 12:43am
Baby Member
Posts: 7
Quoted from EG
Don't worry about that.  We can do the option stuff rather quickly.  Just focus on adding the capability.

OK. I'll update the patch to the latest version over the weekend.

Quoted from EG
Where is the source code available?  I'm wondering how hard this would be for someone to port to a new platform like z/OS?

Porting to a POSIX-compliant system should be no problem. The sources and binaries are available from http://RusXMMS.sf.net
The latest version is here: http://dside.dyndns.org/files/rusxmms/librcc-latest.tar.bz2
Quoted from EG
Still, we need to assume this is only temporary and everyone will eventually move to UTF-8 aware tools as it's the best approach (at least I think so).

I agree - Unicode is much better. The move will just take quite a while.
EG
July 16, 2010, 4:16am
Info-ZIP Team
Posts: 463
So what's the story on the libnatspec patch above?  Is this another to be considered?  Any other potential solutions out there?
csa
July 16, 2010, 5:22am
Baby Member
Posts: 7
Where can I get the alpha of 6.1? I can find only 6.0 on sf.net and FTP.
EG
July 16, 2010, 6:05am
Info-ZIP Team
Posts: 463
That's the latest released code for UnZip.  When we post the UnZip 6.10a beta there should be an announcement in the announcements thread.  Still got a bunch of work to do before that goes out.  Probably best to wait until it all works and we post it officially.
csa
July 16, 2010, 6:45am
Baby Member
Posts: 7
So then, would you prefer a patch against 6.1b when it's out? Or should I prepare it against the latest available release as well?
sms
July 16, 2010, 2:34pm
Info-ZIP Team
Posts: 463
> If we need a translation library, I prefer the iconv library [...]

   Same here.

> Porting to a POSIX-compliant system should be no problem. [...]

   I took a (very) quick look at that source kit, and I would not bet
that building it on VMS (for example) would be so easy.  There does seem
to be some iconv stuff in VMS these days, however.
csa
July 16, 2010, 2:48pm
Baby Member
Posts: 7
Quoted from sms
I took a (very) quick look at that source kit, and I would not bet
that building it on VMS (for example) would be so easy.  There does seem
to be some iconv stuff in VMS these days, however.

It's pretty big, but most of the stuff is optional and can be excluded for some platforms. Basically, it needs LibXML and IConv (both libraries exist for VMS) and includes some string manipulation code; everything else can be stripped out for certain builds.
EG
July 16, 2010, 5:40pm
Info-ZIP Team
Posts: 463
My thought is we should make sure a solution will work before committing to it.  Sounds like we need iconv for any of the solutions.  So the question then is can the RusXMMS solution work.  How available is LibXML?

I'd like to use the same library on all ports (that implement this).  So if we need a stripped-down library for some ports, we should create that and use it for all ports so the implementation more or less works the same on all ports.
EG
July 16, 2010, 5:42pm
Info-ZIP Team
Posts: 463
Quoted from csa
So then, would you prefer a patch against 6.1b when it's out? Or should I prepare it against the latest available release as well?

Either work against UnZip 6.00 or wait until UnZip 6.10a goes out.  I'm guessing the latter might happen in a couple weeks.
Al Dunsmuir
July 18, 2010, 3:25pm
Info-ZIP Team
Posts: 94
While the USS side of z/OS is POSIX compliant, the MVS side is by definition not.  Even in USS, it is by default an EBCDIC world and not ASCII, so UTF-8 is relevant for zip archive data and zip/unzip internals but not for a normal user.  Unlike on ASCII-based platforms, UTF-8 interfaces don't make a lot of sense when your terminal and files are EBCDIC-based.  Remember that even literal characters and strings generated by the z/OS compiler are EBCDIC-based by default.

MVS dataset names are limited to 44 characters, with a very restricted syntax (segments of 1 to 8 characters separated by periods) and a restricted character set (upper case A-Z, $, #, @, and 0-9 (not in the 1st char of a qualifier)).  And yes, those 3 extra characters are NLS-variant and have to be correctly mapped into the host codepage.  In partitioned datasets (PDS or PDSE libraries) the member name is limited to 8 characters with the same character set rules.  This is why Josef and I have been talking about zip and unzip name mapping - while data translation may be relatively straightforward and portable, the transition between archive member naming and MVS dataset naming can be quite jarring.  Right now, it only works well in limited cases.  The good news is that if we can come up with a good syntax, other folks may find that mapping useful too.
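
The naming rules just described are simple enough to check mechanically. Below is a minimal sketch in Python that encodes only the rules as stated in this post (44 characters overall, qualifiers of 1 to 8 characters separated by periods, upper case A-Z, $, #, @ and digits, no digit in the first position of a qualifier). The function name is_valid_dsname is made up for this illustration, and real systems may apply further rules not covered here.

```python
import re

# One qualifier: 1 to 8 characters, first character not a digit.
_QUALIFIER = r"[A-Z$#@][A-Z0-9$#@]{0,7}"
# A full name: one or more qualifiers separated by periods.
_DSNAME = re.compile(_QUALIFIER + r"(?:\." + _QUALIFIER + r")*")


def is_valid_dsname(name: str) -> bool:
    """Check a name against the dataset naming rules described above."""
    return len(name) <= 44 and _DSNAME.fullmatch(name) is not None


for candidate in ("SYS1.PROCLIB", "1ABC.DATA", "docs/readme.txt"):
    print(candidate, is_valid_dsname(candidate))
```

A typical archive member path such as docs/readme.txt fails on several counts at once (lower case, a slash, no room for long components), which is the mapping problem in a nutshell.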

It is a really, really bad idea to attempt to replace system functions in most cases.  The MVS zip and unzip functionality was broken because folks assumed that the underlying OS conformed to their experience with UNIX, Windows or DOS.  Even the folks familiar with z/OS USS made unwarranted assumptions.  Unfortunately, file and character handling are the areas that are most difficult to map properly to z/OS.  By default, the C runtime hides many warts but also limits performance and functionality.  To do a decent job of supporting the MVS platform in zip and unzip, we have to use the platform-specific runtime extensions (both OS and C).  C'est la vie.

In the case of iconv and Unicode, the z/OS runtime provides all the necessary tables and logic to handle the translation correctly and efficiently, and to support capturing those cases where input characters cannot be successfully mapped to any output character.  The iconv functions themselves have a POSIX-compliant interface, but the internals call low-level z/OS functions as required.  Having zip and unzip try to replicate that is a waste of limited resources for little practical gain.
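
To make the "cannot be mapped" case concrete, here is a rough Python sketch. It uses Python's own codecs, not iconv or the z/OS runtime, and cp037 stands in for whatever the host EBCDIC code page happens to be. The helper name unmappable_chars is made up for this illustration.

```python
def unmappable_chars(text: str, codepage: str) -> list:
    """Return the characters of 'text' that have no mapping in 'codepage'."""
    bad = []
    for ch in text:
        try:
            ch.encode(codepage)          # strict mode raises on no mapping
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# An ASCII-only name maps cleanly; a Cyrillic one does not fit in cp037.
clean = unmappable_chars("REPORT.TXT", "cp037")
problem = unmappable_chars("файл", "cp037")
```

A tool could use such a list to warn the user or to pick a substitute name instead of silently writing a damaged one.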
PKZIP appears to have done a lot of work in supporting z/OS MVS files (as part of justifying their large licence fees).  I'll check out their public docs and see what I can glean.
Al Dunsmuir
July 18, 2010, 4:21pm
Info-ZIP Team
Posts: 94
The assumption that each platform has a defined line termination may be justified in some cases, but not all.
Any assumption breaks down when the ZIP archive is moved across platforms.  Think about the case where
an archive is created on one platform, grown on one or more, and delivered to a final platform. 

In our case, I could see:
       z/OS MVS ->  z/OS USS -> AIX -> Linux -> Windows

Each of these 5 platforms has a different default line termination:
       record length prefix,  EBCDIC NL, ASCII LF, UTF-8 NEL, CRLF


Both of the z/OS platforms have a native mode that preserves their EBCDIC data.  Both z/OS
ports have to understand the other's format.  Currently one must decide to encode the text files
in an archive destined for a native-ASCII platform with the '-a' or '--ascii' option.  This translates
the file data to ASCII (currently using a simple translate table) and uses the ASCII LF line term.

Guessing at the current line terminator for a given file seems error prone, where a per-file flag
would remove all doubt. 
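
As an illustration of how a declared terminator removes the guesswork, here is a small Python sketch. The LineTerm enum is a hypothetical stand-in for a per-file attribute; no such field exists in the zip format as discussed here, and the record-length-prefix case is left out because it is not a character terminator at all.

```python
from enum import Enum


class LineTerm(Enum):
    """Hypothetical per-file line-termination attribute."""
    LF = "\n"
    CRLF = "\r\n"
    NEL = "\u0085"


def convert_line_term(text: str, source: LineTerm, target: LineTerm) -> str:
    """Rewrite line terminators using the declared source convention."""
    return target.value.join(text.split(source.value))


windows_text = "first\r\nsecond\r\n"
unix_text = convert_line_term(windows_text, LineTerm.CRLF, LineTerm.LF)
```

Because the source convention is stated rather than inferred, a stray CR or NEL inside the data is left untouched instead of being mistaken for a line break.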

Part of this is to ensure that unzip generates a reasonable message if a properly encoded archive
is encountered that can not be decoded (since there is no --ebcdic flag for ASCII-based unzips). 

I'll have to check next week to see what happens when a current native-MVS ZIP file is sent to
Linux or Windows. 

Al