So this does not use the -O option, but should automatically detect the current codepage and display the characters appropriately. Hmm. Sounds good. It would have to bow out and let Unicode do its thing when that is enabled. There should also be an option to disable it. It also should be listed in the unzip -v list when present.
The autodetection will work by default for non-Unicode names. I can also add support for -O to override the autodetected encoding with a specified one. The library can be disabled through the RusXMMS configuration, but I can also add a switch to disable it from the unzip command line; just tell me the switch letter.
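To make the proposed behaviour concrete, here is a minimal sketch of the decision order described above. It is not the RusXMMS algorithm: the function name, the `override` parameter (standing in for a -O value) and the `fallback` code page (standing in for whatever an autodetector would pick) are all assumptions for illustration.

```python
def decode_entry_name(raw, override=None, fallback="cp866"):
    """Decode a raw archive entry name (bytes) to text.

    Hypothetical helper, not part of UnZip or RusXMMS:
      - 'override' plays the role of an encoding given with -O
      - 'fallback' stands in for the result of real autodetection
    """
    if override is not None:
        # An explicit user choice always wins over any guessing.
        return raw.decode(override)
    try:
        # Names that are valid UTF-8 (including plain ASCII) are taken as-is.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise fall back to the legacy code page.
        return raw.decode(fallback)


if __name__ == "__main__":
    print(decode_entry_name("привет".encode("cp866")))
    print(decode_entry_name("тест".encode("cp1251"), override="cp1251"))
```

A real detector would have to choose between several legacy code pages rather than use one fixed fallback; that part is exactly what the library is being proposed for.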
How does one get and install this library on Windows as well as Linux? Is it available for Mac OS X (which it should as that might just use a Unix version). (Given all the other things going on, figure I'd let you do the looking.)
The library is included in some of the major Linux distributions, and builds are available for many others. I have tested building from source on OS X, FreeBSD, and OpenSolaris, with no problems. At the moment there are no Windows builds, but I'll provide them in 1-2 weeks if you accept the idea of trying the RusXMMS patch.
In the long run, it seems everyone should be migrating to zipping tools that include UTF-8 and unzipping tools that can read it. Then all this is not needed.
Unfortunately, I can't agree that it would not be needed. There are a lot of zip files with non-Unicode names and they will circulate forever. I don't even see any way to prevent Windows users from producing more non-Unicode zip files with the whole variety of tools they are using. We have exactly the same situation with MP3 files. ID3 v.2 with Unicode support is already 10 years old, but there are still a lot of broken MP3s with non-Unicode encodings around.
Your solution for handling the file names sounds capable and workable.
Platforms such as Linux have dynamically linked system runtime libraries which are LGPL with explicit authorization for any program to use. Statically binding LGPL code may introduce problems with making precompiled zip/unzip available. The original author(s) can dual licence with the InfoZIP licence and that is what is really required before incorporating directly in the zip/unzip source base.
The z/OS Language Environment (C runtime) already has extensive and fully functional iconv support. It has been the recommended method of doing character translation for many years. Recent z/OS releases have added native support for Unicode, and iconv is part of that solution.
If we need a translation library, I prefer the iconv library as that is becoming mainstreamed as part of many OS and so is likely to be maintained long into the future. But the autodetection capability of RusXMMS is tempting, as long as it's available on all the platforms it's needed on.
Character translation is one of the core issues when adapting an ASCII-centric tool such as zip or unzip to an EBCDIC-centric platform. It is also important for translating between code pages within ASCII or EBCDIC where "extended" characters have different mappings. For EBCDIC, these variant characters include such basic characters as '@', '$', '[' and ']'. This is why we need to be able to control translation on each zip or unzip invocation on the mainframe. This applies to both file names and file contents.
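A small illustration of the variant-character problem, as a sketch only. It uses Python's built-in `cp037` and `cp500` codecs as two example EBCDIC code pages; those choices and the helper names are assumptions for the example, and a real z/OS system may well be using a different code page.

```python
# Two EBCDIC code pages that ship with Python's standard codecs.
# Chosen only for illustration; the host code page on a given system may differ.
CODE_PAGES = ("cp037", "cp500")


def ebcdic_bytes(ch):
    """Return the byte encoding of one character in each example code page."""
    return {cp: ch.encode(cp) for cp in CODE_PAGES}


def recode(data, src, dst):
    """Translate bytes from one code page to another via Unicode."""
    return data.decode(src).encode(dst)


if __name__ == "__main__":
    for ch in "A[]":
        # Letters agree between the two pages; the brackets do not.
        print(repr(ch), {cp: b.hex() for cp, b in ebcdic_bytes(ch).items()})
```

Going through Unicode in `recode` is what keeps '[' a '[' when the source and target code pages disagree about its byte value.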
If used correctly, Unicode should take care of these needs. Depending on wide character support on z/OS, you may need to use a library like iconv to translate to and from UTF-8.
For text data, it makes sense to have an optional per-file attribute that identifies the character set of the file data. I suspect that a separate attribute identifying the line termination (LF, CR LF, NEL, or the previously discussed per-record length) is also required.
With Unicode, a per entry code page attribute should not be needed, at least for migrating archives across platforms. If z/OS needs the code page for other zipping and unzipping functions, then it could be stored in a new z/OS extra field. But once the file is extracted from the archive and restored on the OS, the extra field information goes away unless it is used to restore something else on the OS. You need to be more specific if code page information is needed. To restore the file name, it seems the current platform code page (or a way to restore Unicode directly) and the UTF-8 file name are all that is needed. Note that file names from other platforms might not follow any special conventions, so relying on a $ meaning something is dangerous. You're better off putting such information into an extra field.
As for line termination, the assumption is each platform has a defined line termination. Once that is known, there seems no need to pass it around. Note that the platform the entry was created on is stored already.
The record types issue needs to be handled separately for this platform. I made some suggestions in the other thread, but you guys need to decide how you want to proceed.
> The autodetection will work by default for non-Unicode names. I can also add support for -O to override the autodetected encoding with a specified one. The library can be disabled through the RusXMMS configuration, but I can also add a switch to disable it from the unzip command line; just tell me the switch letter.
The command line code has been updated in UnZip 6.1a to handle long options, so what you got is somewhat obsolete.
Don't worry about that. We can do the option stuff rather quickly. Just focus on adding the capability.
> The library is included in some of the major Linux distributions, and builds are available for many others. I have tested building from source on OS X, FreeBSD, and OpenSolaris, with no problems. At the moment there are no Windows builds, but I'll provide them in 1-2 weeks if you accept the idea of trying the RusXMMS patch.
Where is the source code available? I'm wondering how hard this would be for someone to port to a new platform like z/OS?
> I don't even see any way to prevent Windows users from producing more non-Unicode zip files with the whole variety of tools they are using. We have exactly the same situation with MP3 files. ID3 v.2 with Unicode support is already 10 years old, but there are still a lot of broken MP3s with non-Unicode encodings around.
Yeah, like there's no telling when Windows Explorer would support it, or even when it will support the not-so-new large files standard.
Still, we need to assume this is only temporary and everyone will eventually move to UTF-8 aware tools as it's the best approach (at least I think so).
> Still, we need to assume this is only temporary and everyone will eventually move to UTF-8 aware tools as it's the best approach (at least I think so).
I agree - Unicode is much better. Just the move will take quite a while.
That's the latest released code for UnZip. When we post the UnZip 6.10a beta there should be an announcement in the announcements thread. Still got a bunch of work to do before that goes out. Probably best to wait until it all works and we post it officially.
> If we need a translation library, I prefer the iconv library [...]
Same here.
> Porting to the POSIX-complaint system should be no problem. [...]
I took a (very) quick look at that source kit, and I would not bet that building it on VMS (for example) would be so easy. There does seem to be some iconv stuff in VMS these days, however.
> I took a (very) quick look at that source kit, and I would not bet that building it on VMS (for example) would be so easy. There does seem to be some iconv stuff in VMS these days, however.
It's pretty big, but most of the stuff is optional and can be excluded for some platforms. Basically, it needs LibXML and IConv (both libraries exist for VMS) and includes some string manipulation code; everything else can be stripped out for certain builds.
My thought is we should make sure a solution will work before committing to it. Sounds like we need iconv for any of the solutions. So the question then is can the RusXMMS solution work. How available is LibXML?
I'd like to use the same library on all ports (that implement this). So if we need a stripped-down library for some ports, we should create that and use it for all ports so the implementation more or less works the same on all ports.
While the USS side of z/OS is POSIX compliant, the MVS side is by definition not. Even in USS, it is by default an EBCDIC world and not ASCII, so UTF-8 is relevant for zip archive data and zip/unzip internals but not for a normal user. Unlike on ASCII-based platforms, UTF-8 interfaces don't make a lot of sense when your terminal and files are EBCDIC-based. Remember that even literal characters and strings generated by the z/OS compiler are EBCDIC-based by default.
MVS dataset names are limited to 44 characters, with a very restricted syntax (segments of 1 to 8 characters separated by periods) and a restricted character set (upper case A-Z, $, #, @, and 0-9, with digits not allowed as the first character of a qualifier). And yes, those 3 extra characters are NLS-variant and have to be correctly mapped into the host codepage. In partitioned datasets (PDS or PDSE libraries) the member name is limited to 8 characters with the same character set rules. This is why Josef and I have been talking about zip and unzip name mapping: while data translation may be relatively straightforward and portable, the move between archive member naming and MVS dataset naming can be quite jarring. Right now, it only works well in limited cases. The good news is that if we can come up with a good syntax, other folks may find that mapping useful too.
It is a really, really bad idea to attempt to replace system functions in most cases. The MVS zip and unzip functionality was broken because folks assumed that the underlying OS conformed to their experience with UNIX, Windows or DOS. Even the folks familiar with z/OS USS made unwarranted assumptions. Unfortunately, files and character handling are the areas that are most difficult to map properly to z/OS. By default, the C runtime hides many warts but also limits performance and functionality. To do a decent job of supporting the MVS platform in zip and unzip, we have to use the platform-specific runtime extensions (both OS and C). C'est la vie.
In the case of iconv and Unicode, the z/OS runtime provides all the necessary tables and logic to handle the translation correctly and efficiently, and to support capturing those cases where input characters cannot be successfully mapped to any output character. The iconv functions themselves have a POSIX-compliant interface, but the internals call low-level z/OS functions as required. Having zip and unzip try to replicate that is a waste of limited resources for little practical gain. PKZIP appears to have done a lot of work in supporting z/OS MVS files (as part of justifying their large licence fees). I'll check out their public docs and see what I can glean.
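As a starting point for the name-mapping discussion, the rules as stated in this post can be written down as a checker. This is a sketch under that assumption only: it encodes just the limits listed above (44 characters, qualifiers of 1 to 8 characters, A-Z $ # @ 0-9, no leading digit), and the function names are made up for the example.

```python
import re

# One qualifier: 1 to 8 characters, first one not a digit.
_QUALIFIER = r"[A-Z$#@][A-Z0-9$#@]{0,7}"
# A dataset name: one or more qualifiers separated by periods.
_DSN_RE = re.compile(_QUALIFIER + r"(?:\." + _QUALIFIER + r")*")
_MEMBER_RE = re.compile(_QUALIFIER)


def is_valid_dataset_name(name):
    """Check a name against the dataset rules as described in this thread."""
    return len(name) <= 44 and _DSN_RE.fullmatch(name) is not None


def is_valid_member_name(name):
    """PDS/PDSE member names: up to 8 characters, same character rules."""
    return _MEMBER_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("SYS1.PROCLIB", "USER.DATA.2024", "docs/readme.txt"):
        print(candidate, is_valid_dataset_name(candidate))
```

A typical archive member name such as `docs/readme.txt` fails on several counts at once (lower case, slash, dot placement), which is why some agreed mapping syntax is needed rather than a simple character substitution.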
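To show what "capturing unmappable characters" looks like in practice, here is a rough sketch using Python's codecs as a stand-in for the platform iconv tables. The helper name and the choice of `cp037` as the target code page are assumptions for the example, not anything zip or unzip does today.

```python
def unmappable_chars(text, codec):
    """Return the characters of 'text' that have no encoding in 'codec'."""
    bad = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            # No mapping exists in the target code page for this character.
            bad.append(ch)
    return bad


if __name__ == "__main__":
    name = "price list 5€.txt"
    problems = unmappable_chars(name, "cp037")
    if problems:
        print("cannot map to cp037:", problems)
```

The point is that the tool can report the offending characters and let the user decide, instead of silently substituting something.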
The assumption that each platform has a defined line termination may be justified in some cases, but not all. Any assumption breaks down when the ZIP archive is moved across platforms. Think about the case where an archive is created on one platform, grown on one or more, and delivered to a final platform.
In our case, I could see: z/OS MVS -> z/OS USS -> AIX -> Linux -> Windows
Each of these 5 platforms has a different default line termination: record length prefix, EBCDIC NL, ASCII LF, UTF-8 NEL, CRLF
Both of the z/OS platforms have a native mode that preserves their EBCDIC data. Both z/OS ports have to understand the other's format. Currently one must decide to encode the text files in an archive destined for a native-ASCII platform with the '-a' or '--ascii' option. This translates the file data to ASCII (currently using a simple translate table) and uses the ASCII LF line terminator.
Guessing at the current line terminator for a given file seems error prone, where a per-file flag would remove all doubt.
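Here is roughly what I mean by a per-file flag removing the guesswork, as a sketch only. The flag values and function name are hypothetical (no such attribute exists in the zip format today), it works on already-decoded text, and the record-length-prefix case is left out because it needs the record information rather than a terminator character.

```python
# Hypothetical per-entry flag values mapped to terminators in decoded text.
TERMINATORS = {
    "LF": "\n",
    "CRLF": "\r\n",
    "NEL": "\x85",  # Unicode NEXT LINE (U+0085)
}


def convert_line_endings(text, stored, target):
    """Rewrite line terminators using the flag stored with the entry.

    'stored' is what the creating tool recorded; 'target' is what the
    extracting platform wants. Nothing is guessed from the data itself.
    """
    src = TERMINATORS[stored]
    dst = TERMINATORS[target]
    return dst.join(text.split(src))


if __name__ == "__main__":
    print(repr(convert_line_endings("a\r\nb\r\n", "CRLF", "LF")))
```

With the flag stored per entry, an archive that has been grown on several platforms can carry entries with different terminators and each one still extracts correctly.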
Part of this is to ensure that unzip generates a reasonable message if a properly encoded archive is encountered that can not be decoded (since there is no --ebcdic flag for ASCII-based unzips).
I'll have to check next week to see what happens when a current native-MVS ZIP file is sent to Linux or Windows.