For us it would also be extremely useful to have such an option, or to get unzip to extract CP437-encoded file paths correctly (as CP437 is the default encoding for DOS/Windows-zipped archives).
Info-ZIP supports many rare and old architectures that very few people use, but it does not support non-ASCII encodings, which are widespread. I think that is not right. And we have a situation in which even distributions must include a patch that solves the encoding problem.
Things might have changed here since the original posting and it may now be possible to get this patch implemented. No guarantees, but it looks possible.
If there is still interest in this patch, it might get added to UnZip 6.1a, the next public beta in the works. It needs to be specific to UnZip 6.0 and have enough context that the changes can be made to internal beta UnZip 6.1a03, which already has significant changes. Note that we generally make patch changes by hand, doing sanity checks as we go.
Looks like libiconv is under the LGPL, so it might be workable as long as the user is required to get libiconv themselves or somehow the library is already available.
The proposed patch supports only a few hardcoded encodings. I have a better solution for unzip. It is based on libraries from the RusXMMS project and is multilanguage by design. New languages/encodings can be added using a configuration file, without a rebuild. Moreover, you don't need to use the '-O' option, since for most languages the correct encoding is autodetected. If there is interest, I can port the patch to the latest alpha. http://dside.dyndns.org/darklin/portage/app-arch/unzip/files/unzip-ds-lazyrcc.patch
Any patches to the actual code must be distributable under the Info-ZIP license, which is similar to other open licenses but allows commercial use. Generally any patch that adds restrictions on distribution or use will likely be rejected.
We also generally reject any patches that require configuration tables or other similar things. Once compiled, the code needs to run independent of other files on the system, with the exception being the environment variables.
The thought was to use a library like libiconv that someone else maintains. Internally we discussed creating our own tables, but that is way too much work for us to maintain. It should be as simple as installing a library on your system and linking to it.
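To make the "install a library and link to it" idea concrete, here is a minimal sketch of converting one file name with the standard iconv interface. The helper name name_to_utf8 is made up for illustration and is not from any of the patches; the charset name "CP437" is an assumption, since the names accepted by iconv_open() vary between iconv implementations.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated file name from the
 * charset `from_cs` (for example "CP437") to UTF-8.
 * Returns 0 on success, -1 on any failure. */
int name_to_utf8(const char *from_cs, const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() wants char **; input bytes are not modified */
    char *outp = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return -1;
    out_left = out_size - 1;    /* keep one byte for the terminating NUL */

    cd = iconv_open("UTF-8", from_cs);
    if (cd == (iconv_t)-1)
        return -1;              /* charset not known to this iconv */

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;              /* invalid input or output buffer too small */

    *outp = '\0';
    return 0;
}
```

On systems where iconv is not part of the C library, this would additionally need to be linked against libiconv.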
I seem to remember some patch out there somewhere for autodetecting character set encodings. That was years ago, though. I know it can be done with some level of success.
I haven't looked at the RusXMMS patch yet, so I don't know if that meets the need.
If you all can weigh the various patches out there against the above requirements and make some suggestions, would appreciate that. We got too much going on right now. This needs to be an easy patch to get done.
RusXMMS libraries are under the LGPL, so there should be no problem with licensing. The configuration files are optional: there is a predefined configuration (autodetection for some languages and static rules for others) which can be overridden if config files are provided. Since the job is done by the RusXMMS libraries, the patch is pretty functional and small. The 'iconv' patch makes more changes to the code and brings much less functionality. By the way, the libraries are included at least in Debian/Ubuntu and openSUSE, and openSUSE has adopted the RusXMMS patch for its unzip package.
So this does not use the -O option, but should automatically detect the current codepage and display the characters appropriately. Hmm. Sounds good. It would have to bow out and let Unicode do its thing when that is enabled. There should also be an option to disable it. It also should be listed in the unzip -v list when present.
By the way, last night I think I got the ? issue with filenames fixed when Unicode is enabled. The old check was not using wide characters for the checks.
There are quite a few changes in the UnZip 6.10a beta. You probably should work with that. Probably getting about that time to post it as a public beta, though we need to prepare it for that (like updating the documentation) and we got some things in the works we should finish before it goes out.
How does one get and install this library on Windows as well as Linux? Is it available for Mac OS X (which it should be, as that might just use a Unix version)? (Given all the other things going on, figure I'd let you do the looking.)
Before we get too carried away, if anyone has any other suggestions or gripes, please post them. Our time is limited, so if a chosen solution doesn't work out, our first thought may be to just pull it and move on to other things, and there's quite a few of them pending.
Ed, I'd like to see the latest unzip changes all rolled up, so I have a better basis to know how to comment for z/OS USS and MVS. It will also help me understand the latest zip codebase. The code for this area of support will need to be available in both zip and unzip on z/OS for both USS and MVS.
I'm going to be getting further up to speed on the various Unicode issues and runtime support capabilities on z/OS for my main work project. The intersection between various ASCII and EBCDIC code pages, plus the historic zip/unzip translation tables is messy... adding Unicode is not going to make it any more pleasant. Tactical solutions applied over the years will require additional care to support transparently.
There should be attributes per file that describe the character set(s) used to encode each file, since archives can contain files processed on multiple platforms in separate zip invocations. I'm not sure if that is currently the case.
zip and unzip should always attempt to do the right thing, but user options should support overrides to both the source and target character sets during character translation processing. Al
Not quite sure what changes you're referring to. Development always continues (at least for now), so there never really is a fully rolled up version of unzip or zip. Just snapshots that are the betas. (Unless we get distracted by other things and nothing happens for a while.) Part of what we do is integrate changes into the moving targets. That said, it looks like UnZip 6.10a is getting closer to the door. However, Zip 3.1d has some new stuff.
The current standard, as negotiated about two years ago, is to use the UTF-8 file name as the file name if the archive entry has one. This overrides the standard path field. There are two ways to specify a UTF-8 path, either using an extra field or the standard path field. There is a new flag bit that tells an unzip that the standard path is in UTF-8. If the extra field is used, the standard path usually contains a local code page version of the file name.
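The selection rule described above could be sketched roughly as follows. This is an illustration only, not UnZip's actual code. The constants reflect my reading of the AppNote (general purpose bit 11 as the UTF-8 flag, and extra field ID 0x7075 for the Info-ZIP Unicode Path field, laid out as a version byte, a CRC-32 of the standard name, then the UTF-8 name). The function and enum names are made up for the example, and the CRC check against the standard path is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

#define FLAG_UTF8_NAME   0x0800u  /* general purpose bit 11: standard path is UTF-8 */
#define EF_UNICODE_PATH  0x7075u  /* Info-ZIP Unicode Path extra field ID */

enum name_source { NAME_STANDARD_LOCAL, NAME_STANDARD_UTF8, NAME_EXTRA_UTF8 };

/* Read a little-endian 16-bit value. */
static unsigned rd16(const uint8_t *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/* Walk the extra field blocks (ID, size, data) looking for `id`.
 * Returns a pointer to the block data and sets *len, or NULL if absent. */
const uint8_t *find_extra(const uint8_t *ef, size_t ef_len, unsigned id, size_t *len)
{
    size_t pos = 0;

    while (pos + 4 <= ef_len) {
        unsigned block_id = rd16(ef + pos);
        size_t block_len = rd16(ef + pos + 2);

        if (block_len > ef_len - pos - 4)
            return NULL;                    /* truncated block */
        if (block_id == id) {
            *len = block_len;
            return ef + pos + 4;
        }
        pos += 4 + block_len;
    }
    return NULL;
}

/* Decide which name to use for an entry. When the extra field wins,
 * *utf8 and *utf8_len describe the UTF-8 name inside `ef`. */
enum name_source pick_name(unsigned flags, const uint8_t *ef, size_t ef_len,
                           const uint8_t **utf8, size_t *utf8_len)
{
    size_t len;
    const uint8_t *data;

    if (flags & FLAG_UTF8_NAME)
        return NAME_STANDARD_UTF8;          /* standard path is already UTF-8 */

    data = find_extra(ef, ef_len, EF_UNICODE_PATH, &len);
    if (data != NULL && len > 5 && data[0] == 1) {
        *utf8 = data + 5;                   /* skip version (1) + CRC-32 (4) */
        *utf8_len = len - 5;
        return NAME_EXTRA_UTF8;
    }
    return NAME_STANDARD_LOCAL;             /* code page unknown: the problem in this thread */
}
```

The last case is the one the translation patches are about: nothing in the entry says which code page the standard path was written in.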
If there is no UTF-8 name for that entry, the standard path is supposed to use a DOS code page according to the standard. In practice, however, the standard path uses the local code page so that zipping and unzipping on the same platform works as expected. The problem with this is that moving archives to other platforms messes up the file names. That's why we added the UTF-8 encoding, which covers the characters of all these code pages in one character set.
Older archives still have the issue of which code page was used to encode the file name. That is in fact the problem this thread is trying to solve. There have been discussions regarding including the code page encoding, but the bottom line is the UTF-8 field captures the file name without knowing the encoding. Now that libraries like libiconv are available, conversion between code pages is not so bad, but it still seems unnecessary except for supporting older archives. Zip in particular will automatically include a UTF-8 path with the entry if the file name is not plain ASCII.
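As a rough illustration of the "not plain ASCII" test mentioned above, a check like the following would be enough to decide whether a UTF-8 path is worth adding. The function name is made up, and treating "plain ASCII" as "every byte below 0x80" is an assumption; the exact rule Zip applies may differ.

```c
/* Hypothetical check: returns 1 if every byte of `name` is 7-bit ASCII,
 * 0 if any byte has the high bit set. */
int is_plain_ascii(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    for (; *p != '\0'; p++) {
        if (*p > 0x7F)
            return 0;   /* needs a UTF-8 path to survive a platform change */
    }
    return 1;
}
```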
Extra fields are used to capture file attributes specific to a platform or port.
Most users probably don't know of or prefer not to care about code pages. Also, supporting all possible code pages probably means integrating a library like libiconv into the code, which was an issue until libraries with LGPL licenses became available for many platforms.
PKZip apparently has been thinking of adding a language encoding extra field that might include the code page, but they haven't had a need so it hasn't been done. The AppNote has had a spot reserved for it for a while though.
Anyway, this stuff can get complicated. Sounds like you'll have some fun.
There seem to be a couple of possibilities on the table, including the iconv patch and the RusXMMS patch. If you all can do some Google searches and post links to appropriate documentation, it might help other readers of this thread.
In the long run, it seems everyone should be migrating to zipping tools that include UTF-8 and unzipping tools that can read it. Then all this is not needed. Currently unzip on Windows has problems restoring file names in other character sets, but this is being worked on and could be fixed in the next UnZip 6.1 beta. So anything done here should be seen as "temporary" and could go away later. Ironically, the full support of UTF-8 by UnZip may happen in the same beta that this translation feature gets added to, making the translation feature almost obsolete out of the box.
I guess what we can do is add support for the approach you all select as an unsupported feature. (So you need to agree on something. Go look at the different patches and go to the web sites of the libraries.) This means we add support for the code changes needed to call the library, putting those changes into an #ifdef block, so they won't be included by default. We would not be debugging issues with using the library other than being able to call it and we would not be distributing any library. We would not distribute executables with this code either.
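For anyone preparing a patch, the #ifdef arrangement described above might look something like this sketch. The macro USE_NAME_ICONV and both function names are invented for illustration; they are not existing UnZip identifiers. With the macro undefined (the default), no library is referenced and names pass through untouched.

```c
#include <stddef.h>
#include <string.h>

/* USE_NAME_ICONV is a hypothetical build macro, off by default. */
#ifdef USE_NAME_ICONV
#include <iconv.h>
#endif

/* Text for a version/feature listing, or NULL when not compiled in. */
const char *name_translation_feature(void)
{
#ifdef USE_NAME_ICONV
    return "USE_NAME_ICONV (file name translation via iconv, unsupported)";
#else
    return NULL;
#endif
}

/* Translate `in` from `charset` to UTF-8 into `out`.
 * Without USE_NAME_ICONV the name is copied through unchanged.
 * Returns 0 on success, -1 on failure. */
int translate_name(const char *charset, const char *in, char *out, size_t out_size)
{
#ifdef USE_NAME_ICONV
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() does not modify the input bytes */
    char *outp = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return -1;
    out_left = out_size - 1;
    cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return -1;
    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;
#else
    size_t n = strlen(in);

    (void)charset;              /* unused in the default build */
    if (n >= out_size)
        return -1;
    memcpy(out, in, n + 1);     /* copy including the terminating NUL */
    return 0;
#endif
}
```

Keeping the library call behind one macro also gives a single place to report the feature in a version listing and a single switch for leaving it out.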
If that sounds acceptable, we might be able to move forward on this, assuming no one else in the Info-ZIP development group has issues with this.
The z/OS Language Environment (C runtime) already has extensive and fully functional iconv support. It has been the recommended method of doing character translation for many years. Recent z/OS releases have added native support for Unicode, and iconv is part of that solution.
Character translation is one of the core issues when adapting ASCII-centric tools such as zip and unzip to an EBCDIC-centric platform. It is also important for translating between code pages within ASCII or EBCDIC where "extended" characters have different mappings. For EBCDIC, these variant characters include such basic characters as '@', '$', '[' and ']'. This is why we need to be able to control translation on each zip or unzip invocation on the mainframe. This is for both file names and file contents.
For text data, it makes sense to have an optional per-file attribute that identifies the character set of the file data. I suspect that a separate attribute identifying the line termination (LF, CR LF, NEL, or the previously discussed per-record length attribute) is also required.
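A small illustration of the variant-character problem, using two common EBCDIC code pages. The byte values are from my recollection of the IBM-037 and IBM-1047 tables and should be checked against the official tables before relying on them; the function is purely illustrative and covers only the four characters mentioned.

```c
/* Return the EBCDIC byte for an ASCII character in code page 37 or 1047,
 * or -1 if the code page or character is not in this tiny table. */
int ebcdic_byte(int code_page, char ch)
{
    if (code_page != 37 && code_page != 1047)
        return -1;

    switch (ch) {
    case '@': return 0x7C;                           /* same in both pages */
    case '$': return 0x5B;                           /* same in both pages */
    case '[': return code_page == 37 ? 0xBA : 0xAD;  /* differs */
    case ']': return code_page == 37 ? 0xBB : 0xBD;  /* differs */
    default:  return -1;
    }
}
```

Here '@' and '$' agree between these two pages but move in some national EBCDIC pages, while the brackets already differ between 037 and 1047, so a single hardcoded translation table cannot be right for every site.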
Since the name for the file within the zip archive can be either plain ASCII or UTF-8, it appears that yet another per-file attribute is mandated.