Option -O that allows you to set an encoding for filenames is missing in the latest release.
To test, I made a small zip file in Windows XP whose filenames are encoded in Shift-JIS and tried to open it in Linux in a UTF-8 environment. I have attached the .zip to this post.
The following example will show what happens with and without -O.
Code
$ unzip -l Zip_Test.zip
Archive: Zip_Test.zip
Length Date Time Name
-------- ---- ---- ----
0 07-20-09 11:21 Zip_Test/
37 07-20-09 11:21 Zip_Test/РVЛKГeГLГXГg ГhГLГЕГБГУГg.txt
-------- -------
37 2 files
$ unzip -O shift-jis -l Zip_Test.zip
Archive: Zip_Test.zip
Length Date Time Name
-------- ---- ---- ----
0 07-20-09 11:21 Zip_Test/
37 07-20-09 11:21 Zip_Test/新規テキスト ドキュメント.txt
-------- -------
37 2 files
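For anyone wondering where the garbled name in the first listing comes from: it looks like the raw Shift-JIS bytes shown through a DOS code page. Below is a minimal Python sketch of that effect and of what -O shift-jis effectively undoes. The choice of cp866 as the wrongly assumed code page is my guess from the look of the output, and the function names are made up for illustration; this is not how unzip itself is implemented.

```python
def misdecode(name, stored_as, shown_as):
    """Encode a name the way the archiver stored it, then decode the
    bytes with the wrong code page, as a tool without -O would."""
    return name.encode(stored_as).decode(shown_as)


def redecode(garbled, shown_as, stored_as):
    """Undo the wrong decoding: recover the raw bytes first, then
    decode them with the encoding that was really used."""
    return garbled.encode(shown_as).decode(stored_as)
```

Calling misdecode("新規テキスト ドキュメント.txt", "shift_jis", "cp866") should give a string much like the garbled name above, and redecode() with the same two encodings should bring the original back.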
What version of unzip are you using (unzip -v should list it) and what are you using to create the archive?
The tendency in the zip community is to store UTF-8 in the archive now so that issues of character set conversion (like knowing the names of the from and to character sets) are mostly gone. Currently Zip 3.0 and later support Unicode encoding of paths and UnZip 6.0 and later mostly can handle recreating Unicode paths.
Ah now I get it, I was using the 5.52 Archlinux package and was about to post the package build, but now I see that the -O is from a patch. In the first release of 6.0 the maintainer forgot to include the patch. Now I see the latest version has the patch again. I had not bothered with unzip updates sticking with 5.52, I just updated to try it out and -O works. So thanks, I can enjoy your 6.0 release now. And thanks patch writers.
I have only used 5.52, and its patch unzip-5.52-alt-natspec.patch has the same effect. My archive with Cyrillic file names is extracted and listed correctly without any special "-O" option; with this patch the encoding is chosen automatically.
Briefly going through the (sisyphus) patch it appears to be a version of a similar iconv patch for UnZip that was proposed a while back but the UnZip maintainer rejected. Myself, I don't have any problem with including this feature, but the current industry trend is to move to total UTF-8 paths. We've been a bit more sluggish, trying to maintain backward compatibility with existing archives as we move to that.
That said, it's hard to say if we should include this patch in the main release. Given the UnZip maintainer probably would not accept the patch anyway (since he rejected it before), it might be an uphill battle.
It might be worth putting a pointer on the web site to the patch though. If you all had to pick the primary place to download the patch, which would it be? Also, are there instructions anywhere?
Any updates on this? The lack of being able to unzip Windows archives is really critical for my daily use; I went as far as to port (poorly) the old altlinux patch to 6.0 here: http://bugs.archlinux.org/task/15256
but I'd much rather have someone who actually knows the code implement this functionality. If the "-O charset" patch isn't accepted, does InfoZip have a "recommended" method of working around these non-unicode zip files? Is there something like a converter from legacy zip files to the new UTF-8 zip files? An endless sequence of: inflating: ?-??+- ??+??? -?10+?s_.txt is extremely annoying.
EG// You can apply the patch like this (the altlinux sisyphus patch requires libnatspec):
$ cd unzip60
$ patch -Np1 -i unzip60-alt-iconv-utf8.patch
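On the question of a converter from legacy zip files to UTF-8 zip files: as a stopgap, something along these lines can be done with Python's standard zipfile module. This is only a rough sketch, not an Info-ZIP tool. It relies on zipfile decoding unflagged names as cp437, so the raw bytes can be recovered and decoded again with the encoding you name. The function names are mine, and you still have to know the legacy encoding yourself.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is already UTF-8


def recode_name(info, legacy_encoding):
    """Return the entry name decoded with the legacy encoding, unless
    the entry is already flagged as UTF-8."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename
    # zipfile decoded the raw bytes as cp437; reverse that, then decode properly.
    return info.filename.encode("cp437").decode(legacy_encoding)


def convert_zip(src, dst, legacy_encoding):
    """Copy every entry of src into a new archive dst with recoded names.
    zipfile writes non-ASCII names as UTF-8 and sets the UTF-8 flag."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            new_info = zipfile.ZipInfo(recode_name(info, legacy_encoding),
                                       date_time=info.date_time)
            new_info.compress_type = zipfile.ZIP_DEFLATED
            new_info.external_attr = info.external_attr
            zout.writestr(new_info, zin.read(info))
```

For the test archive from the first post, the call would be convert_zip("Zip_Test.zip", "Zip_Test_utf8.zip", "shift_jis"); the output file name is just an example. Extra fields and other per-entry metadata are not carried over, so treat it as a workaround only.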
Thanks. However, it's the decision of the UnZip maintainer and so far he hasn't accepted adding this capability. Could try again, though.
Another possibility is to add the patch to our site, after looking it over and doing some testing. That assumes there are no issues with us distributing the patch and any required files.
I haven't looked at the license issues on either patch. To use the code it would have to be distributable under the Info-ZIP license. What are the license restrictions on your patch (which I assume inherits the restrictions of the patch you modified)?
Sorry, EG it took a while. I've received a response from the AltLinux maintainer and he says that the license of the patch is identical to the original unzip license. Have you checked with the UnZip maintainer? Thanks.
I would like to ask you about adding a patch to support national character sets in filenames. Without it, it is impossible to read or restore the file names when unpacking such an archive.
When I asked one of the maintainers of package unzip in archlinux bugtrack, he explained to me that you are not interested in adding this patch and without your support it would lead to conflicts with other programs.
> Briefly going through the (sisyphus) patch it appears to be a version of a similar iconv patch for UnZip that was proposed a while back but the UnZip maintainer rejected. Myself, I don't have any problem with including this feature, but the current industry trend is to move to total UTF-8 paths. We've been a bit more sluggish, trying to maintain backward compatibility with existing archives as we move to that.
> That said, it's hard to say if we should include this patch in the main release. Given the UnZip maintainer probably would not accept the patch anyway (since he rejected it before), it might be an uphill battle.
Why do you think that the libnatspec based patch will be rejected too? Can you tell us about the original reason the previous patch was rejected? I tried searching for more information, but I have only found comments made by people other than the maintainer. Currently all I know is that a previous patch existed, and it was rejected. I could not find any specific rationale for not accepting the patch.
It would be very helpful to know what the unzip maintainer thinks about this. (Or at least what he originally stated as the reason)
For us it would also be extremely useful to have such an option or to get unzip to extract CP437-encoded filepaths correctly (as CP437 is the default encoding for DOS/Windows-zipped archives).
Info-ZIP supports many rare and old architectures that very few people use, but it does not support non-ASCII encodings, which are widespread. I think that is not right. And so we have a situation in which even distributions must include a patch that solves the encoding problem.
Things might have changed here since the original posting and it may now be possible to get this patch implemented. No guarantees, but it looks possible.
If there is still interest in this patch, it might get added to UnZip 6.1a, the next public beta in the works. It needs to be specific to UnZip 6.0 and have enough context that the changes can be made to internal beta UnZip 6.1a03, which already has significant changes. Note that we generally make patch changes by hand, doing sanity checks as we go.
Looks like libiconv is under the LGPL, so it might be workable as long as the user is required to get libiconv themselves or somehow the library is already available.
The proposed patch has support for only a few hardcoded encodings. I have a better solution for unzip. It is based on libraries from the RusXMMS project and is multilanguage by design. New languages/encodings can be added using a configuration file without a rebuild. Moreover, you don't need to use the '-O' option; for most languages the correct encoding is autodetected. If there is interest I can port the patch to the latest alpha. http://dside.dyndns.org/darklin/portage/app-arch/unzip/files/unzip-ds-lazyrcc.patch
Any patches to the actual code must be distributable under the Info-ZIP license, which is similar to other open licenses but allows commercial use. Generally any patch that adds restrictions on distribution or use will likely be rejected.
We also generally reject any patches that require configuration tables or other similar things. Once compiled, the code needs to run independent of other files on the system, with the exception being the environment variables.
The thought was to use a library like libiconv that someone else maintains. Internally we discussed creating our own tables, but that is way too much work for us to maintain. It should be as simple as installing a library on your system and linking to it.
I seem to remember some patch out there somewhere for autodetecting character set encodings. That was years ago, though. I know it can be done with some level of success.
I haven't looked yet at the RusXMMS patch, so don't know if that meets the need.
If you all can weigh the various patches out there against the above requirements and make some suggestions, would appreciate that. We got too much going on right now. This needs to be an easy patch to get done.
RusXMMS libraries are under LGPL, so there should be no problem with licensing. The configuration files are optional: there is a predefined configuration (autodetection for some languages and static rules for others) which can be overridden if the config files are provided. Since the job is done by the RusXMMS libraries, the patch is pretty functional and small. The 'iconv' patch makes more changes to the code and brings much less functionality. Btw, the libraries are included at least in Debian/Ubuntu and OpenSuSe, and OpenSuSe has adopted the RusXMMS patch for its unzip package.
So this does not use the -O option, but should automatically detect the current codepage and display the characters appropriately. Hmm. Sounds good. It would have to bow out and let Unicode do its thing when that is enabled. There should also be an option to disable it. It also should be listed in the unzip -v list when present.
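To make the requirements in this post concrete, here is a small Python sketch of one way such selection could behave. It is not what the RusXMMS or natspec code actually does (I have not checked either). It simply picks a DOS code page from the user's locale, lets an explicit override win, and steps aside when the entry is already UTF-8. The table entries, function names and parameters are all illustrative assumptions.

```python
# Hypothetical mapping from locale language to the usual DOS/OEM code page.
OEM_BY_LANGUAGE = {
    "ru": "cp866",
    "ja": "cp932",
    "de": "cp850",
    "fr": "cp850",
    "pl": "cp852",
    "el": "cp737",
    "tr": "cp857",
}


def legacy_encoding_for_locale(locale_name, default="cp437"):
    """Guess the legacy archive encoding from a locale such as 'ru_RU.UTF-8'."""
    language = locale_name.split(".")[0].split("_")[0].lower()
    return OEM_BY_LANGUAGE.get(language, default)


def decode_entry_name(raw, locale_name, is_utf8=False, override=None):
    """Decode a raw entry name. UTF-8 entries bypass the guess entirely,
    and an explicit override (the role of -O) beats the locale guess."""
    if is_utf8:
        return raw.decode("utf-8")
    encoding = override or legacy_encoding_for_locale(locale_name)
    return raw.decode(encoding, errors="replace")
```

A guess based on the locale is obviously wrong whenever the archive came from a machine with a different language, which is why the override matters.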
By the way, last night I think I got the ? issue with filenames fixed when Unicode is enabled. The old check was not using wide characters for the checks.
There are quite a few changes in the UnZip 6.10a beta. You probably should work with that. Probably getting about that time to post it as a public beta, though we need to prepare it for that (like updating the documentation) and we got some things in the works we should finish before it goes out.
How does one get and install this library on Windows as well as Linux? Is it available for Mac OS X (which it should as that might just use a Unix version). (Given all the other things going on, figure I'd let you do the looking.)
Before we get too carried away, if anyone has any other suggestions or gripes, please post them. Our time is limited, so if a chosen solution doesn't work out, our first thought may be to just pull it and move on to other things, and there's quite a few of them pending.
Ed, I'd like to see the latest unzip changes all rolled up, so I have a better basis to know how to comment for z/OS USS and MVS. It will also help me understand the latest zip codebase. The code for this area of support will need to be available in both zip and unzip on z/OS for both USS and MVS.
I'm going to be getting further up to speed on the various Unicode issues and runtime support capabilities on z/OS for my main work project. The intersection between various ASCII and EBCDIC code pages, plus the historic zip/unzip translation tables is messy... adding Unicode is not going to make it any more pleasant. Tactical solutions applied over the years will require additional care to support transparently.
There should be attributes per file that describe the character set(s) used to encode each file, since archives can contain files processed on multiple platforms in separate zip invocations. I'm not sure if that is currently the case.
zip and unzip should always attempt to do the right thing, but user options should support overrides to both the source and target character sets during character translation processing. Al
> I'd like to see the latest unzip changes all rolled up, so I have a better basis to know how to comment for z/OS USS and MVS. It will also help me understand the latest zip codebase. The code for this area of support will need to be available in both zip and unzip on z/OS for both USS and MVS.
Not quite sure what changes you're referring to. Development always continues (at least for now), so there never really is a fully rolled up version of unzip or zip. Just snapshots that are the betas. (Unless we get distracted by other things and nothing happens for awhile.) Part of what we do is integrate changes into the moving targets. That said, it looks like UnZip 6.10a is getting closer to the door. However, Zip 3.1d has some new stuff.
> I'm going to be getting further up to speed on the various Unicode issues and runtime support capabilities on z/OS for my main work project. The intersection between various ASCII and EBCDIC code pages, plus the historic zip/unzip translation tables is messy... adding Unicode is not going to make it any more pleasant. Tactical solutions applied over the years will require additional care to support transparently.
> There should be attributes per file that describe the character set(s) used to encode each file, since archives can contain files processed on multiple platforms in separate zip invocations. I'm not sure if that is currently the case.
> zip and unzip should always attempt to do the right thing, but user options should support overrides to both the source and target character sets during character translation processing.
The current standard, as negotiated about two years ago, is to use the UTF-8 file name as the file name if the archive entry has one. This overrides the standard path field. There are two ways to specify a UTF-8 path, either using an extra field or the standard path field. There is a new flag bit that tells an unzip that the standard path is in UTF-8. If the extra field is used, the standard path usually contains a local code page version of the file name.
If there is no UTF-8 for that entry, then the standard path is supposed to use a standard DOS code page according to the standard. However, in practice the standard path uses the local code page so that zipping and unzipping on the same platform works as expected. The problem with this is moving archives to other platforms messes up the file names. That's why we added the UTF-8 encoding that supports all encodings in one character set.
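The order of preference described above can be sketched in a few lines of Python. The constants are taken from the published AppNote as I understand it: general purpose bit 11 (0x0800) marks a UTF-8 standard path, and extra field 0x7075 is the Info-ZIP Unicode Path field (version byte, CRC-32 of the standard name, then the UTF-8 name). The function names and the fallback parameter are my own; this is an illustration, not the UnZip code.

```python
import struct
import zlib

UTF8_FLAG = 0x0800        # general purpose bit 11
UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field


def unicode_path_from_extra(extra, raw_name):
    """Scan the extra field blocks for a Unicode Path that still matches
    the standard name (checked through its CRC-32)."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(body) < 5:
            continue
        version, name_crc = struct.unpack_from("<BI", body, 0)
        if version == 1 and name_crc == (zlib.crc32(raw_name) & 0xFFFFFFFF):
            return body[5:].decode("utf-8")
    return None


def choose_name(raw_name, flag_bits, extra, fallback_encoding="cp437"):
    """Prefer UTF-8 (flag bit or extra field); otherwise fall back to a
    code page, which is where the guessing discussed in this thread starts."""
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    name = unicode_path_from_extra(extra, raw_name)
    if name is not None:
        return name
    return raw_name.decode(fallback_encoding, errors="replace")
```

The CRC check is there so that a Unicode Path left over from before a rename by an older tool gets ignored instead of trusted.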
Older archives still have the issue of which code page was used to encode the file name. That's in fact the problem trying to be solved in this thread. There have been discussions regarding including the code page encoding, but the bottom line is the UTF-8 field captures the file name without knowing the encoding. Now that libraries like libiconv are available, conversion between code pages is not so bad, but it still seems unnecessary except for supporting older archives. Zip in particular will automatically include a UTF-8 path with the entry if the file name is not plain ASCII.
Extra fields are used to capture file attributes specific to a platform or port.
Most users probably don't know of or prefer not to care about code pages. Also, supporting all possible code pages probably means integrating a library like libiconv into the code, which was an issue until libraries with LGPL licenses became available for many platforms.
PKZip apparently has been thinking of adding a language encoding extra field that might include the code page, but they haven't had a need so it hasn't been done. The AppNote has had a spot reserved for it for awhile though.
Anyway, this stuff can get complicated. Sounds like you'll have some fun.
There seems a couple possibilities on the table, including the iconv patch and the RusXMMS patch. If you all can do some Google searches and post links to appropriate documentation, it might help other readers of this thread.
In the long run, it seems everyone should be migrating to zipping tools that include UTF-8 and unzipping tools that can read it. Then all this is not needed. Currently unzip on Windows has problems restoring file names in other character sets, but this is being worked and could be fixed in the next UnZip 6.1 beta. So anything done here should be seen as "temporary" and could go away later. Ironically, the full support of UTF-8 by UnZip may happen in the same beta that this translation feature gets added to, making the translation feature almost obsolete out of the box.
I guess what we can do is add support for the approach you all select as an unsupported feature. (So you need to agree on something. Go look at the different patches and go to the web sites of the libraries.) This means we add support for the code changes needed to call the library, putting those changes into an #ifdef block, so they won't be included by default. We would not be debugging issues with using the library other than being able to call it and we would not be distributing any library. We would not distribute executables with this code either.
If that sounds acceptable, we might be able to move forward on this, assuming no one else in the Info-ZIP development group has issues with this.
The z/OS Language Environment (C runtime) already has extensive and fully functional iconv support. It has been the recommended method of doing character translation for many years. Recent z/OS releases have added native support for Unicode, and iconv is part of that solution.
Character translation is one of the core issues when adapting ASCII-centric tools such as zip and unzip to an EBCDIC-centric platform. It is also important for translating between code pages within ASCII or EBCDIC where "extended" characters have different mappings. For EBCDIC, these variant characters include such basic characters as '@', '$', '[' and ']'. This is why we need to be able to control translation on each zip or unzip invocation on the mainframe. This is for both file names and file contents.
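For readers who have not met the variant-character problem, a quick way to see it is with the EBCDIC codecs that ship with Python (cp037 and cp500 are used here as two examples; z/OS installations may well use others, such as IBM-1047). This is only a sketch with made-up helper names, not how zip or unzip translate data.

```python
def code_points(chars, codepages):
    """For each character, show the byte value it gets in each code page."""
    return {ch: {cp: ch.encode(cp)[0] for cp in codepages} for ch in chars}


def translate(data, source, target):
    """Translate bytes from one code page to another by way of Unicode,
    which is roughly what an iconv-based conversion does."""
    return data.decode(source).encode(target)
```

Running code_points("[]", ["cp037", "cp500"]) shows the brackets at different byte values in the two pages, so a fixed translate table built for one page silently corrupts data written in the other.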
For text data, it makes sense to have an optional per-file attribute that identifies the character set of the file data. I suspect that a separate attribute identifying the line termination - LF, CR LF, NEL, or the previously discussed per-record length attribute is required.
Since the name for the file within the zip archive can be either plain ASCII or UTF-8, it appears that yet another per-file attribute is mandated.
> So this does not use the -O option, but should automatically detect the current codepage and display the characters appropriately. Hmm. Sounds good. It would have to bow out and let Unicode do its thing when that is enabled. There should also be an option to disable it. It also should be listed in the unzip -v list when present.
The autodetection will work by default for non-Unicode names. I can also add support for -O to override the autodetected encoding with a specified one. It is possible to disable the library through the RusXMMS configuration, but I can also add another switch to disable it from the unzip command line; just tell me the switch letter.
> How does one get and install this library on Windows as well as Linux? Is it available for Mac OS X (which it should as that might just use a Unix version). (Given all the other things going on, figure I'd let you do the looking.)
The library is included in some of the major Linux distributions and there are builds available for many others. I have tested building from sources on OSX, FreeBSD, and OpenSolaris. No problems there. At the moment there are no Windows builds, but I'll provide them in 1-2 weeks if you accept the idea of trying the RusXMMS patch.
> In the long run, it seems everyone should be migrating to zipping tools that include UTF-8 and unzipping tools that can read it. Then all this is not needed.
Unfortunately, I can't agree that it would not be needed. There are a lot of zip files with non-Unicode names and they will circulate forever. I don't even see any way to prevent Windows users from producing more non-Unicode zip files with the whole variety of tools they are using. We have exactly the same situation with MP3 files. ID3 v.2 with Unicode support is already 10 years old, but there are still a lot of broken MP3s with non-Unicode encodings around.
Your solution for handling the file names sounds capable and workable.
Platforms such as Linux have dynamically linked system runtime libraries which are LGPL with explicit authorization for any program to use. Statically binding LGPL code may introduce problems with making precompiled zip/unzip available. The original author(s) can dual licence with the InfoZIP licence and that is what is really required before incorporating directly in the zip/unzip source base.
> The z/OS Language Environment (C runtime) already has extensive and fully functional iconv support. It has been the recommended method of doing character translation for many years. Recent z/OS releases have added native support for Unicode, and iconv is part of that solution.
If we need a translation library, I prefer the iconv library as that is becoming mainstreamed as part of many OS and so is likely to be maintained long into the future. But the autodetection capability of RusXMMS is tempting, as long as it's available on all the platforms it's needed on.
> Character translation is one of the core issues when adapting ASCII-centric tools such as zip and unzip to an EBCDIC-centric platform. It is also important for translating between code pages within ASCII or EBCDIC where "extended" characters have different mappings. For EBCDIC, these variant characters include such basic characters as '@', '$', '[' and ']'. This is why we need to be able to control translation on each zip or unzip invocation on the mainframe. This is for both file names and file contents.
If used correctly, Unicode should take care of these needs. Depending on wide character support on z/OS, you may need to use a library like iconv to translate to and from UTF-8.
> For text data, it makes sense to have an optional per-file attribute that identifies the character set of the file data. I suspect that a separate attribute identifying the line termination - LF, CR LF, NEL, or the previously discussed per-record length attribute is required.
With Unicode, a per entry code page attribute should not be needed, at least for migrating archives across platforms. If z/OS needs the code page for other zipping and unzipping functions, then it could be stored in a new z/OS extra field. But once the file is extracted from the archive and restored on the OS, the extra field information goes away unless it is used to restore something else on the OS. You need to be more specific if code page information is needed. To restore the file name, it seems the current platform code page (or a way to restore Unicode directly) and the UTF-8 file name are all that is needed. Note that file names from other platforms might not follow any special conventions, so relying on a $ meaning something is dangerous. You're better off putting such information into an extra field.
As for line termination, the assumption is each platform has a defined line termination. Once that is known, there seems no need to pass it around. Note that the platform the entry was created on is stored already.
The record types issue needs to be handled separately for this platform. I made some suggestions in the other thread, but you guys need to decide how you want to proceed.
> The autodetection will work by default for non-Unicode names. I can also add support for -O to override the autodetected encoding with a specified one. It is possible to disable the library through the RusXMMS configuration, but I can also add another switch to disable it from the unzip command line; just tell me the switch letter.
The command line code has been updated in UnZip 6.1a to handle long options, so what you got is somewhat obsolete.
Don't worry about that. We can do the option stuff rather quickly. Just focus on adding the capability.
> The library is included in some of the major Linux distributions and there are builds available for many others. I have tested building from sources on OSX, FreeBSD, and OpenSolaris. No problems there. At the moment there are no Windows builds, but I'll provide them in 1-2 weeks if you accept the idea of trying the RusXMMS patch.
Where is the source code available? I'm wondering how hard this would be for someone to port to a new platform like z/OS?
> I don't even see any way to prevent Windows users from producing more non-Unicode zip files with the whole variety of tools they are using. We have exactly the same situation with MP3 files. ID3 v.2 with Unicode support is already 10 years old, but there are still a lot of broken MP3s with non-Unicode encodings around.
Yeah, like there's no telling when Windows Explorer would support it, or even when it will support the not-so-new large files standard.
Still, we need to assume this is only temporary and everyone will eventually move to UTF-8 aware tools as it's the best approach (at least I think so).
I agree - Unicode is much better. Just the move will take quite a while
That's the latest released code for UnZip. When we post the UnZip 6.10a beta there should be an announcement in the announcements thread. Still got a bunch of work to do before that goes out. Probably best to wait until it all works and we post it officially.
> If we need a translation library, I prefer the iconv library [...]
Same here.
> Porting to the POSIX-compliant system should be no problem. [...]
I took a (very) quick look at that source kit, and I would not bet that building it on VMS (for example) would be so easy. There does seem to be some iconv stuff in VMS these days, however.
It's pretty big, but most of the stuff is optional and can be excluded for some platforms. Basically, it needs LibXML and IConv (both libraries exist for VMS) and includes some string manipulation code; everything else can be stripped out for certain builds.
My thought is we should make sure a solution will work before committing to it. Sounds like we need iconv for any of the solutions. So the question then is can the RusXMMS solution work. How available is LibXML?
I'd like to use the same library on all ports (that implement this). So if we need a stripped-down library for some ports, we should create that and use it for all ports so the implementation more or less works the same on all ports.
While the USS side of z/OS is POSIX compliant, the MVS side is by definition not. Even in USS, it is by default an EBCDIC world and not ASCII, so UTF-8 is relevant for zip archive data and zip/unzip internals but not for a normal user. Unlike on ASCII-based platforms, UTF-8 interfaces don't make a lot of sense when your terminal and files are EBCDIC-based. Remember that even literal characters and strings generated by the z/OS compiler are EBCDIC-based by default.
MVS dataset names are limited to 44 characters, with a very restricted syntax (segments of 1 to 8 characters separated by periods) and a restricted character set (upper case A-Z, $, #, @, and 0-9 (not in the 1st char of a qualifier)). And yes, those 3 extra characters are NLS-variant and have to be correctly mapped into the host codepage. In partitioned datasets (PDS, or PDSE libraries) the member name is limited to 8 characters with the same character set rules. This is why Josef and I have been talking about zip and unzip name mapping: while data translation may be relatively straightforward and portable, the mapping between archive member names and MVS dataset names can be quite a jarring transition. Right now, it only works well in limited cases. The good news is that if we can come up with a good syntax, other folks may find that mapping useful too.
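For those not familiar with MVS, the naming rules just listed can be written down as a small checker. This Python sketch encodes only the rules as stated in this post (44 characters overall, qualifiers of 1 to 8 characters separated by periods, A-Z, $, #, @ and digits, no leading digit in a qualifier); the function names are made up and the real system may have further rules.

```python
import re

# One qualifier: 1 to 8 characters, and the first one must not be a digit.
QUALIFIER = re.compile(r"[A-Z$#@][A-Z0-9$#@]{0,7}")


def is_valid_dataset_name(name):
    """Check the overall length limit and every period-separated qualifier."""
    if not name or len(name) > 44:
        return False
    return all(QUALIFIER.fullmatch(q) for q in name.split("."))


def is_valid_member_name(name):
    """PDS/PDSE member names follow the rules of a single qualifier."""
    return QUALIFIER.fullmatch(name) is not None
```

A typical archive path such as Zip_Test/readme.txt fails on several counts at once (lower case, underscore, slash), which is the mapping problem in a nutshell.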
It is a really really bad idea to attempt to replace system functions in most cases. The MVS zip and unzip functionality was broken because folks assumed that the underlying OS conformed to their experience with UNIX, Windows or DOS. Even the folks familiar with z/OS USS made unwarranted assumptions. Unfortunately, files and character handling are those areas that are most difficult to map properly to z/OS. By default, the C runtime hides many warts but also limits performance and functionality. To do a decent job in supporting the MVS platform in zip and unzip, we have to use the platform-specific runtime extensions (both OS and C). C'est la vie.
In the case of iconv and Unicode, the z/OS runtime provides all the necessary tables and logic to handle the translation correctly and efficiently, and to support capturing those cases where input characters can not be successfully mapped to any output character. The iconv functions themselves have a POSIX-compliant interface, but the internals call low-level z/OS functions as required. zip and unzip trying to replicate that is a waste of limited resources for little practical gain. PKZIP appears to have done a lot of work in supporting z/OS MVS files (as part of justifying their large licence fees). I'll check out their public docs and see what I can glean.
The assumption that each platform has a defined line termination may be justified in some cases, but not all. Any assumption breaks down when the ZIP archive is moved across platforms. Think about the case where an archive is created on one platform, grown on one or more, and delivered to a final platform.
In our case, I could see: z/OS MVS -> z/OS USS -> AIX -> Linux -> Windows
Each of these 5 platforms has a different default line termination: record length prefix, EBCDIC NL, ASCII LF, UTF-8 NEL, CRLF
Both of the z/OS platforms have a native mode that preserves their EBCDIC data. Both z/OS ports have to understand the other's format. Currently one must decide to encode the text files in an archive destined for a native-ASCII platform with the '-a' or '--ascii' option. This translates the file data to ASCII (currently using a simple translate table) and uses the ASCII LF line term.
Guessing at the current line terminator for a given file seems error prone, where a per-file flag would remove all doubt.
Part of this is to ensure that unzip generates a reasonable message if a properly encoded archive is encountered that can not be decoded (since there is no --ebcdic flag for ASCII-based unzips).
I'll have to check next week to see what happens when a current native-MVS ZIP file is sent to Linux or Windows.
> The assumption that each platform has a defined line termination may be justified in some cases, but not all.
For good or bad, this is the base assumption in the Zip and UnZip code and the zip standard. If a platform gets more complicated, that needs to be a platform-specific fix.
> Any assumption breaks down when the ZIP archive is moved across platforms. Think about the case where an archive is created on one platform, grown on one or more, and delivered to a final platform.
> In our case, I could see: z/OS MVS -> z/OS USS -> AIX -> Linux -> Windows
> Each of these 5 platforms has a different default line termination: record length prefix, EBCDIC NL, ASCII LF, UTF-8 NEL, CRLF
Each entry has a field that records the platform that entry was created on. The fact that entries were added on different platforms does not matter. All that matters is what platform that entry says it was created on. Line termination is handled on a per entry basis.
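As a rough illustration of per-entry handling, the sketch below picks the line ending from the host-system code recorded for an entry and converts to whatever the extracting platform wants. The host codes (0 for MS-DOS, 3 for UNIX, 10 for Windows NTFS) are the AppNote "version made by" values as far as I know them; the table, the function name and the simple byte replacement are my own assumptions and cover ASCII text only, not the EBCDIC or record-oriented cases raised above.

```python
HOST_MSDOS = 0
HOST_UNIX = 3
HOST_NTFS = 10

# Assumed default line ending for each recorded host system.
LINE_END_BY_HOST = {
    HOST_MSDOS: b"\r\n",
    HOST_UNIX: b"\n",
    HOST_NTFS: b"\r\n",
}


def convert_line_ends(data, created_on, target=b"\n"):
    """Convert text data from the line ending implied by the entry's host
    field to the target ending; unknown hosts are passed through untouched."""
    source = LINE_END_BY_HOST.get(created_on)
    if source is None or source == target:
        return data
    return data.replace(source, target)
```

Because the decision is made per entry, an archive that was grown on several platforms needs no special treatment, as long as each entry's host field is honest about how the data was written.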
Both of the z/OS platforms have a native mode that preserves their EBCDIC data. Both z/OS ports have to understand the other's format. Currently one must decide to encode the text files in an archive destined for a native-ASCII platform with the '-a' or '--ascii' option. This translates the file data to ASCII (currently using a simple translate table) and uses the ASCII LF line terminator.
If an entry has line end translation done (through selecting an option) when it is added to an archive, it might make sense to change the platform designation recorded for that entry. I'm not sure if this is currently done. I would have to check the code, but I don't think it is. So converting Windows line ends to Unix line ends with -ll might also result in the platform recorded for that entry being changed from DOS to Unix. Any Windows extra field information would remain, and I believe UnZip uses that if the entry is unzipped on a Windows platform, regardless of what the platform setting is. Still, this would need some research.
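A rough sketch of that idea, again in Python's zipfile rather than in the Zip source: the line ends are converted while adding the entry and the entry is tagged as Unix at the same time. The helper add_as_unix_text is hypothetical, and this does not claim to show what Zip currently does with -ll.

```python
import io
import zipfile

HOST_UNIX = 3  # "version made by" host code for Unix in the ZIP specification

def add_as_unix_text(zf, name, data):
    """Store data with CRLF converted to LF and tag the entry as Unix."""
    zi = zipfile.ZipInfo(name)
    zi.create_system = HOST_UNIX
    zf.writestr(zi, data.replace(b"\r\n", b"\n"))

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    add_as_unix_text(zf, "notes.txt", b"first\r\nsecond\r\n")

# Read the entry back to confirm the data and the platform tag agree.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("notes.txt")
    stored = zf.read("notes.txt")
```

With the tag and the data changed together, an extractor that trusts the platform field sees LF-terminated text marked as Unix.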
Guessing at the current line terminator for a given file seems error prone, where a per-file flag would remove all doubt.
Sounds like a new extra field for recording line ends is being suggested. It's still not clear, though, what that would provide over making sure the current platform field is set to correctly represent the intended target of the entry.
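For the sake of discussion, such an extra field could look roughly like the sketch below. The block layout (2-byte header ID, 2-byte data size, then data, all little-endian) is the standard extra-field format, but the header ID 0x454C, the code values, and both helper names are invented here and are not registered or used by any tool.

```python
import io
import struct
import zipfile

LINE_END_ID = 0x454C  # made-up, unregistered header ID used only for this sketch
LINE_END_CODES = {"LF": 1, "CRLF": 2, "NEL": 3}  # made-up code values

def pack_line_end(kind):
    """Build one extra-field block: header ID, data size, then one data byte."""
    return struct.pack("<HHB", LINE_END_ID, 1, LINE_END_CODES[kind])

def find_line_end(extra):
    """Walk the extra-field blocks and return the recorded line end, if any."""
    names = {code: kind for kind, code in LINE_END_CODES.items()}
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        if header_id == LINE_END_ID and size == 1 and pos + 5 <= len(extra):
            return names.get(extra[pos + 4])
        pos += 4 + size  # skip blocks that belong to other tools
    return None

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zi = zipfile.ZipInfo("report.txt")
    zi.extra = pack_line_end("CRLF")
    zf.writestr(zi, b"one\r\ntwo\r\n")

with zipfile.ZipFile(buf) as zf:
    recorded = find_line_end(zf.getinfo("report.txt").extra)
```

A tool that does not know the ID simply skips the block, which is also why such a field only helps when both the writer and the reader agree on it.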
Part of this is to ensure that unzip generates a reasonable message if a properly encoded archive is encountered that cannot be decoded (since there is no --ebcdic flag for ASCII-based unzips).
Giving UnZip a way to convert an entry with EBCDIC encoding, maybe as noted in the platform field, to the current platform encoding seems something to work on also.
Another thing to consider is handling archives from other tools. Relying too much on non-standard information can cause trouble. Ideally any tool should be able to do something intelligent with an entry.
Ed,
>Giving UnZip a way to convert an entry with EBCDIC encoding, maybe as noted in the platform field, to the current platform encoding seems something to work on also.
I think it is reasonable for unzip and other tools to simply recognize the EBCDIC encoding.
- Each should be able to fully support displaying or even modifying the archive directory entry. That is in ASCII (and possibly UTF-8) with defined attributes.
- When asked to extract the data, however, the ASCII-based utility should emit a message such as "EBCDIC zip archive member must be extracted on EBCDIC platform". That would be z/OS, CMS, or perhaps I-Series one day.
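As a sketch of that behavior (not UnZip code), the Python below lists an archive normally but refuses to extract a member whose host code names a platform that is normally EBCDIC-based. The host codes are from the "version made by" table in the ZIP specification. Treating the host code as proof that the data is EBCDIC is an assumption of this example, since an entry added with '-a' would already hold ASCII. The function extract_member and the sample entry name are hypothetical.

```python
import io
import zipfile

# Host codes for platforms that are normally EBCDIC-based, per the
# "version made by" table in the ZIP specification.
EBCDIC_HOSTS = {4: "VM/CMS", 11: "MVS (z/OS)", 12: "VSE", 15: "alternate MVS", 18: "OS/400"}

def extract_member(zf, name):
    """Return the member data, or refuse if the entry claims an EBCDIC host."""
    info = zf.getinfo(name)
    if info.create_system in EBCDIC_HOSTS:
        raise ValueError(f"{name}: EBCDIC zip archive member must be extracted on EBCDIC platform")
    return zf.read(name)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zi = zipfile.ZipInfo("PAYROLL.DATA")
    zi.create_system = 11  # pretend the entry was added on MVS
    zf.writestr(zi, "HELLO".encode("cp037"))

with zipfile.ZipFile(buf) as zf:
    listing = zf.namelist()  # directory access is unaffected
    try:
        extract_member(zf, "PAYROLL.DATA")
        message = None
    except ValueError as exc:
        message = str(exc)
```

The directory stays fully usable; only the data extraction path produces the message.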
There are many different EBCDIC code pages, and you would have to replicate all of the logic in the z/OS (or CMS) port and the z/OS (or CMS) C runtime iconv routine and tables to do the right thing.
An alternative is for the ASCII-based platforms to support the --ebcdic flag to request a basic translation of the data using the same translate tables that are currently present in unzip. If a more precise translation is required, then they can always extract on an EBCDIC platform. There could even be a simple check and message issued if non-translatable characters are encountered.
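A minimal sketch of such a basic translation with a check for non-translatable characters follows. It uses Python's built-in cp037 codec as a stand-in for a translate table. That choice is an assumption made only for illustration: it is not the table in unzip, and a given z/OS system may use a different code page, which is the many-code-pages problem mentioned above. The function name ebcdic_to_ascii_text is hypothetical.

```python
def ebcdic_to_ascii_text(data, codec="cp037"):
    """Translate EBCDIC bytes to ASCII text with LF line ends.

    Returns (text, untranslatable), where untranslatable lists the
    (offset, byte) pairs that have no ASCII equivalent.
    """
    # EBCDIC NL (0x15) decodes to U+0085 in cp037; map it to LF.
    decoded = data.decode(codec).replace("\x85", "\n")
    # The codec is single-byte, so character index equals byte offset.
    untranslatable = [(i, data[i]) for i, ch in enumerate(decoded) if ord(ch) > 0x7F]
    text = "".join(ch if ord(ch) <= 0x7F else "?" for ch in decoded)
    return text, untranslatable

sample = "LINE ONE\x85CAF\xe9\x85".encode("cp037")
text, bad = ebcdic_to_ascii_text(sample)
if bad:
    print(f"warning: {len(bad)} byte(s) could not be translated to ASCII")
```

When the list of untranslatable bytes is non-empty, the tool could issue the message and leave a more precise conversion to an EBCDIC platform.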