Extending the ZIP (and UNZIP) command line
Al Dunsmuir
July 21, 2010, 11:06pm
Info-ZIP Team
Posts: 94
Folks,
One of the areas that tends to really make the z/OS MVS ports of zip and unzip a challenge to use is the command line.

A lot of MVS production processing is done in batch using JCL. Arguments are passed to the program via a PARM='...' keyword, and the parameter string is limited to a mere 100 characters. Invocation in the foreground using TSO (or via REXX) may raise this limit, but it is still nowhere near what one is used to in a UNIX environment (such as z/OS USS).

One can save a lot of space with our port by using files allocated to short DD names, and specifying these in the command line, but it only helps so much.

There are also issues with the parameter string tending to be upper-cased if one is not careful. Part of that is because some parts of z/OS draw heavily on a mid-1960s OS/360 heritage (with lots and lots of iterative change along the way).

Support is going into Zip 3.1 for -@filename as a way to specify that the list of files to be zipped is contained as text within the given file. We'd like to come up with an extension to that syntax to support mapping the file name to the desired archive name. I'm not sure whether the new unzip supports an equivalent syntax yet for controlling the mapping of archive name to extracted name, but I would suggest we need it. The details of both of these should really be discussed in another new topic.

My suggestion to relieve the z/OS MVS command line limit is to extend this idea and have --@filename be a way to have the command processor transparently open the specified file and parse the text inside as if it appeared in the command line at that point. Basically it would act as a command line #include. We should support multiple --@file specifications, but it would be OK to allow only one level of nesting (which eliminates the need for checks against accidental recursion).

The z/OS C/C++ compiler supports something similar for the same reasons - it has an OPTFILE(filename) option.

This would seem to be a generally useful extension, and not in conflict with the existing zip/unzip command line syntax.
Al
EG
July 22, 2010, 5:03am
Info-ZIP Team
Posts: 463
The new command line parser was originally designed to support @filename (not -@filename, as -@ reads names from stdin) as an instruction to open the file filename and insert its contents into the command line at that point.  If the file contained a list of names, those names would be inserted at that point.  Options can also be inserted the same way.  The contents are added and then the resulting command line is parsed.  Zip provides the -sc option to show the final command line instead of executing it, as an aid to debugging.  The @filename feature supports recursion, allowing filename to itself include @filename2.  I forget the default recursion depth allowed.  That was all tested when the parser was written.

That code is lying around and is planned to be added back to Zip 3.1.  The hooks are already there so it should be fairly easy.

UnZip 6.1a has been switched over to the new command parser code.  That's probably enough for that beta.  In a later beta the @filename code can be added back to that version of the parser.
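
For readers who want to see the idea in miniature, here is a rough Python sketch of @filename expansion with a recursion limit. It is not the Zip parser code: the function name expand_args and the MAX_DEPTH value are hypothetical, and the tokenizing rules (shell-style quoting via shlex) are an assumption made for illustration only.

```python
import shlex
from pathlib import Path

# Hypothetical limit; the thread does not state Zip's actual default depth.
MAX_DEPTH = 4


def expand_args(args, depth=0):
    """Return args with every @filename token replaced by that file's tokens."""
    expanded = []
    for arg in args:
        # Only "@name" is an include; a bare "@" or "-@" is left alone.
        if arg.startswith("@") and len(arg) > 1:
            if depth >= MAX_DEPTH:
                raise RecursionError(
                    f"argument files nested deeper than {MAX_DEPTH}: {arg}"
                )
            text = Path(arg[1:]).read_text()
            # shlex.split honours quotes, so names containing spaces survive.
            expanded.extend(expand_args(shlex.split(text), depth + 1))
        else:
            expanded.append(arg)
    return expanded
```

The depth counter is what guards against a file that accidentally includes itself; with a one-level-only rule the same check would simply use a limit of 1.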
Al Dunsmuir
July 22, 2010, 9:21am
Info-ZIP Team
Posts: 94
Quoted from EG
The new command line parser was originally designed to support @filename (not -@filename, as -@ reads names from stdin) as an instruction to open the file filename and insert its contents into the command line at that point.  If the file contained a list of names, those names would be inserted at that point.  Options can also be inserted the same way.  The contents are added and then the resulting command line is parsed.  Zip provides the -sc option to show the final command line instead of executing it, as an aid to debugging.  The @filename feature supports recursion, allowing filename to itself include @filename2.  I forget the default recursion depth allowed.  That was all tested when the parser was written.

As far as the topic of this thread goes, that means it should be fully addressed by the time the z/OS (and VM/CMS) related changes are incorporated into the source base and make it out in the next betas that you ship.

As I had mentioned elsewhere, we need extensions to the zip and unzip syntax to support an add "file as{...}" syntax and an extract "file as{...}" syntax. These extensions would be processed as a modification to the previous token. This in turn may add a wee bit of complexity to the parser, but I think it would be worth it for all platforms. Creating another pair of topics to discuss the two mappings seems to be next on the list (even if the discussion for those topics can wait a bit, as they are not the highest priorities).
Quoted from EG

That code is lying around and is planned to be added back to Zip 3.1.  The hooks are already there so it should be fairly easy.

UnZip 6.1a has been switched over to the new command parser code.  That's probably enough for that beta.  In a later beta the @filename code can be added back to that version of the parser.

Seems to me that it needs to be added sooner rather than later.
EG
July 22, 2010, 3:21pm
Info-ZIP Team
Posts: 463
The "insert as" and "extract as" functions would probably not involve the parser, as it has enough to do just parsing the command line.  Though the renaming commands need to be on the command line somewhere, that functionality will probably be implemented when the file list is scanned or processed (depending on insert or extract) later on.  We've given that some thought as that functionality is already on the ToDo lists of Zip and UnZip.  If you have any ideas how you prefer the renaming commands be specified on the command line, that's one of the issues we've been trying to resolve.  Probably wild cards need to be supported.

There is also the task of converting the parser to wide characters so that matching can be done directly in Unicode on most all ports.  Currently only those ports with UTF-8 command lines support direct matching of command line arguments to the Unicode paths in an archive.  That works for Unix, but Windows uses a wide character command line function to get Unicode from the user, which we haven't had time to tackle yet.

There's only so many hours in the day (or night).  You might have heard it's better to give code than to receive.
Al Dunsmuir
July 22, 2010, 4:45pm
Info-ZIP Team
Posts: 94
Quoted from EG
The "insert as" and "extract as" functions would probably not involve the parser, as it has enough to do just parsing the command line.  Though the renaming commands need to be on the command line somewhere, that functionality will probably be implemented when the file list is scanned or processed (depending on insert or extract) later on.

We've given that some thought as that functionality is already on the ToDo lists of Zip and UnZip.  If you have any ideas how you prefer the renaming commands be specified on the command line, that's one of the issues we've been trying to resolve.  Probably wild cards need to be supported.

OK - that keeps it simple. 

At a very high level, the basic idea was:

  • For zip encoding, follow each file name or directory specification with an optional output "insert as" pattern.  

    For example,   

        zip  output.zip file1 as{file1 encoded name} file2 as{file2 encoded name} ...

    The real-life example would be where we had a Universal command ucopy function where we wanted to save the stdout and stderr to files, and encode them with literal file names. Then we could attach the zip file to an email as a MIME-encoded attachment, all neat and tidy.

        zip ucopylogs.zip DD:UCOUT as{ucopy.stdout} DD:UCERR as{ucopy.stderr}

    The thought was that reserving a standard "as{" prefix is not going to impact any real-life file names.

    It gets more complicated when one wants to pass some or all of the original name as part of the encoded name pattern (base name or directory).

    An example of this would be my real-life need to encode the zip and unzip source, but using a name mapping that is more useful at unzip time. The standard encoding based on the last dataset qualifier and the member name does not work, because the file types for our source control system are in the second-last qualifier, not the last one. In this case, if we had two MVS PDSs (Partitioned Data Sets), one containing C source and the other containing C headers, the invocation might be:

       zip zipsrc.zip -r DD:ZIPC as{zipsrc/*.c} DD:ZIPH as{zipsrc/*.h}

    Basic wild cards ('*' and '?') and literals would be supported for fn.ft, with literals for the optional directory specification (if not specified, the existing generated directory would be used, unless suppressed).

    More elaborate patterns (specifying directory names, user ID, etc.) would be harder to express, but might best be supported by some form of regular expression with symbols for directory name and base name.

    Even having the two forms I described would cover a large portion of the requirement, and be quite portable. We could agree on a character (say %) that would indicate the start of a regex pattern and then implement that in a (much) later release.

  • For unzip decoding inbound, you already have a syntax for using patterns to select files and/or directories.  Using that with the same sort of logic and as{...} escape as above would allow the output name for the selected file to be changed.

    On z/OS, we would often want to direct the output to a DD that provides an indirection to the actual existing file (or standard file specification for files to be created).

    If we keep it simple and try to reuse as much code as possible between zip and unzip in these areas, we should be able to get the basic literal + fn.ft mapping working without too much code (or added complexity).
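
The basic literal + fn.ft mapping described above could be sketched roughly as follows. This is only an illustration in Python, not proposed zip code: the function name map_name is hypothetical, and the semantics (a '*' in the fn or ft position copies that part of the source name, everything else is taken literally, '?' not handled) are an assumption, since the thread had not settled them.

```python
import posixpath


def map_name(source_name, pattern):
    """Apply an 'as' pattern such as 'zipsrc/*.c' to one source file name.

    Assumed semantics: '*' in the fn or ft position copies the matching
    part of the source name; any other text in the pattern is literal.
    """
    directory, leaf = posixpath.split(pattern)
    # Split both the source name and the pattern leaf into fn and ft.
    src_fn, _, src_ft = posixpath.basename(source_name).partition(".")
    pat_fn, dot, pat_ft = leaf.partition(".")
    fn = src_fn if pat_fn == "*" else pat_fn
    ft = src_ft if pat_ft == "*" else pat_ft
    name = fn + (dot + ft if ft else "")
    return posixpath.join(directory, name) if directory else name
```

With this sketch, a PDS member named ZIPUP read through DD:ZIPC with the pattern zipsrc/*.c would be stored as zipsrc/ZIPUP.c, while a fully literal pattern such as ucopy.stdout ignores the source name entirely.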
Quoted from EG
There is also the task of converting the parser to wide characters so that matching can be done directly in Unicode on most all ports.  Currently only those ports with UTF-8 command lines support direct matching of command line arguments to the Unicode paths in an archive.  That works for Unix, but Windows uses a wide character command line function to get Unicode from the user, which we haven't had time to tackle yet.

Environments like Linux that are native UTF-8 are easy. Any other environment would need the args translated into UTF-8.

Since all files (zip and application) are in the native character set, you would need to either keep a parallel copy of the original arg text, or be able to reliably translate it back.

On z/OS and VM/CMS, the program arguments are provided in EBCDIC.  Regardless of the platform, encoding it in UTF-8 would require a conversion, and there would be a need to capture the current runtime character set to get the translation done correctly (otherwise any unusual characters in file names would be corrupted). 

Quoted from EG
There's only so many hours in the day (or night).  You might have heard it's better to give code than to receive.
 
Indeed! I have my main project to keep working away on (around 250 KLOC) and some "fun" work on ARM that I want to find time for. My precocious 9-year-old daughter and significant other also have their demands.
I suspect I'm going to contribute a fair bit of code over time, but since my name is not "Dr Who" I am bound by the laws of space and time, so it is all going to have to come piece by piece [GRIN].

Basic function first, especially that based on code we already have.

Subsequent stuff is driven by user demand. It would be nice to have other contributors on z/OS and CMS, but I suspect that realistically the majority of folks will be providing assistance in requirements/design/testing.
EG
July 23, 2010, 5:16am
Info-ZIP Team
Posts: 463
The as{ prefix probably is not generic enough.  I've seen filenames that are close to that on Windows and Unix.  We've played with different alternatives.  One question is if the mapping is to be done file by file or if groups of files are to be remapped using wildcards.  Probably both need to be possible.  One possibility is to use a new option to define the mappings.  Maybe
--remap-name oldname1 newname1 oldname2 newname2 ...
If --remap-name is set as taking a list value (as -i and -x do), then the command parser knows how to handle this.  Then the main code would need to check if there are an even number of values.  After that, it's going through the file list and doing the conversions.  Now another question is what to do about Unicode.  The default action is to read anything from the command line in the current character set, so all the names might be translated to UTF-8 before conversion, or converted in the local character set.  But that leads to the next problem, that any names in the archive not in the local character set are hard to match from the command line (unless the Unicode character number escapes are used).
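
The pairing step is simple enough to sketch. The following Python is only an illustration of the even-count check and the lookup over the file list; the function names are hypothetical, matching is exact (no wildcards), and nothing here is actual Zip code.

```python
def parse_remap(values):
    """Pair up the list value of the proposed --remap-name option."""
    if len(values) % 2 != 0:
        raise ValueError(
            "--remap-name needs oldname/newname pairs; got an odd number of values"
        )
    # values[0::2] are the old names, values[1::2] the new names.
    return dict(zip(values[0::2], values[1::2]))


def archive_names(files, mapping):
    """Return the name each file would be stored under in the archive."""
    return [mapping.get(name, name) for name in files]
```

Wildcard support would replace the dictionary lookup with a pattern match, which is where most of the open design questions sit.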

By the way, we are all volunteer and have day jobs and lives.  It's usually hard to find time to do this stuff and that time can dry up at any time.
Al Dunsmuir
July 23, 2010, 11:18am
Info-ZIP Team
Posts: 94
Quoted from EG
The as{ prefix probably is not generic enough.  I've seen filenames that are close to that on Windows and Unix.  We've played with different alternatives.  One question is if the mapping is to be done file by file or if groups of files are to be remapped using wildcards.  Probably both need to be possible.  One possibility is to use a new option to define the mappings.  Maybe
--remap-name oldname1 newname1 oldname2 newname2 ...
If --remap-name is set as taking a list value (as -i and -x do), then the command parser knows how to handle this.  Then the main code would need to check if there are an even number of values.  After that, it's going through the file list and doing the conversions.  Now another question is what to do about Unicode.  The default action is to read anything from the command line in the current character set, so all the names might be translated to UTF-8 before conversion, or converted in the local character set.

It may not be as "pretty" as decorating the original syntax with as{..}, but it does indeed get the job done and fits the existing zip/unzip syntax well.

While there is nothing in the syntax that distinguishes the "from" and "to" names, there are a number of ways that this can be done by simply arranging the zip/unzip command text. 

My previous example using as{...}
    zip  output.zip file1 as{file1 encoded name} file2 as{file2 encoded name} ...
can now be expressed as
   zip  output.zip file1 --remap-name \
       file1     file1_encoded_name  \
       file2     file2_encoded_name  \
            . . . 

Using the @ escape to pull a file into the command line means you can arrange the data in that file in a similar manner to help organize very long lists.

Quoted from EG
But that leads to the next problem, that any names in the archive not in the local character set are hard to match from the command line (unless the Unicode character number escapes are used).

unzip already had that problem with its existing pattern matching and exclusion syntax. Using the standard Unicode escapes is the right thing to do in the general case for all zip and unzip command processing, since we intend to encode internally as UTF-8.

Note that for zip, it is unlikely to be an issue for the "from" names, as z/OS MVS and VM/CMS native file names have a very restricted syntax.

For z/OS USS and VM/CMS BFS and SFS file names, specifying a NLS "from" character is likely to be handled by the EBCDIC command line to UTF-8 internal mapping, as long as the current execution character set matches that used to generate the file names and can be determined easily. 

To handle cases where the current command character set cannot be determined, or if there is a need to alter it so that the zip input ("from") file names can be specified in a character set that matches the referenced file names, there needs to be one more assist. I'd suggest something like --using-charset. This would only be allowed between other zip options; it could not be specified between a zip option and the arguments for that option.

As mentioned before, the command token processing that does the conversion into internal UTF-8 needs to record the original character set of each command token, so that the process can be reversed and values such as file names correctly converted back to their original character set when you actually need to open that file. With that in place it works out nicely: --using-charset gets consumed by the token->UTF-8 processing, and changes the "current" token character set that is applied to subsequent tokens until further notice.
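
A minimal sketch of that token processing, in Python for brevity. Everything here is an assumption made for illustration: the Token class and function names are hypothetical, --using-charset is only the option proposed in this post, and Python's cp037 and cp500 codecs stand in for real IBM-xxxx code pages. The sketch also assumes the option name and the charset name use only invariant characters, so they can be decoded with the default page.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str     # the argument decoded to Unicode
    charset: str  # code page the raw bytes arrived in


def decode_args(raw_args, default_charset="cp037"):
    """Decode raw argument bytes, honouring the proposed --using-charset."""
    tokens = []
    current = default_charset
    switch = "--using-charset".encode(default_charset)
    it = iter(raw_args)
    for raw in it:
        if raw == switch:
            # The option is consumed here; its value names the new code page.
            current = next(it).decode(default_charset)
            continue
        tokens.append(Token(raw.decode(current), current))
    return tokens


def native_bytes(token):
    """Reverse the conversion, e.g. before handing a name to file I/O."""
    return token.text.encode(token.charset)
```

Because each token remembers its own code page, the same raw bytes can be read as different characters before and after the switch, and both can still be turned back into the exact bytes that were supplied.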
Quoted from EG

By the way, we are all volunteer and have day jobs and lives.  It's usually hard to find time to do this stuff and that time can dry up at any time.

I know that.   Sorry if I can come across as a tad abrupt at times - the last 3 nights I've been burning the candle at both ends trying to keep up with the forum and work, and averaging 4 hours per night sleep. 

Even with management approval to work on the zip/unzip port as part of my tasks, it doesn't mean that other work stuff isn't going to have to take precedence at times (or that I will have enough hours to get everything that needs to be done completed quickly). Given I had to drop out of the forum for 18 months for a large heads-down project, I can definitely say "been there, bought the T-shirt". I'm trying to ensure that I have about 25% of my time allotted on an ongoing basis so that will not happen again.

It's one of the reasons I need to focus on the platforms that are most used at RBC - z/OS (MVS and USS), AIX, Windows, and Linux - so I can continue to remind management of the need for me to spend my time.

By the way, we also use iSeries (AS400, also EBCDIC) and Tandem, but I have no experience on those platforms. I would not mind at all if someone wanted to volunteer to actively work with us on those ports.
EG
July 27, 2010, 4:04am
Info-ZIP Team
Posts: 463
Quoted from Al Dunsmuir

It may not be as "pretty" as decorating the original syntax with as{..}, but it does indeed get the job done and fits the existing zip/unzip syntax well.

I liked it.  Been trying to come up with something for years and that just came to me.

Quoted from Al Dunsmuir
unzip already had that problem with its existing pattern matching and exclusion syntax.   Using the standard Unicode escapes is the right thing to do in the general case for all zip and unzip command processing, since we intend to encode internally as UTF-8.

By Unicode escapes I refer to the #Uxxxx and #Lxxxxxx escapes currently used by Zip and UnZip, where x is a hex digit.  These can be used on the command line and are converted to the Unicode characters internally if UNICODE_SUPPORT is enabled.
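
As an illustration of the escape format only (four hex digits after #U, six after #L), here is a small Python sketch that converts between escaped and unescaped names. It is not the Zip/UnZip implementation; the function names are hypothetical, and the choices to emit lowercase hex and to escape every non-ASCII character are assumptions.

```python
import re

# #U followed by 4 hex digits, or #L followed by 6 hex digits.
_ESCAPE = re.compile(r"#U([0-9A-Fa-f]{4})|#L([0-9A-Fa-f]{6})")


def _decode(match):
    code_point = int(match.group(1) or match.group(2), 16)
    # Leave the text untouched if it is not a valid Unicode code point.
    return chr(code_point) if code_point <= 0x10FFFF else match.group(0)


def unescape(name):
    """Replace #Uxxxx and #Lxxxxxx escapes with the characters they name."""
    return _ESCAPE.sub(_decode, name)


def escape(name):
    """Write every non-ASCII character in name as an escape."""
    out = []
    for ch in name:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"#U{cp:04x}")
        else:
            out.append(f"#L{cp:06x}")
    return "".join(out)
```

The point of the escapes is that a name can be typed on a command line whose character set cannot represent the character itself.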

Quoted from Al Dunsmuir
To handle cases where the current command character set can not be determined,

Not sure what you mean.  The command line is always assumed to be in the current character set and that is known.  Indeed, not sure how you would put any other characters on the command line.  Not sure about other ports, but on Windows you're stuck with that character set unless you use special character support features to compose the command line and Zip or UnZip uses the Windows wide character console calls to read it (which they currently don't, but it's on the list).

Quoted from Al Dunsmuir
or if there is a need to alter it so that the zip input ("from") file names can be specified in a character set that matches the referenced file names, there needs to be one more assist. I'd suggest something like --using-charset. This would only be allowed between other zip options; it could not be specified between a zip option and the arguments for that option.

Just don't see how that would work as you generally have only the current character set available to compose the command line.
Al Dunsmuir
July 27, 2010, 5:02am
Info-ZIP Team
Posts: 94
Quoted from EG

By Unicode escapes I refer to the #Uxxxx and #Lxxxxxx escapes currently used by Zip and UnZip, where x is a hex digit.  These can be used on the command line and are converted to the Unicode characters internally if UNICODE_SUPPORT is enabled.

That's what I thought you meant.
Quoted from EG

Not sure what you mean.  The command line is always assumed to be in the current character set and that is known.

Repeat after me: There is no such thing as ASCII or EBCDIC. There are many ISO code pages, each with a different representation for one or more variant characters. 

Even Windows supports changing your current code page to accommodate different language requirements.

The zip/unzip program is receiving a parsed command line broken into args by the C language runtime (which looks for specific characters to break into the args, typically whitespace or commas). These can include indirections to command files. The original characters came from a command file (entered via editor) or command line session (entered via terminal). There may be multiple command files entered in different edit sessions with different code pages.

The runtime may have the concept of "current code page" which is that of the command execution environment, or it may not. Even if it does, the code page used in a given command file (mapping various file names) may be deliberately different. That is why being able to explicitly control the "current" code pages is useful.

For example, on z/OS there is a #pragma users can put in their source and header files that lets the compiler know the encoding of that particular file.  It can be useful if one is using an English platform to compile a program written in German that uses libraries written in various code pages (including English) and the compiler tags each literal with the "from" code page and ensures that each is correctly generated.

Quoted from EG
Indeed, not sure how you would put any other characters on the command line.  Not sure about other ports, but on Windows you're stuck with that character set unless you use special character support features to compose the command line and Zip or UnZip uses the Windows wide character console calls to read it (which they currently don't, but it's on the list).


Shell files or batch files are written during edit sessions which use a given code page.  Indirected files to extend the command line do the rest.
Quoted from EG

Just don't see how that would work as you generally have only the current character set available to compose the command line.
 
You have only one code page at a given point of time, but you can create those command files with different invocations of the editor.
And as to the mainframe, the standard editor has a hex edit mode to ensure that you can cover these sorts of thing.   Evil but useful.

Clearly it is best to explicitly tag each command file that contains odd-ball characters and have the command parser do the right thing as it traverses each command file.

Realistically, you don't use this very often but when you need it, it is essential.
EG
July 27, 2010, 6:50am
Info-ZIP Team
Posts: 463
I'm too tired to do the quote thing.

> Even Windows supports changing your current code page to accommodate
different language requirements.

But not in the same command line, unless the Windows console tools are used.  But in that case Windows requires calling the wide console interface routines to get it, and that is returned as Unicode.  No port I've worked with returns multiple code pages on the same command line, unless the command line is returned in Unicode, which is actually still one character set.

> The zip/unzip program is receiving a parsed command line broken into
args by the C language runtime (which looks for specific characters to
break into the args, typically whitespace or commas).  These can include
indirections to command files.  The original characters came from a
command file (entered via editor) or command line session (entered via
terminal).  There may be multiple command files entered in different
edit sessions with different code pages.

Still, it's all in one character set.  You might somehow encode a string in another character set and embed it in the command line and so pass in gibberish in the current character set, but it is still the local character set being used by the application to read the string.  You're just encoding information in the bytes of that character set.  Things seem to be simpler if the user on z/OS just converts everything to UTF-8 and passes that in on the command line.  Zip, for example, then already knows how to handle that.  Sounds like you need a new tool rather than increase the bulk of Zip to deal with character encodings.  Also, I assume these different character sets are MBCS encoded.  The MBCS routines can fail if they aren't using the correct character set.

Honestly, I think time is better spent adding in new compression methods and strong encryption rather than dealing with trying to interpret bytes on the command line in multiple languages.

> The runtime may have the concept of "current code page" which is that of
the command execution environment, or it may not.

All modern ports I've worked with seem to understand locale to some degree.

>  Even if it does, the
code page used in a given command file (mapping various file names) may
be deliberately different.   That is why being able to explicitly
control the "current" code pages is useful.

We're talking about just file names, right?  When these names are restored, they are restored based on the local character set that UnZip finds itself in.  If Unicode is supported, then the names are restored using that, so should match exactly what was zipped up if Unicode was stored for the original names.  So even though you are possibly getting names from different files in different character sets, in the end those names are restored based on where they are going or on the Unicode stored.  In the latter case, why is knowledge of the original character set important if the actual characters from that character set are recorded in the Unicode?

> For example, on z/OS there is a #pragma users can put in their source
and header files that lets the compiler know the encoding of that
particular file.  It can be useful if one is using an English platform
to compile a program written in German that uses libraries written in
various code pages (including English) and the compiler tags each
literal with the "from" code page and ensures that each is correctly
generated.

We're still talking about file names, right?  Does any of that matter as long as the original name is recreated on the destination port?  Convert everything to Unicode before giving it to Zip and UnZip should recreate all the names.  Saves a lot of work tracking character sets.

Still haven't sold me that having Zip do translations from multiple code pages is worth adding additional complexity to Zip to support.  Or adding stored code page support to UnZip to do what is already possible if the input to Zip is Unicode.  One thing we try to avoid is adding features that only a couple users would ever use.  This seems like one of them.  Also, Unicode is the standard for zip archives.  Storing code pages would be non-standard.

By the way, this probably would greatly increase the complexity of the parser.  Currently the command line is permuted, so any character encoding flags would have to be permuted with the arguments they go with.  Then that information needs to be passed in with the arguments, adding an additional data flag that would have to be referenced throughout.  To be compatible with other utilities, the names would still have to be converted to UTF-8, and if they aren't in the local character set, that probably would build in a dependence on iconv.

So if you have all the right characters, why do you need the original code page?
Al Dunsmuir
July 27, 2010, 9:45am
Info-ZIP Team
Posts: 94
Ed,
Quoted from EG
So if you have all the right characters, why do you need the original code page?

You have a string of bytes, not characters.    Characters == bytes interpreted via a code page.

Let me explain again, for the simple 1 code page case.


  • UTF-8 is an ASCII-based method of encoding UNICODE data.

  • Your parser is collecting EBCDIC characters in a given code page, and converting them to UTF-8 via iconv.

    But Wait! You can't convert those EBCDIC characters to UTF-8 without knowing the "from" IBM-xxxx code page.  Otherwise you are going to randomly interpret any NLS-variant characters during the conversion.

  • This conversion must be performed via iconv() on z/OS, since UTF-8 is not a native character encoding mechanism.

  • After your UTF-8 based parser has chugged away and processed these strings into tokens, you now want to go through the actual task of encoding your user's file.   This means you need to translate this name back from UTF-8 into an EBCDIC name that can actually be used by the C runtime file I/O routines.

    But Wait!  You can't convert that UTF-8 string back into the EBCDIC characters without knowing the "to" IBM-xxxx EBCDIC code page.

If the transformations going in to the parser are not 100% reversible during zip/unzip processing, then the parser is not usable, plain and simple.
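
To make the reversibility point concrete, here is a minimal Python sketch. It assumes Python's standard cp037 and cp500 codecs as stand-ins for two IBM-xxxx EBCDIC code pages; the helper names are hypothetical, and real zip/unzip code would go through iconv() in C.

```python
def to_utf8(raw, from_page):
    """EBCDIC bytes -> UTF-8 bytes; only meaningful if from_page is right."""
    return raw.decode(from_page).encode("utf-8")


def from_utf8(utf8, to_page):
    """UTF-8 bytes -> EBCDIC bytes in the requested code page."""
    return utf8.decode("utf-8").encode(to_page)


# Two bytes as they might arrive in a PARM string. 0xC1 is 'A' in both
# pages, but 0x5A is a variant character that differs between them.
raw = b"\xc1\x5a"

as_037 = to_utf8(raw, "cp037")
as_500 = to_utf8(raw, "cp500")

# Same bytes, different characters, depending on the assumed "from" page.
pages_disagree = as_037 != as_500

# Converting back with the matching page recovers the original bytes...
round_trip_ok = from_utf8(as_037, "cp037") == raw

# ...while converting back with the wrong page does not.
wrong_page_breaks = from_utf8(as_037, "cp500") != raw
```

All three flags come out true, which is the whole argument: without recording the page on the way in, there is no reliable way back out.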

Now as to your remark about locales, that is fine for a POSIX environment such as z/OS USS... but such things do not exist in a vanilla z/OS MVS environment such as batch, or on VM/CMS.    This means that we need to default to a code page (likely the IBM-1047 standard USS one, as it has more characters than older pages such as IBM-037).   

A default code page does us no good if the user wants to specify NLS-specific characters not in IBM-1047.  There needs to be a way to switch the command line processing into using a given code page so it can process names passed later on in either a JCL PARM= value or in a zip/unzip command line @file.  

The issues with unzip are similar: they arise when users try to process a file containing file names and/or file content that were originally in languages not mapped by the default IBM-1047 (Western European Latin 1).

Say the user is on a US system, and their command line is being created using an editor that can switch languages (current code page).  The user's task is to extract language specific files from a given zip file containing program literal translations.   They need to handle German, Lithuanian, Hindi and Russian.



  • The file was encoded on AIX, and each of the file names is encoded in Unicode (UTF-8 representation) and the file contents are each encoded to match the encoding of the file name.

  • If extracting to z/OS MVS, we must remap our file names (or have unzip generate an artificial "clean" file name).

  • If extracting on z/OS USS, we need to be able to extract each UTF-8 encoded file name into the appropriate EBCDIC encoding for that file name.

  • For both variations, we need to be able to specify the translation of the file data.  If the file data is not tagged, both a from and to codepage name are required.
Does this help explain better?


Please see the posting by fits in InfoZIP UnZIP/new Function $LETDSN$ and $MHQDSN$ at July 26, 2010, 4:08am. 


  • This shows him building custom unzip modules, each with a single "from" "to"  combination for data translation.
  • It is a very ugly but functional solution, as he emphasises.

    It is not a solution that you would want to propagate to other users on the platform via the beta or release builds, since it would break every other user of the zip/unzip ports - they do not want that particular translation pairing, but their own appropriate to each file.  Lord help us when someone wants to transfer zip files between z/OS systems using different port translation variations.
EG
July 28, 2010, 12:26am
Info-ZIP Team
Posts: 463
So names are not just names on MVS and z/OS.  I'm concluding that z/OS is not as straightforward as nearly any other port, including Windows and its use of OEM translations.  Sorry, but this stuff makes me a bit dizzy.  OK.  Sounds like we need to store the code page information just for these platforms.  Maybe that's something to include in the single new extra field.
Al Dunsmuir
July 28, 2010, 5:06am
Info-ZIP Team
Posts: 94
Quoted from EG
So names are not just names on MVS and z/OS.  I'm concluding that z/OS is not as straightforward as nearly any other port, including Windows and its use of OEM translations.  Sorry, but this stuff makes me a bit dizzy.
 
I think my wild hand waving has had the desired effect.   I think the classic map note was "There be dragons!"

By the way, VM/CMS has its own fair share of pitfalls.


  • We now know that CMS has 3 file representations that need to be supported.  Each will need a unique platform file type code.
  • A lot will be a combination of existing techniques from other platforms.
  • There will be heavy reuse of common code with the z/OS platform.

    I.e.  VM/CMS BFS files are effectively the same as z/OS USS files, except they use separate platform file type codes.
          VM/CMS FILEDEFs are effectively the same as z/OS DD statements (or dynamic allocations).
          VM/CMS non-POSIX environment has a similar lack of locale support to z/OS MVS non-POSIX environment.
Quoted from EG
OK.  Sounds like we need to store the code page information just for these platforms.  Maybe that's something to include in the single new extra field.

Sorry.  That only works if you assume that a binary copy of the data bytes between line terminators is all you need.

The fact that this may be what is happening now simply means that you are not going to correctly handle text transfers with variant characters between two computers with non-identical settings.

Here's the thing: You have the same issues on all platforms when your data is not Unicode.  Even for Unicode, you need to identify the Unicode encoding representation (UTF-8, UTF-EBCDIC, UCS-2, etc.).

You need to know the "from" and "to" code pages to translate variant characters correctly between two ASCII-based code pages.   If I want to import that data into z/OS or VM/CMS, I will need to know the "from" code page for all translations.
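A small Python sketch of the ASCII-to-ASCII case, using Python's codec machinery as a stand-in for an iconv handle opened with a "from" and a "to" name. The `transcode` helper is hypothetical, and the two code pages (`cp850` and `latin-1`) are just an illustrative pair.

```python
def transcode(data: bytes, from_cp: str, to_cp: str) -> bytes:
    """Translate bytes from one named code page to another."""
    return data.decode(from_cp).encode(to_cp)


# "München" as stored on a system using code page 850.
dos_text = b"M\x81nchen"

# Correct "from" name: the u-umlaut moves to its ISO-8859-1 code point.
right = transcode(dos_text, "cp850", "latin-1")

# Wrong "from" name: the bytes pass through unchanged and the variant
# character silently becomes something else on the receiving side.
wrong = transcode(dos_text, "latin-1", "latin-1")
```

The invariant letters come out the same in both cases, which is why this kind of problem often goes unnoticed until a name or text with a variant character shows up.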

If all of the InfoZIP ports try to do the right thing with text data and allow the user to override appropriately when another translation is required, then you will have made a serious dent in eliminating problems moving between systems, different locales, and differences in character encodings.
EG
July 29, 2010, 1:40am
Info-ZIP Team
Posts: 463
>>OK.  Sounds like we need to store the code page information just for these platforms.  Maybe that's something to include in the single new extra field.

>Sorry.  That only works if you assume that a binary copy of the data bytes between line terminators is all you need.

Let's clearly define the tasks being discussed.

One is the conversion of file names from one character set to another.  That is currently done using Unicode as specified in the zip standard (AppNote) and has been implemented in Zip and UnZip, as well as WinZip and PKZip.  This has been tested on all ports, except the more obscure mostly dead ones and the MVS variants.

A second is converting entry comments from one character set to another.  This is currently done in the same way as file names are done and is also covered in the standard.  Though Zip and UnZip have not implemented this functionality yet, WinZip and PKZip have.  This has been more or less tested, in the case of Zip using code that hasn't made it into the main code tree yet.

A third task is converting file contents.  Currently the only things that Zip and UnZip support are line end conversions and some crude EBCDIC ASCII conversions.  There are also the OEM translations done on Windows, but those are specific to Windows.  Converting file contents between languages is probably not something we want to get into.  Assuming the contents are to be in the same language but are being moved to a different platform or a different code page (that is compatible with the language), there seem to be things like line end conversions that can be done to help users.  Also character set translations from ASCII to EBCDIC and back.  These are defined processes so should be straightforward to implement.

We were talking about file names and how they get added to a command line.  Apparently also about those names possibly being in different encodings, and so the desire for a way to tell Zip what those encodings are.

Now it seems we're talking about translating file contents, where the line ends are.  Or is this referring to the code page the command line is coming from?

Also, are we talking about converting contents between languages or just accounting for platform differences?

>The fact that this may be what is happening now simply means that you are not going to correctly handle text transfers with variant characters between two computers with non-identical settings.

What specific transfers?


I'm sorry, but I'm getting to the point where maybe I need to wait for you to post code to see just what you're talking about.  Maybe if you can provide some specific examples showing what needs to happen it might help.
Al Dunsmuir
July 29, 2010, 1:39pm
Info-ZIP Team
Posts: 94
Ed,

| Modified to note trimming of trailing blanks, and blank padding operations. 
| Note that any of the "-- options" noted below are simply 1st pass "option names that might work" proposals.

As you have indicated you want to rework the parser to use UTF-8 encoding, it seems appropriate to try to identify and resolve issues before the work is done, rather than go through multiple iterations.

The issue is that when dealing with text characters, the translation issues are the same whether one is dealing with command line text (which may represent data file names) or data file contents.  That's why we keep talking about the same fundamental concepts/problems again.   I tried to provide a quick summary and provide a bunch of pointers to resources in the topic "ZIP (& UNZIP) data translation (EBCDIC and/or NLS)", but perhaps I should have called it "ZIP (& UNZIP) character translation (EBCDIC and/or NLS)".

Please note that most of Josef's (fits') patches deal with trying to handle this issue for his files - it is a very big deal for folks who have to deal with multiple language encodings on a daily basis.

There is no real problem with invariant characters - the ones that maintain the same byte value encoding (code point) within their given character set.  There is a subset of these invariant characters that can be directly mapped back and forth between ASCII (and UTF-8) and EBCDIC.   It can be done easily with a translate table, and that is what the 32-bit implementation of zip and unzip did.

The problem comes with variant characters.  The translations using hard-coded tables often don't bother to check if the inbound character has a valid translation.  That is a problem when the end user requires the data be kept useful and intact.
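The following Python sketch shows one way a 256-entry translate table could carry a validity flag per byte, so the caller can tell a real translation from a substitution. It is not the table used by zip or unzip: `build_table` is a hypothetical helper, and Python's `cp037` codec stands in for an EBCDIC page.

```python
def build_table(from_cp: str, to_cp: str, sub: int = ord("?")):
    """Build a 256-byte translate table plus a per-byte validity list."""
    table = bytearray(256)
    valid = [False] * 256
    for b in range(256):
        try:
            out = bytes([b]).decode(from_cp).encode(to_cp)
        except UnicodeError:
            out = b""
        if len(out) == 1:
            table[b] = out[0]
            valid[b] = True
        else:
            table[b] = sub  # no single-byte equivalent in the target
    return bytes(table), valid


table, valid = build_table("cp037", "ascii")

# Invariant characters translate cleanly.
ebcdic_name = "HELLO.TXT".encode("cp037")
ascii_name = ebcdic_name.translate(table)
unmapped = sum(1 for b in ebcdic_name if not valid[b])

# A variant character has no ASCII equivalent and is flagged as such.
accented = "é".encode("cp037")
accented_ok = valid[accented[0]]
```

A table without the `valid` list would produce the same output bytes, but would give no way to report that data was lost.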

To map these variant characters via iconv, one opens an iconv handle with the names of the "from" and "to" code pages.  This is the same whether those are an ASCII-flavoured code page or an EBCDIC-flavoured code page.   The important point is that there is no such thing as a code page called "EBCDIC" nor one called "ASCII".   You need to know the exact variation that one is dealing with by name.

Iconv will report an error condition when it encounters input characters that have no valid representation in the output code page.  When transforming to an EBCDIC-based code page, it replaces that character with an 0x3F. What we do in our server is track the number of lines with errors, and the total number of untranslatable characters and report it when processing that particular file is done.
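The bookkeeping described above could look roughly like the Python sketch below. It does not call iconv; Python's `cp037` codec stands in for the EBCDIC target, the substitute byte 0x3F is taken from the description above, and the function name is hypothetical.

```python
SUB = b"\x3f"  # substitute byte for characters with no target mapping


def translate_lines(lines, to_cp):
    """Translate text lines, substituting and counting unmappable characters.

    Returns (translated_lines, lines_with_errors, untranslatable_chars).
    """
    translated = []
    lines_with_errors = 0
    bad_chars = 0
    for line in lines:
        out = bytearray()
        bad_in_line = 0
        for ch in line:
            try:
                out += ch.encode(to_cp)
            except UnicodeEncodeError:
                out += SUB
                bad_in_line += 1
        if bad_in_line:
            lines_with_errors += 1
            bad_chars += bad_in_line
        translated.append(bytes(out))
    return translated, lines_with_errors, bad_chars


lines = ["plain text", "Привет", "price: 5€"]
translated, lines_with_errors, bad_chars = translate_lines(lines, "cp037")
```

Reporting both counts once the file is done lets the user decide whether the result is still usable, instead of silently accepting damaged text.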

The typical scenarios related to character translation/transformations in zip and unzip would be:


  • Encoding command line text in the parser into UTF-8.

    Here one needs to open the iconv handle with the current command line code page name. 

    For z/OS, this would tend to be "IBM-1047" and "UTF-8".   Josef would mostly run with a different EBCDIC (IBM-xxxx) code page as he is not using Western-European encoding when he runs.

    Providing a mechanism within the parser that tags each token with the current "from" code page is important to later processing in zip/unzip.

    Being able to switch the command line processing between different "from" code pages via a reserved keyword (say --cp codepage_name) helps keep processing simple and explicit when one is dealing with a situation where file names are encoded in different code pages. 

    Just maintain an array of iconv handles for a given codepage and the overhead is minimal. Close all the iconv handles when parsing is complete.  Alternatively use a cache with from and to names and it can be used throughout zip/unzip processing and closed at the end.

  • Doing encoding or decoding processing within zip or unzip

    The issue here is that one needs to open the file referenced in the command line using native encoding, not the internally encoded UTF-8 representation.  Each byte in that file name used during the file operations must be identical to the file name bytes within the filesystem.  Any discrepancy in byte values for the name means you will not be able to access that file.

    The file name is built up from the command line tokens, translated back from UTF-8, again via iconv.
    Since we have the required "to" codepage name for the iconv open encoded along with that file name token, it is easy to manage.

    It could get a tad complex with file names built up from multiple tokens which happened to have different encoding code page names.   Best policy is likely to build it up in segments each translated with their respective code page names.

  • As the text data is read from the file, one may need to do data transformations (if requested by command line options) before zip encoding each text line.  These transformations include:

    - Tab expansion of the file text line (--detab)

    - Trim trailing blanks from the file text line (--trim)

    - Tab encoding of the file text line (--entab)

    - Translation of the data using the hard-coded ASCII->EBCDIC table (--ascii)

    - Translation of the data using the hard-coded EBCDIC->ASCII table (--ebcdic)

    - Translation of the data using iconv and the specified code page name pair (--iconv cp1 cp2)

    - Translation of the line termination character (--?)
  • During file text data decoding in unzip, one may need to do data transformations (if requested by command line options).  This can get a bit tricky as one must be able to recognize the logical end of each data line - which is why explicitly noting that in a per-file attribute really simplifies processing.   These transformations include:  

    - Translation of the data using the hard-coded ASCII->EBCDIC table (--ascii)

    - Translation of the data using the hard-coded EBCDIC->ASCII table (--ebcdic)

    - Translation of the data using iconv and the specified code page name pair (--iconv cp1 cp2)

    - Tab expansion of the file text line (--detab)

    - Trim trailing blanks from the file text line (--trim)

    - Tab encoding of the file text line (--entab)

    - Pad file text line with blanks (if required by RECFM=Fxxx format)

    - Translation of the line termination character (--?)
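A few of the per-line transformations listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and its parameters are hypothetical, it works on already-decoded text and translates last (equivalent for single-byte code pages), and Python's `cp037` codec stands in for the target EBCDIC page.

```python
def prepare_record(line: str, to_cp: str, lrecl: int = 0,
                   detab: bool = False, trim: bool = False) -> bytes:
    """Apply optional per-line transformations, then translate.

    lrecl > 0 pads the line with blanks to a fixed record length,
    as a RECFM=F style data set would require.
    """
    if detab:
        line = line.expandtabs(8)   # tab stops every 8 columns
    if trim:
        line = line.rstrip(" ")     # drop trailing blanks only
    if lrecl:
        if len(line) > lrecl:
            raise ValueError("line longer than record length")
        line = line.ljust(lrecl)    # blank padding
    return line.encode(to_cp)


record = prepare_record("A\tB   ", "cp037", lrecl=16, detab=True, trim=True)
```

Doing the trim before the padding matters: trailing blanks from the source are removed first, and only then is the record brought up to the fixed length.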
On the z/OS and VM/CMS platforms, we would not be using the MBCS functions for character translation, only iconv. 



  • That is the standard mechanism that has been used for many years on this platform.

  • Remember that when UNIX System Services was added, the default character set chosen was EBCDIC for file names and data. This matches the convention used for native files and APIs, as well as platform I/O such as terminals.

  • The z/OS MVS platform supported many different character encodings for EBCDIC.  These were originally standardized with a code page number.  When ISO and iconv came around, all they did was add an "IBM-" prefix to that existing number.

  • The ASCII code page numbers that I'm familiar with were from Windows and OS/2 (CP-850).  ISO added some standardized ones (ISO-8859-1, etc.) but unfortunately Microsoft did not switch to using the standard ISO code pages and continued to use their own which had more variant characters encoded.  Those were also assigned ISO code page names.
Nothing that has been mentioned so far is really language related - just related to the translation of characters used to represent languages in a text file.

There are some very funky transformations that are language-related that are done within iconv in some situations, where characters represented by a single character in one code page are transformed to (or from) a sequence of multiple characters.  I think that is done to handle some languages like Finnish.

Those transformations would mess up the simple 8-byte entab/detab processing a wee bit, but that is not worth worrying about in the general case.  If the user cares, they will have their own fancy entab/detab processor, or simply have someone correct as required with a text editor.

This isn't code... but I hope this gives you some insight in how this all affects how the code is designed.