zsh and diacritics in the console not supported - zsh does not know UTF-8


To czdebian-l zavinac debian bod cz
From Jan Kunder <jan bod kunder zavinac gmail bod com>
Date Tue, 18 Apr 2006 20:38:49 +0200
Organization Jan Kunder
User-agent Thunderbird 1.5 (Windows/20051201)

Good day.

Can anyone confirm or refute $SUBJECT for me?

Will I now have to switch to bash every time I want to delete the file ŇČŤÄÜÔŇŠŤŽĽ, or open the file ĺľščťžýáíéňäú in vim?
(IMHO diacritics in file names are a mess, but I have no way to change or get rid of them.)

This is on Debian Sarge.


****************
http://www.cs.elte.hu/pub/zsh/FAQ
http://packages.debian.org/cgi-bin/search_packages.pl?suite=all&subword=0&exact=1&arch=any&section=all&case=insensitive&keywords=zsh&searchon=names
so full UTF-8 support in zsh is not coming even in 4.4 :(

****************
In bash everything simply works OK:
ubash@ltspke:/tmp/9999ubash/ŇČŤÄÜÔŇŠŤŽĽ$ touch ĺľščťžýáíéňäú
ubash@ltspke:/tmp/9999ubash/ŇČŤÄÜÔŇŠŤŽĽ$
***
in zsh:
/tmp/9999ubash$ cd -^--^-Ť-^--^--^--^-ŠŤŽĽ/ /tmp/9999ubash/\M-E\M-^G\M-D\M-^L\M-E\M-$\M-C\M-^D\M-C\M-^\\M-C\M-^T\M-E\M-^G\M-E\M- \M-E\M-$\M-E\M-=\M-D\M-=$
**
/tmp/9999ubash/\M-E\M-^G\M-D\M-^L\M-E\M-$\M-C\M-^D\M-C\M-^\\M-C\M-^T\M-E\M-^G\M-E\M- \M-E\M-$\M-E\M-=\M-D\M-=$ ll
total 0
-rw-r--r--  1 ubash ubash 0 2006-04-14 14:28 ÉĹÓÁÚ
-rw-r--r--  1 ubash ubash 0 2006-04-14 21:18 ĺľščťžýáíéňäú
ťžýáíé-^-äú
touch: cannot touch `ĺľščťžýáíéňäú': Prístup odmietnutý
/tmp/9999ubash/\M-E\M-^G\M-D\M-^L\M-E\M-$\M-C\M-^D\M-C\M-^\\M-C\M-^T\M-E\M-^G\M-E\M- \M-E\M-$\M-E\M-=\M-D\M-=$

****************
This bothers me, because I "push" zsh :) on everyone, and I write scripts in bash only out of habit and because bash is available everywhere. Otherwise I run zsh on every console. And now I find out that bash has known UTF-8 for a long time, while zsh only does from 4.3 (which does exist, but I am not going to migrate away from stable because of it), and full support will not be there even in 4.4 :-(.






****************
http://www.cs.elte.hu/pub/zsh/FAQ

2.7: What is zsh's support for Unicode/UTF-8?

  `Unicode', or UCS for Universal Character Set, is the modern way
  of specifying character sets.  It replaces a large number of ad hoc
  ways of supporting character sets beyond ASCII.  `UTF-8' is an
  encoding of Unicode that is particularly natural on Unix-like systems.

  Q: Does zsh support UTF-8?

  A: zsh's built-in printf command supports "\u" and "\U" escapes
  to output arbitrary Unicode characters.  ZLE (the Zsh Line Editor) has
  no concept of character encodings, and is confused by multi-octet
  encodings.

  Q: Why doesn't zsh have proper UTF-8 support?

  A: The code has not been written yet.

  Q: What makes UTF-8 support difficult to implement?

  A: In order to handle arbitrary encodings the correct way, significant
  and intrusive changes must be made to the shell.

  Q: Why can't zsh just use readline?

  A: ZLE is not encapsulated from the rest of the shell.  Isolating it
  such that it could be replaced by readline would be a significant
  effort.  Furthermore, using readline would effect a significant loss of
  features.

  Q: What changes are planned?

  A: Introduction of Unicode support will be gradual, so if you are
  interested in being involved you should join the zsh-workers mailing
  list.  As a first step ZLE will be rewritten to use wide characters
  internally.  Character based widgets can then operate on a single wide
  character instead of a single byte, and the proper display width can be
  calculated with wcswidth().

[cut]

Chapter 5: Multibyte input

5.1: What is multibyte input?

  For a long time computers had a simple idea of a character: each octet
  (8-bit byte) of text contained one character.  This meant an application
  could only use 256 characters at once.  The first 128 characters (0 to
  127) on Unix and similar systems usually corresponded to the ASCII
  character set, as they still do.  So all other possibilities had to be
  crammed into the remaining 128.  This was done by picking the appropriate
  character set for the use you were making.  For example, ISO 8859
  specified a set of extensions to ASCII for various alphabets.

  This was fine for simple extensions and certain short enough relatives of
  the Latin alphabet (with no more than a few dozen alphabetic characters),
  but useless for complex alphabets.  Also, having a different character
  set for each language is inconvenient: you have to start a new terminal
  to run the shell with each character set.  So the character set had to be
  extended.  To cut a long story short, the world has mostly standardised
  on a character set called Unicode, related to the international standard
  ISO 10646.  The intention is that this will contain every single
  character used in all the languages of the world.

  This has far too many characters to fit into a single octet.  What's
  more, UNIX utilities such as zsh are so used to dealing with ASCII that
  removing it would cause no end of trouble.  So what happens is this: the
  128 ASCII characters are kept exactly the same (and they're the same as
  the first 128 characters of Unicode), but the remaining 128 characters
  are used to build up any other Unicode character by combining multiple
  octets together.  The shell doesn't need to interpret these directly; it
  just needs to ask the system library how many octets form the next
  character, and if there's a valid character there at all.  (It can also
  ask the system what width the character takes up on the screen, so that
  characters no longer need to be exactly one position wide.)

  The way this is done is called UTF-8.  Multibyte encodings of other
  character sets exist (you might encounter them for Asian character sets);
  zsh will be able to use any such encoding as long as it contains ASCII as
  a single-octet subset and the system can provide information about other
  characters.  However, in the case of Unicode, UTF-8 is the only one you
  are likely to encounter.

  (In case you're confused: Unicode is the character set, while UTF-8 is
  an encoding of it.  You might hear about other encodings, such as UCS-2
  and UCS-4 which are basically the character's index in the character set
  as a two-octet or four-octet integer.  You might see files encoded this
  way, for example on Windows, but the shell can't deal directly with text
  in those formats.)
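
  To make the octet-combining scheme above concrete, here is a minimal
  sketch in portable shell arithmetic.  The helper name utf8_encode is
  hypothetical (it is not part of zsh), and the sketch assumes a valid
  code point: it does not reject surrogates or values beyond U+10FFFF.

```shell
# Hypothetical helper: take a Unicode code point in hexadecimal and write
# its UTF-8 octets to standard output.
utf8_encode() {
  cp=$((0x$1))
  if [ "$cp" -lt 128 ]; then
    # ASCII stays a single octet, unchanged.
    set -- "$cp"
  elif [ "$cp" -lt 2048 ]; then
    # Two octets: 110xxxxx 10xxxxxx
    set -- $((192 | cp >> 6)) $((128 | cp & 63))
  elif [ "$cp" -lt 65536 ]; then
    # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
    set -- $((224 | cp >> 12)) $((128 | cp >> 6 & 63)) $((128 | cp & 63))
  else
    # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    set -- $((240 | cp >> 18)) $((128 | cp >> 12 & 63)) \
           $((128 | cp >> 6 & 63)) $((128 | cp & 63))
  fi
  for octet do
    # Emit each octet through an octal escape in the printf format.
    printf "\\$(printf '%03o' "$octet")"
  done
}

# U+013E (l with caron) becomes the two octets c4 be.
utf8_encode 13e | od -An -tx1
```

  Every octet of a multi-octet character has its top bit set, so none of
  them can be mistaken for an ASCII character.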

5.2: How does zsh handle multibyte input?

  Until version 4.3, zsh didn't handle multibyte input properly at all.
  Each octet in a multibyte character would look to the shell like a
  separate character.  If your terminal handled the character set,
  characters might appear correct on screen, but trying to edit them would
  cause all sorts of odd effects.  (It was possible to edit in zsh using
  single-byte extensions of ASCII such as the ISO 8859 family, however.)

  From version 4.3, multibyte input is handled in the line editor if zsh
  has been compiled with the appropriate definitions.  This will happen
  automatically if the compiler defines __STDC_ISO_10646__, which is true
  for many recent GNU-based systems.  On other systems you must configure
  zsh with the argument --enable-multibyte to configure.  Explicit use of
  --enable-multibyte should work on many other recent UNIX systems; if it
  works on yours, and that's not mentioned in the shell documentation,
  please report this to zsh-workers@sunsite.dk, and if it doesn't but you
  can work out why not we'd also be interested in hearing.

  (The reason for the test for __STDC_ISO_10646__ is that its presence
  happens to indicate that the required library support is likely to be
  present, short-circuiting a large number of configuration tests.  This
  isn't strictly guaranteed, since the definition indicates the rather more
  limited fact that the wide character representation used internally by
  the shell is Unicode.  However, in practice such systems provide the
  right level of support for zsh to use.  It would be better to test
  individually for the library features the shell needs; unfortunately
  there are a lot of them.)

  You can test if multibyte handling is compiled into your version of the
  shell by running:

    (bindkey -m)

  which should output a warning:

    bindkey: warning: `bindkey -m' disables multibyte support

  If it doesn't, you don't have multibyte support in your shell.  The
  parentheses are there to run the command in a subshell, which protects
  your interactive shell from the effects being warned about.
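
  Before trying the probe above it can help to rule out shells that are
  simply too old.  The following is a rough sketch only: the function
  name is hypothetical, it relies on the ZSH_VERSION parameter when no
  argument is given, and a new enough version does not prove that
  multibyte support was actually compiled in.

```shell
# Hypothetical helper: succeed if a version string is 4.3 or later, the
# first series with multibyte handling in the line editor.
# Returns 2 if no usable version string is available.
zsh_version_has_multibyte() {
  ver=${1:-${ZSH_VERSION:-}}
  [ -n "$ver" ] || return 2
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  # Refuse anything that is not plainly numeric.
  case $major in ''|*[!0-9]*) return 2 ;; esac
  case $minor in ''|*[!0-9]*) return 2 ;; esac
  [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 3 ]; }
}

if zsh_version_has_multibyte; then
  echo "new enough for multibyte input; now try (bindkey -m)"
else
  echo "older than 4.3, or not running under zsh"
fi
```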

  Multibyte strings are not yet handled anywhere else in the shell.  This
  means, for example, patterns treat multibyte characters as a set of single
  octets and the ${#var} syntax counts octets, not characters.  There will
  probably be new syntax to ensure that zsh can work both in its traditional
  way as well as when interpreting multibyte characters.
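
  The difference between octets and characters is easy to see from the
  command line.  This is a small sketch: octet_length is a hypothetical
  helper, and what ${#name} prints depends on the shell and locale in
  use, so no fixed result is claimed for it.

```shell
# Hypothetical helper: number of octets in a string, whatever the locale.
octet_length() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# The two octets of U+013E, written as octal escapes so that the example
# does not depend on the encoding of this file.
name=$(printf '\304\276')

printf 'octets:   %s\n' "$(octet_length "$name")"
# Prints 2 in a shell that counts octets, 1 in one that counts characters
# under a UTF-8 locale.
printf '${#name}: %s\n' "${#name}"
```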

5.3: How do I ensure multibyte input works on my system?

  Once you have a version of zsh with multibyte support, you need to
  ensure the environment is correct.  We'll assume you're using UTF-8.
  Many modern systems may come set up correctly already.  Try one of
  the editing widgets described in the next section to see.

  There are basically three components.

   o  The locale.  This describes a whole series of features specific
      to countries or regions of which the character set is one.  Usually
      it is controlled by the environment variable LANG (there are
      others but this is the one to start with).  You need to find a
      locale whose name contains `UTF-8'.  This will be a variant on
      your usual locale, which typically indicates the language and
      country; for example, mine is `en_GB.UTF-8'.  Luckily, zsh can
      complete locale names, so if you have the new completion system
      loaded you can type `export LANG=' and attempt to complete a
      suitable locale.  It's the locale that tells the shell to expect the
      right form of multibyte input.  (However, there's no guarantee that
      the shell is actually going to get this input: for example, if you
      edit file names that have been created using a different character
      set it won't work properly.)
   o  The terminal emulator.  Those that are supplied with a recent
      desktop environment, such as gnome-terminal, are likely to have
      extensive support for localization and may work correctly as soon
      as they know the locale.  You can enable UTF-8 support for
      xterm in its application defaults file.  The following are
      the relevant resources; you don't actually need all of them, as
      described below.  If you use a `~/.Xdefaults' or
      `~/.Xresources' file for setting resources, prefix all the lines
      with `xterm':

        *wideChars: true
        *locale: true
        *utf8: 1
        *vt100Graphics: true

      This turns on support for wide characters (this is enabled by the
      utf8 resource, too); enables conversions to UTF-8 from other
      locales (this is the key resource and actually overrides
      `utf8'); turns on UTF-8 mode (this resource is mostly used to
      force use of UTF-8 characters if your locale system isn't up to it);
      and allows certain graphic characters to work even with UTF-8
      enabled.  (Thanks to Phil Pennock for suggestions.)
   o  The font.  If you selected this from a menu in your terminal
      emulator, there's a good chance it already selected the right
      character set to go with it.  If you hand-picked an old fashioned
      X font with a lot of dashes, you need to make sure it ends with
      the right character encoding, `iso10646-1' (and not, for
      example, `iso8859-1').  Not all characters will be available
      in any font, and some fonts may have a more restricted range of
      Unicode characters than others.
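
  The first of these components, the locale, can be checked from the
  shell itself.  The sketch below uses two hypothetical helper functions
  and assumes the usual precedence of the locale variables (LC_ALL, then
  LC_CTYPE, then LANG); it only inspects the locale name and does not
  verify that the locale is installed.

```shell
# Hypothetical helper: print the locale name that governs character
# handling.  Empty values are ignored, as the C library does.
effective_ctype() {
  printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Hypothetical helper: succeed if a locale name advertises UTF-8,
# spelled either UTF-8 or utf8.
is_utf8_locale_name() {
  case $1 in
    *[Uu][Tt][Ff]-8*|*[Uu][Tt][Ff]8*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_utf8_locale_name "$(effective_ctype)"; then
  echo "locale looks like UTF-8: $(effective_ctype)"
else
  echo "locale is not UTF-8: $(effective_ctype)"
fi
```

  Where available, `locale charmap' reports the character encoding of
  the current locale directly and `locale -a' lists the installed
  locales.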

  As mentioned in the previous section, `bindkey -m' now outputs
  a warning message telling you that multibyte input from the terminal
  is likely not to work.  (See 3.5 if you don't know what
  this feature does.)  If your terminal doesn't have characters
  that need to be input as multibyte, however, you can still use
  the meta bindings and can ignore the warning message.  Use
  `bindkey -m 2>/dev/null' to suppress it.

5.4: How can I input characters that aren't on my keyboard?

  Two functions are provided with zsh that help you input characters.
  As with all editing widgets implemented by functions, you need to
  mark the function for autoload, create the widget, and, if you are
  going to use it frequently, bind it to a key sequence.  The
  following binds insert-composed-char to F5 on my keyboard:

    autoload -Uz insert-composed-char
    zle -N insert-composed-char
    bindkey '\e[15~' insert-composed-char

  The two widgets are described in the zshcontrib(1) manual
  page, but here is a brief summary:

  insert-composed-char is followed by two characters that
  are a mnemonic for a multibyte character.  For example `a:'
  is a with an Umlaut; `cH' is the symbol for hearts on a playing
  card.  Various accented characters, European and related alphabets,
  and punctuation and mathematical symbols are available.  The
  mnemonics are mostly those given by RFC 1345, see
  http://www.faqs.org/rfcs/rfc1345.html.

  insert-unicode-char is used to input a Unicode character by
  its hexadecimal number.  This is the number given in the Unicode
  character charts, see for example http://www.unicode.org/charts/.
  You need to execute the function, then type the hexadecimal number
  (you can omit any leading zeroes), then execute the function again.

  Both functions can be used without multibyte mode, provided the locale is
  correct and the character selected exists in the current character set;
  however, using UTF-8 massively extends the number of valid characters
  that can be produced.



Thanks.


--
Jan Kunder
jan.kunderHATESPAMgmail.com

