zsh and diacritics in the console not supported - zsh does not know UTF-8
To |
czdebian-l at debian dot cz |
From |
Jan Kunder <jan dot kunder at gmail dot com> |
Date |
Tue, 18 Apr 2006 20:38:49 +0200 |
Organization |
Jan Kunder |
User-agent |
Thunderbird 1.5 (Windows/20051201) |
Good day.
Can anyone confirm or refute $SUBJECT for me?
Will I now have to switch to bash every time I want to delete the
file ŇČŤÄÜÔŇŠŤŽĽ, or open the file ĺľščťžýáíéňäú in vim?
(IMHO diacritics in file names are an ugly hack anyway, but I cannot change or get rid of them.)
This is on Debian Sarge.
****************
http://www.cs.elte.hu/pub/zsh/FAQ
http://packages.debian.org/cgi-bin/search_packages.pl?suite=all&subword=0&exact=1&arch=any&section=all&case=insensitive&keywords=zsh&searchon=names
so full UTF-8 support in zsh is not coming even in 4.4 :(
****************
In bash everything simply works fine:
ubash@ltspke:/tmp/9999ubash/ŇČŤÄÜÔŇŠŤŽĽ$ touch ĺľščťžýáíéňäú
ubash@ltspke:/tmp/9999ubash/ŇČŤÄÜÔŇŠŤŽĽ$
***
in zsh:
/tmp/9999ubash$ cd -^--^-Ť-^--^--^--^-ŠŤŽĽ/
/tmp/9999ubash/\M-E\M-^G\M-D\M-^L\M-E\M-$\M-C\M-^D\M-C\M-^\\M-C\M-^T\M-E\M-^G\M-E\M-
\M-E\M-$\M-E\M-=\M-D\M-=$
**
/tmp/9999ubash/\M-E\M-^G\M-D\M-^L\M-E\M-$\M-C\M-^D\M-C\M-^\\M-C\M-^T\M-E\M-^G\M-E\M-
\M-E\M-$\M-E\M-=\M-D\M-=$ ll
total 0
-rw-r--r-- 1 ubash ubash 0 2006-04-14 14:28 ÉĹÓÁÚ
-rw-r--r-- 1 ubash ubash 0 2006-04-14 21:18 ĺľščťžýáíéňäú
ťžýáíé-^-äú
touch: cannot touch `ĺľščťžýáíéňäú': Prístup odmietnutý
/tmp/9999ubash/\M-E\M-^G\M-D\M-^L\M-E\M-$\M-C\M-^D\M-C\M-^\\M-C\M-^T\M-E\M-^G\M-E\M-
\M-E\M-$\M-E\M-=\M-D\M-=$
****************
This bothers me, because I "push" :) zsh on everyone, and it is only out
of habit and for availability everywhere that I write scripts in bash.
Otherwise I have zsh on every console. And now I find out that bash has
known UTF-8 for a long time, while zsh only does from 4.3 (which does
exist, but I am not going to migrate away from stable because of it), and
full support is not coming even in 4.4 :-(.
****************
http://www.cs.elte.hu/pub/zsh/FAQ
2.7: What is zsh's support for Unicode/UTF-8?
`Unicode', or UCS for Universal Character Set, is the modern way
of specifying character sets. It replaces a large number of ad hoc
ways of supporting character sets beyond ASCII. `UTF-8' is an
encoding of Unicode that is particularly natural on Unix-like systems.
Q: Does zsh support UTF-8?
A: zsh's built-in printf command supports "\u" and "\U" escapes
to output arbitrary Unicode characters. ZLE (the Zsh Line Editor) has
no concept of character encodings, and is confused by multi-octet
encodings.
Q: Why doesn't zsh have proper UTF-8 support?
A: The code has not been written yet.
Q: What makes UTF-8 support difficult to implement?
A: In order to handle arbitrary encodings the correct way, significant
and intrusive changes must be made to the shell.
Q: Why can't zsh just use readline?
A: ZLE is not encapsulated from the rest of the shell. Isolating it
such that it could be replaced by readline would be a significant
effort. Furthermore, using readline would effect a significant loss of
features.
Q: What changes are planned?
A: Introduction of Unicode support will be gradual, so if you are
interested in being involved you should join the zsh-workers mailing
list. As a first step ZLE will be rewritten to use wide characters
internally. Character based widgets can then operate on a single wide
character instead of a single byte, and the proper display width can be
calculated with wcswidth().
[cut]
Chapter 5: Multibyte input
5.1: What is multibyte input?
For a long time computers had a simple idea of a character: each octet
(8-bit byte) of text contained one character. This meant an application
could only use 256 characters at once. The first 128 characters (0 to
127) on Unix and similar systems usually corresponded to the ASCII
character set, as they still do. So all other possibilities had to be
crammed into the remaining 128. This was done by picking the appropriate
character set for the use you were making. For example, ISO 8859
specified a set of extensions to ASCII for various alphabets.
This was fine for simple extensions and certain short enough relatives of
the Latin alphabet (with no more than a few dozen alphabetic characters),
but useless for complex alphabets. Also, having a different character
set for each language is inconvenient: you have to start a new terminal
to run the shell with each character set. So the character set had to be
extended. To cut a long story short, the world has mostly standardised
on a character set called Unicode, related to the international standard
ISO 10646. The intention is that this will contain every single
character used in all the languages of the world.
This has far too many characters to fit into a single octet. What's
more, UNIX utilities such as zsh are so used to dealing with ASCII that
removing it would cause no end of trouble. So what happens is this: the
128 ASCII characters are kept exactly the same (and they're the same as
the first 128 characters of Unicode), but the remaining 128 characters
are used to build up any other Unicode character by combining multiple
octets together. The shell doesn't need to interpret these directly; it
just needs to ask the system library how many octets form the next
character, and if there's a valid character there at all. (It can also
ask the system what width the character takes up on the screen, so that
characters no longer need to be exactly one position wide.)
The way this is done is called UTF-8. Multibyte encodings of other
character sets exist (you might encounter them for Asian character sets);
zsh will be able to use any such encoding as long as it contains ASCII as
a single-octet subset and the system can provide information about other
characters. However, in the case of Unicode, UTF-8 is the only one you
are likely to encounter.
(In case you're confused: Unicode is the character set, while UTF-8 is
an encoding of it. You might hear about other encodings, such as UCS-2
and UCS-4 which are basically the character's index in the character set
as a two-octet or four-octet integer. You might see files encoded this
way, for example on Windows, but the shell can't deal directly with text
in those formats.)
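As a rough illustration of the octet counts described above, the following sketch pipes a few UTF-8 encoded characters through wc -c, which counts octets regardless of the locale. The helper name octets is made up for this example, and the characters are written as octal escapes so the script does not depend on the encoding of the file it is saved in.

```shell
# Hypothetical helper: count the octets produced by a printf format string.
# The argument must not contain '%' because it is used as the format.
octets() {
  printf "$1" | wc -c | tr -d ' '
}

octets 'a'              # ASCII letter a (U+0061)
octets '\304\276'       # l with caron (U+013E), octets C4 BE
octets '\342\202\254'   # euro sign (U+20AC), octets E2 82 AC
```

The ASCII letter occupies a single octet, while the other two characters occupy two and three octets respectively.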
5.2: How does zsh handle multibyte input?
Until version 4.3, zsh didn't handle multibyte input properly at all.
Each octet in a multibyte character would look to the shell like a
separate character. If your terminal handled the character set,
characters might appear correct on screen, but trying to edit them would
cause all sorts of odd effects. (It was possible to edit in zsh using
single-byte extensions of ASCII such as the ISO 8859 family, however.)
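The zsh transcript earlier in this message shows what "each octet looks like a separate character" means in practice: the directory name appears as a run of \M- sequences. The sketch below converts one such sequence into its octet value. It assumes the sequences follow the conventional meta notation (\M-x is the character x with the high bit set, \M-^x is the control character for x with the high bit set); the function name meta_to_hex is made up for this example.

```shell
# Hypothetical helper: convert one meta escape, written without the leading
# backslash (for example M-E or M-^G), into a hexadecimal octet.
meta_to_hex() {
  body=${1#M-}
  case $body in
    ^?)
      # Control form: flip bit 6 of the character, then set the high bit.
      c=${body#^}
      code=$(( ($(printf '%d' "'$c") ^ 64) | 128 ))
      ;;
    ?)
      # Plain form: set the high bit of the character.
      code=$(( $(printf '%d' "'$body") | 128 ))
      ;;
    *)
      echo "unrecognised escape: $1" >&2
      return 1
      ;;
  esac
  printf '%02X\n' "$code"
}

# The first two escapes of the directory name in the transcript.
meta_to_hex 'M-E'
meta_to_hex 'M-^G'
```

Under that assumption the two escapes give the octets C5 and 87, which together are the UTF-8 encoding of U+0147, the first letter of the directory name.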
From version 4.3, multibyte input is handled in the line editor if zsh
has been compiled with the appropriate definitions. This will happen
automatically if the compiler defines __STDC_ISO_10646__, which is true
for many recent GNU-based systems. On other systems you must pass the
argument --enable-multibyte to configure. Explicit use of
--enable-multibyte should work on many other recent UNIX systems; if it
works on yours, and that's not mentioned in the shell documentation,
please report this to zsh-workers@sunsite.dk, and if it doesn't but you
can work out why not we'd also be interested in hearing.
(The reason for the test for __STDC_ISO_10646__ is that its presence
happens to indicate that the required library support is likely to be
present, short-circuiting a large number of configuration tests. This
isn't strictly guaranteed, since the definition indicates the rather more
limited fact that the wide character representation used internally by
the shell is Unicode. However, in practice such systems provide the
right level of support for zsh to use. It would be better to test
individually for the library features the shell needs; unfortunately
there are a lot of them.)
You can test if multibyte handling is compiled into your version of the
shell by running:
(bindkey -m)
which should output a warning:
bindkey: warning: `bindkey -m' disables multibyte support
If it doesn't, you don't have multibyte support in your shell. The
parentheses are there to run the command in a subshell, which protects
your interactive shell from the effects being warned about.
Multibyte strings are not yet handled anywhere else in the shell. This
means, for example, patterns treat multibyte characters as a set of
single octets and the ${#var} syntax counts octets, not characters. There
will probably be new syntax to ensure that zsh can work both in its
traditional way as well as when interpreting multibyte characters.
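The difference between counting octets and counting characters can be sketched as follows. This is only an approximation of the behaviour described above: it assumes that forcing the C locale makes the running shell byte-oriented, much like a zsh without multibyte handling, and the variable names are made up for this example.

```shell
# Two characters (l with caron, s with caron) as UTF-8: octets C4 BE C5 A1.
s=$(printf '\304\276\305\241')

# Assumption: in the C locale the shell measures the string in octets.
# The command substitution runs in a subshell, so the locale change is local.
octet_len=$(LC_ALL=C; printf '%s' "${#s}")

# Locale-independent cross-check: wc -c always counts octets.
wc_len=$(printf '%s' "$s" | wc -c | tr -d ' ')

echo "\${#s} in the C locale: $octet_len"
echo "wc -c: $wc_len"
```

Both counts come out as 4 although the string holds only two characters.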
5.3: How do I ensure multibyte input works on my system?
Once you have a version of zsh with multibyte support, you need to
ensure the environment is correct. We'll assume you're using UTF-8.
Many modern systems may come set up correctly already. Try one of
the editing widgets described in the next section to see.
There are basically three components.
o The locale. This describes a whole series of features specific
to countries or regions of which the character set is one. Usually
it is controlled by the environment variable LANG (there are
others but this is the one to start with). You need to find a
locale whose name contains `UTF-8'. This will be a variant on
your usual locale, which typically indicates the language and
country; for example, mine is `en_GB.UTF-8'. Luckily, zsh can
complete locale names, so if you have the new completion system
loaded you can type `export LANG=' and attempt to complete a
suitable locale. It's the locale that tells the shell to expect the
right form of multibyte input. (However, there's no guarantee that
the shell is actually going to get this input: for example, if you
edit file names that have been created using a different character
set it won't work properly.)
o The terminal emulator. Those that are supplied with a recent
desktop environment, such as gnome-terminal, are likely to have
extensive support for localization and may work correctly as soon
as they know the locale. You can enable UTF-8 support for
xterm in its application defaults file. The following are
the relevant resources; you don't actually need all of them, as
described below. If you use a `~/.Xdefaults' or
`~/.Xresources' file for setting resources, prefix all the lines
with `xterm':
*wideChars: true
*locale: true
*utf8: 1
*vt100Graphics: true
This turns on support for wide characters (this is enabled by the
utf8 resource, too); enables conversions to UTF-8 from other
locales (this is the key resource and actually overrides
`utf8'); turns on UTF-8 mode (this resource is mostly used to
force use of UTF-8 characters if your locale system isn't up to it);
and allows certain graphic characters to work even with UTF-8
enabled. (Thanks to Phil Pennock for suggestions.)
o The font. If you selected this from a menu in your terminal
emulator, there's a good chance it already selected the right
character set to go with it. If you hand-picked an old fashioned
X font with a lot of dashes, you need to make sure it ends with
the right character encoding, `iso10646-1' (and not, for
example, `iso8859-1'). Not all characters will be available
in any font, and some fonts may have a more restricted range of
Unicode characters than others.
As mentioned in the previous section, `bindkey -m' now outputs
a warning message telling you that multibyte input from the terminal
is likely not to work. (See 3.5 if you don't know what
this feature does.) If your terminal doesn't have characters
that need to be input as multibyte, however, you can still use
the meta bindings and can ignore the warning message. Use
`bindkey -m 2>/dev/null' to suppress it.
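For the first of the three components, the locale, a quick check might look like the sketch below. The function names are made up for this example, and matching on the locale name is only a heuristic: it follows the advice above to look for a name containing `UTF-8' and does not verify that the locale is actually installed.

```shell
# Hypothetical helper: succeed if a locale name advertises UTF-8.
is_utf8_locale() {
  case $1 in
    *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the locale that governs the character set.
# LC_ALL overrides LC_CTYPE, which in turn overrides LANG.
effective_ctype() {
  printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

if is_utf8_locale "$(effective_ctype)"; then
  echo "locale looks like UTF-8: $(effective_ctype)"
else
  echo "locale does not look like UTF-8: $(effective_ctype)"
fi
```

The terminal emulator and the font still have to be checked by hand, for example by trying one of the editing widgets from the next section.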
5.4: How can I input characters that aren't on my keyboard?
Two functions are provided with zsh that help you input characters.
As with all editing widgets implemented by functions, you need to
mark the function for autoload, create the widget, and, if you are
going to use it frequently, bind it to a key sequence. The
following binds insert-composed-char to F5 on my keyboard:
autoload -Uz insert-composed-char
zle -N insert-composed-char
bindkey '\e[15~' insert-composed-char
The two widgets are described in the zshcontrib(1) manual
page, but here is a brief summary:
insert-composed-char is followed by two characters that
are a mnemonic for a multibyte character. For example `a:'
is a with an Umlaut; `cH' is the symbol for hearts on a playing
card. Various accented characters, European and related alphabets,
and punctuation and mathematical symbols are available. The
mnemonics are mostly those given by RFC 1345, see
http://www.faqs.org/rfcs/rfc1345.html.
insert-unicode-char is used to input a Unicode character by
its hexadecimal number. This is the number given in the Unicode
character charts, see for example http://www.unicode.org/charts/.
You need to execute the function, then type the hexadecimal number
(you can omit any leading zeroes), then execute the function again.
Both functions can be used without multibyte mode, provided the locale is
correct and the character selected exists in the current character set;
however, using UTF-8 massively extends the number of valid characters
that can be produced.
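To see which octets a hexadecimal number from the Unicode charts turns into once it is encoded as UTF-8, here is a minimal sketch using shell arithmetic. It is not how insert-unicode-char is implemented; the function name utf8_octets is made up for this example, and the sketch does not reject invalid code points such as surrogates.

```shell
# Hypothetical helper: print the UTF-8 octets (in hex) for a Unicode code
# point given as a hexadecimal number without any prefix, for example 13E.
utf8_octets() {
  cp=$(( 0x$1 ))
  if [ "$cp" -lt 128 ]; then
    # ASCII range: a single octet, identical to the code point.
    printf '%02X\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then
    # Two octets: 110xxxxx 10xxxxxx
    printf '%02X %02X\n' \
      $(( 0xC0 | (cp >> 6) )) \
      $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then
    # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X\n' \
      $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X %02X\n' \
      $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_octets 41     # Latin capital letter A
utf8_octets 13E    # Latin small letter l with caron
utf8_octets 20AC   # euro sign
```

The three calls print 41, C4 BE and E2 82 AC, which shows why a single keystroke for an accented letter reaches the shell as more than one octet.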
Thanks.
--
Jan Kunder
jan.kunderHATESPAMgmail.com