<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="color-scheme" content="light dark">
<title>PEP 528 Change Windows console encoding to UTF-8 | peps.python.org</title>
<link rel="shortcut icon" href="../_static/py.png">
<link rel="canonical" href="https://peps.python.org/pep-0528/">
<link rel="stylesheet" href="../_static/style.css" type="text/css">
<link rel="stylesheet" href="../_static/mq.css" type="text/css">
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" media="(prefers-color-scheme: light)" id="pyg-light">
<link rel="stylesheet" href="../_static/pygments_dark.css" type="text/css" media="(prefers-color-scheme: dark)" id="pyg-dark">
<link rel="alternate" type="application/rss+xml" title="Latest PEPs" href="https://peps.python.org/peps.rss">
<meta property="og:title" content='PEP 528 Change Windows console encoding to UTF-8 | peps.python.org'>
<meta property="og:type" content="website">
<meta property="og:url" content="https://peps.python.org/pep-0528/">
<meta property="og:site_name" content="Python Enhancement Proposals (PEPs)">
<meta property="og:image" content="https://peps.python.org/_static/og-image.png">
<meta property="og:image:alt" content="Python PEPs">
<meta property="og:image:width" content="200">
<meta property="og:image:height" content="200">
<meta name="description" content="Python Enhancement Proposals (PEPs)">
<meta name="theme-color" content="#3776ab">
</head>
<body>
<svg xmlns="http://www.w3.org/2000/svg" style="display: none;">
<symbol id="svg-sun-half" viewBox="0 0 24 24" pointer-events="all">
<title>Following system colour scheme</title>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<circle cx="12" cy="12" r="9"></circle>
<path d="M12 3v18m0-12l4.65-4.65M12 14.3l7.37-7.37M12 19.6l8.85-8.85"></path>
</svg>
</symbol>
<symbol id="svg-moon" viewBox="0 0 24 24" pointer-events="all">
<title>Selected dark colour scheme</title>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<path stroke="none" d="M0 0h24v24H0z" fill="none"></path>
<path d="M12 3c.132 0 .263 0 .393 0a7.5 7.5 0 0 0 7.92 12.446a9 9 0 1 1 -8.313 -12.454z"></path>
</svg>
</symbol>
<symbol id="svg-sun" viewBox="0 0 24 24" pointer-events="all">
<title>Selected light colour scheme</title>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<circle cx="12" cy="12" r="5"></circle>
<line x1="12" y1="1" x2="12" y2="3"></line>
<line x1="12" y1="21" x2="12" y2="23"></line>
<line x1="4.22" y1="4.22" x2="5.64" y2="5.64"></line>
<line x1="18.36" y1="18.36" x2="19.78" y2="19.78"></line>
<line x1="1" y1="12" x2="3" y2="12"></line>
<line x1="21" y1="12" x2="23" y2="12"></line>
<line x1="4.22" y1="19.78" x2="5.64" y2="18.36"></line>
<line x1="18.36" y1="5.64" x2="19.78" y2="4.22"></line>
</svg>
</symbol>
</svg>
<script>
document.documentElement.dataset.colour_scheme = localStorage.getItem("colour_scheme") || "auto"
</script>
<section id="pep-page-section">
<header>
<h1>Python Enhancement Proposals</h1>
<ul class="breadcrumbs">
<li><a href="https://www.python.org/" title="The Python Programming Language">Python</a> &raquo; </li>
<li><a href="../pep-0000/">PEP Index</a> &raquo; </li>
<li>PEP 528</li>
</ul>
<button id="colour-scheme-cycler" onClick="setColourScheme(nextColourScheme())">
<svg aria-hidden="true" class="colour-scheme-icon-when-auto"><use href="#svg-sun-half"></use></svg>
<svg aria-hidden="true" class="colour-scheme-icon-when-dark"><use href="#svg-moon"></use></svg>
<svg aria-hidden="true" class="colour-scheme-icon-when-light"><use href="#svg-sun"></use></svg>
<span class="visually-hidden">Toggle light / dark / auto colour theme</span>
</button>
</header>
<article>
<section id="pep-content">
<h1 class="page-title">PEP 528 – Change Windows console encoding to UTF-8</h1>
<dl class="rfc2822 field-list simple">
<dt class="field-odd">Author<span class="colon">:</span></dt>
<dd class="field-odd">Steve Dower &lt;steve.dower&#32;&#97;t&#32;python.org&gt;</dd>
<dt class="field-even">Status<span class="colon">:</span></dt>
<dd class="field-even"><abbr title="Accepted and implementation complete, or no longer active">Final</abbr></dd>
<dt class="field-odd">Type<span class="colon">:</span></dt>
<dd class="field-odd"><abbr title="Normative PEP with a new feature for Python, implementation change for CPython or interoperability standard for the ecosystem">Standards Track</abbr></dd>
<dt class="field-even">Created<span class="colon">:</span></dt>
<dd class="field-even">27-Aug-2016</dd>
<dt class="field-odd">Python-Version<span class="colon">:</span></dt>
<dd class="field-odd">3.6</dd>
<dt class="field-even">Post-History<span class="colon">:</span></dt>
<dd class="field-even">01-Sep-2016, 04-Sep-2016</dd>
<dt class="field-odd">Resolution<span class="colon">:</span></dt>
<dd class="field-odd"><a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-September/146278.html">Python-Dev message</a></dd>
</dl>
<hr class="docutils" />
<section id="contents">
<details><summary>Table of Contents</summary><ul class="simple">
<li><a class="reference internal" href="#abstract">Abstract</a></li>
<li><a class="reference internal" href="#specific-changes">Specific Changes</a><ul>
<li><a class="reference internal" href="#add-io-windowsconsoleio">Add _io.WindowsConsoleIO</a></li>
<li><a class="reference internal" href="#add-pyos-windowsconsolereadline">Add _PyOS_WindowsConsoleReadline</a></li>
<li><a class="reference internal" href="#add-legacy-mode">Add legacy mode</a></li>
</ul>
</li>
<li><a class="reference internal" href="#alternative-approaches">Alternative Approaches</a></li>
<li><a class="reference internal" href="#code-that-may-break">Code that may break</a><ul>
<li><a class="reference internal" href="#assuming-stdin-stdout-encoding">Assuming stdin/stdout encoding</a></li>
<li><a class="reference internal" href="#incorrectly-using-the-raw-object">Incorrectly using the raw object</a></li>
<li><a class="reference internal" href="#using-the-raw-object-with-small-buffers">Using the raw object with small buffers</a></li>
</ul>
</li>
<li><a class="reference internal" href="#copyright">Copyright</a></li>
</ul>
</details></section>
<section id="abstract">
<h2><a class="toc-backref" href="#abstract" role="doc-backlink">Abstract</a></h2>
<p>Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have been long
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page.</p>
<p>This PEP proposes changing the default standard stream implementation on Windows
to use the Unicode APIs. This will allow users to print and input the full range
of Unicode characters at the default Windows console. This also requires a
subtle change to how the tokenizer parses text from readline hooks.</p>
</section>
<section id="specific-changes">
<h2><a class="toc-backref" href="#specific-changes" role="doc-backlink">Specific Changes</a></h2>
<section id="add-io-windowsconsoleio">
<h3><a class="toc-backref" href="#add-io-windowsconsoleio" role="doc-backlink">Add _io.WindowsConsoleIO</a></h3>
<p>Currently an instance of <code class="docutils literal notranslate"><span class="pre">_io.FileIO</span></code> is used to wrap the file descriptors
representing standard input, output and error. We add a new class (implemented
in C) <code class="docutils literal notranslate"><span class="pre">_io.WindowsConsoleIO</span></code> that acts as a raw IO object using the Windows
console functions, specifically, <code class="docutils literal notranslate"><span class="pre">ReadConsoleW</span></code> and <code class="docutils literal notranslate"><span class="pre">WriteConsoleW</span></code>.</p>
<p>This class will be used when a standard stream is opened by file descriptor,
the stream is a console buffer rather than a redirected file, and the legacy-mode
flag is not in effect. Otherwise, <code class="docutils literal notranslate"><span class="pre">_io.FileIO</span></code> will be used as it is today.</p>
<p>This is a raw (bytes) IO class that requires text to be passed encoded with
utf-8, which will be decoded to utf-16-le and passed to the Windows APIs.
Similarly, bytes read from the class will be provided by the operating system as
utf-16-le and converted into utf-8 when returned to Python.</p>
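<p>As a rough illustration of that round trip, the sketch below performs the two
transcoding steps in pure Python. The helper names <code class="docutils literal notranslate"><span class="pre">to_console_units</span></code> and
<code class="docutils literal notranslate"><span class="pre">from_console_units</span></code> are hypothetical and do not exist in CPython; the real
class does this work in C around the console functions.</p>

```python
def to_console_units(utf8_bytes: bytes) -> bytes:
    """Transcode UTF-8 bytes (what Python code writes) to UTF-16-LE."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


def from_console_units(utf16_bytes: bytes) -> bytes:
    """Transcode UTF-16-LE bytes (what the console returns) to UTF-8."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")


# Illustrative text containing non-ASCII characters (assumed sample data).
text = "\u03c0 \u2248 3.14"
written = text.encode("utf-8")            # bytes handed to the raw object
on_the_wire = to_console_units(written)   # bytes in the form the OS expects

# The two steps are inverses of each other for valid input.
assert from_console_units(on_the_wire) == written
```

<p>Note that the byte counts differ between the two forms, which is why callers
of the raw object cannot assume a fixed ratio between bytes requested and
characters received.</p>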
<p>The use of an ASCII compatible encoding is required to maintain compatibility
with code that bypasses the <code class="docutils literal notranslate"><span class="pre">TextIOWrapper</span></code> and directly writes ASCII bytes to
the standard streams (for example, <a class="reference external" href="https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py">Twisted's process_stdinreader.py</a>). Code that assumes
a particular encoding for the standard streams other than ASCII will likely
break.</p>
</section>
<section id="add-pyos-windowsconsolereadline">
<h3><a class="toc-backref" href="#add-pyos-windowsconsolereadline" role="doc-backlink">Add _PyOS_WindowsConsoleReadline</a></h3>
<p>To allow Unicode entry at the interactive prompt, a new readline hook is
required. The existing <code class="docutils literal notranslate"><span class="pre">PyOS_StdioReadline</span></code> function will delegate to the new
<code class="docutils literal notranslate"><span class="pre">_PyOS_WindowsConsoleReadline</span></code> function when reading from a file descriptor
that is a console buffer and the legacy-mode flag is not in effect (the logic
should be identical to above).</p>
<p>Since the readline interface is required to return an 8-bit encoded string with
no embedded nulls, the <code class="docutils literal notranslate"><span class="pre">_PyOS_WindowsConsoleReadline</span></code> function transcodes from
utf-16-le as read from the operating system into utf-8.</p>
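<p>The reason transcoding is needed can be seen in a small sketch: utf-16-le
represents every ASCII character with a trailing zero byte, so it cannot be
returned through an interface that forbids embedded nulls, while utf-8 only
produces a zero byte for U+0000 itself. The function name
<code class="docutils literal notranslate"><span class="pre">readline_result</span></code> is an assumption made for illustration and is not the C
implementation.</p>

```python
def readline_result(utf16_le_line: bytes) -> bytes:
    """Return a line as UTF-8, rejecting embedded nulls (illustrative only)."""
    utf8_line = utf16_le_line.decode("utf-16-le").encode("utf-8")
    if b"\x00" in utf8_line:
        raise ValueError("line contains an embedded null character")
    return utf8_line


# A line as the console might deliver it (assumed sample input).
raw_line = "caf\u00e9\n".encode("utf-16-le")

# Every ASCII character carries a zero high byte in UTF-16-LE.
has_nulls_before = b"\x00" in raw_line

line = readline_result(raw_line)
has_nulls_after = b"\x00" in line
```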
<p>The function <code class="docutils literal notranslate"><span class="pre">PyRun_InteractiveOneObject</span></code> which currently obtains the encoding
from <code class="docutils literal notranslate"><span class="pre">sys.stdin</span></code> will select utf-8 unless the legacy-mode flag is in effect.
This may require readline hooks to change their encodings to utf-8, or to
require legacy-mode for correct behaviour.</p>
</section>
<section id="add-legacy-mode">
<h3><a class="toc-backref" href="#add-legacy-mode" role="doc-backlink">Add legacy mode</a></h3>
<p>Launching Python with the environment variable <code class="docutils literal notranslate"><span class="pre">PYTHONLEGACYWINDOWSSTDIO</span></code> set
will enable the legacy-mode flag, which completely restores the previous
behaviour.</p>
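<p>A minimal sketch of opting a child interpreter into legacy mode is shown
below. The helper <code class="docutils literal notranslate"><span class="pre">run_with_legacy_stdio</span></code> is hypothetical, and the value
<code class="docutils literal notranslate"><span class="pre">"1"</span></code> is an assumption; this document only requires that the variable be
set. Because the sketch captures the child's output through pipes rather than a
console, it only demonstrates that the variable reaches the child, not the
change in console behaviour.</p>

```python
import os
import subprocess
import sys


def run_with_legacy_stdio(code: str) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with the legacy-mode variable set."""
    # Copy the current environment and add the flag (value "1" is assumed).
    env = dict(os.environ, PYTHONLEGACYWINDOWSSTDIO="1")
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


result = run_with_legacy_stdio(
    "import os; print(os.environ.get('PYTHONLEGACYWINDOWSSTDIO'))"
)
```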
</section>
</section>
<section id="alternative-approaches">
<h2><a class="toc-backref" href="#alternative-approaches" role="doc-backlink">Alternative Approaches</a></h2>
<p>The <a class="reference external" href="https://pypi.org/project/win_unicode_console/">win_unicode_console package</a> is a pure-Python alternative to changing the
default behaviour of the console. It implements essentially the same
modifications as described here using pure Python code.</p>
</section>
<section id="code-that-may-break">
<h2><a class="toc-backref" href="#code-that-may-break" role="doc-backlink">Code that may break</a></h2>
<p>The following code patterns may break or see different behaviour as a result of
this change. All of these code samples require explicitly choosing to use a raw
file object in place of a more convenient wrapper that would prevent any visible
change.</p>
<section id="assuming-stdin-stdout-encoding">
<h3><a class="toc-backref" href="#assuming-stdin-stdout-encoding" role="doc-backlink">Assuming stdin/stdout encoding</a></h3>
<p>Code that assumes that the encoding required by <code class="docutils literal notranslate"><span class="pre">sys.stdin.buffer</span></code> or
<code class="docutils literal notranslate"><span class="pre">sys.stdout.buffer</span></code> is <code class="docutils literal notranslate"><span class="pre">'mbcs'</span></code> or a more specific encoding may currently be
working by chance, but could encounter issues under this change. For example:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">&#39;mbcs&#39;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">r</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">&#39;cp437&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>To correct this code, the encoding specified on the <code class="docutils literal notranslate"><span class="pre">TextIOWrapper</span></code> should be
used, either implicitly or explicitly:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="c1"># Fix 1: Use wrapper correctly</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">r</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># Fix 2: Use encoding explicitly</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">encoding</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">r</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">encoding</span><span class="p">)</span>
</pre></div>
</div>
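<p>The second fix can be wrapped in a small helper, sketched here under the
assumption that the stream is a <code class="docutils literal notranslate"><span class="pre">TextIOWrapper</span></code>-like object exposing
<code class="docutils literal notranslate"><span class="pre">encoding</span></code> and <code class="docutils literal notranslate"><span class="pre">buffer</span></code>. The name <code class="docutils literal notranslate"><span class="pre">write_encoded</span></code> is hypothetical, and
an in-memory stream stands in for <code class="docutils literal notranslate"><span class="pre">sys.stdout</span></code> so the example runs anywhere.</p>

```python
import io


def write_encoded(stream, text: str) -> int:
    """Write text to stream.buffer using the encoding the wrapper declares."""
    # Flush pending text first so bytes and text are not interleaved out of order.
    stream.flush()
    data = text.encode(stream.encoding)
    return stream.buffer.write(data)


# Stand-in for sys.stdout; on a Windows console the declared encoding is utf-8.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
written_count = write_encoded(fake_stdout, "caf\u00e9")
```

<p>The helper never names an encoding itself, so it keeps working whether the
stream is a console, a redirected file, or running in legacy mode.</p>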
</section>
<section id="incorrectly-using-the-raw-object">
<h3><a class="toc-backref" href="#incorrectly-using-the-raw-object" role="doc-backlink">Incorrectly using the raw object</a></h3>
<p>Code that uses the raw IO object and does not correctly handle partial reads and
writes may be affected. This is particularly important for reads, where the
number of characters read will never exceed one-fourth of the number of bytes
allowed, as there is no feasible way to prevent input from encoding as much
longer utf-8 strings:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">raw_stdin</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">raw</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">raw_stdin</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
<span class="go">abcdefghijklm</span>
<span class="go">b&#39;abc&#39;</span>
<span class="go"># data contains at most 3 characters, and never more than 12 bytes</span>
<span class="go"># error, as &quot;defghijklm\r\n&quot; is passed to the interactive prompt</span>
</pre></div>
</div>
<p>To correct this code, the buffered reader/writer should be used, or the caller
should continue reading until its buffer is full:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="c1"># Fix 1: Use the buffered reader/writer</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">stdin</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">buffer</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">stdin</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
<span class="go">abcdefghijklm</span>
<span class="go">b&#39;abcdefghijklm\r\n&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># Fix 2: Loop until enough bytes have been read</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">raw_stdin</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">raw</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">b</span> <span class="o">=</span> <span class="sa">b</span><span class="s1">&#39;&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">15</span><span class="p">:</span>
<span class="gp">... </span>    <span class="n">b</span> <span class="o">+=</span> <span class="n">raw_stdin</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
<span class="go">abcdefghijklm</span>
<span class="go">b&#39;abcdefghijklm\r\n&#39;</span>
</pre></div>
</div>
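<p>The loop in the second fix does not stop if the stream reaches end-of-file
before enough bytes arrive. A slightly more defensive sketch is given below; the
helper <code class="docutils literal notranslate"><span class="pre">read_at_least</span></code> and the test class <code class="docutils literal notranslate"><span class="pre">ShortReads</span></code> are hypothetical,
with <code class="docutils literal notranslate"><span class="pre">ShortReads</span></code> imitating a raw stream that returns fewer bytes than
requested.</p>

```python
import io


def read_at_least(raw, n: int, chunk: int = 4096) -> bytes:
    """Read from a raw stream until at least n bytes arrive or EOF is hit."""
    parts = []
    total = 0
    while n > total:
        data = raw.read(chunk)
        if not data:  # b"" at EOF, or None when no data is available
            break
        parts.append(data)
        total += len(data)
    return b"".join(parts)


class ShortReads(io.RawIOBase):
    """Test double: returns at most `limit` bytes per read call."""

    def __init__(self, payload: bytes, limit: int = 3):
        self._payload = payload
        self._pos = 0
        self._limit = limit

    def readable(self) -> bool:
        return True

    def readinto(self, buf) -> int:
        n = min(len(buf), self._limit, len(self._payload) - self._pos)
        buf[:n] = self._payload[self._pos:self._pos + n]
        self._pos += n
        return n


full = read_at_least(ShortReads(b"abcdefghijklm\r\n"), 15)
short = read_at_least(ShortReads(b"abc"), 15)  # EOF before 15 bytes
```

<p>Like the original loop, this may return more than <code class="docutils literal notranslate"><span class="pre">n</span></code> bytes, so callers
that need an exact count should slice the result and keep the remainder.</p>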
</section>
<section id="using-the-raw-object-with-small-buffers">
<h3><a class="toc-backref" href="#using-the-raw-object-with-small-buffers" role="doc-backlink">Using the raw object with small buffers</a></h3>
<p>Code that uses the raw IO object and attempts to read fewer than four bytes
will now receive an error. Because any single character may require up to four
bytes when represented in utf-8, a smaller request cannot be guaranteed to hold
even one character, so it must fail:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">raw_stdin</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">raw</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">raw_stdin</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="gt">Traceback (most recent call last):</span>
  File <span class="nb">&quot;&lt;stdin&gt;&quot;</span>, line <span class="m">1</span>, in <span class="n">&lt;module&gt;</span>
<span class="gr">ValueError</span>: <span class="n">must read at least 4 bytes</span>
</pre></div>
</div>
<p>The only workaround is to pass a larger buffer:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="c1"># Fix: Request at least four bytes</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">raw_stdin</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">buffer</span><span class="o">.</span><span class="n">raw</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">raw_stdin</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
<span class="go">a</span>
<span class="go">b&#39;a&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="o">&gt;&gt;&gt;</span>
</pre></div>
</div>
<p>(The extra <code class="docutils literal notranslate"><span class="pre">&gt;&gt;&gt;</span></code> is due to the newline remaining in the input buffer and is
expected in this situation.)</p>
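<p>The four-byte minimum follows from the widest possible utf-8 sequence, which
the sketch below measures for one sample character of each width. The constant
<code class="docutils literal notranslate"><span class="pre">MIN_RAW_READ</span></code> and the helper <code class="docutils literal notranslate"><span class="pre">clamp_read_size</span></code> are hypothetical names, not
part of the <code class="docutils literal notranslate"><span class="pre">io</span></code> module.</p>

```python
MIN_RAW_READ = 4  # matches the "must read at least 4 bytes" error shown above


def utf8_width(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample per width: ASCII, Latin-1 range, BMP, and beyond the BMP.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
widths = [utf8_width(ch) for ch in samples]


def clamp_read_size(requested: int) -> int:
    """Raise a too-small request to the minimum the raw console accepts."""
    return max(requested, MIN_RAW_READ)
```

<p>A caller that clamps its request this way may receive more bytes than it
originally wanted and needs to buffer the excess itself, which is what the
buffered reader already does.</p>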
</section>
</section>
<section id="copyright">
<h2><a class="toc-backref" href="#copyright" role="doc-backlink">Copyright</a></h2>
<p>This document has been placed in the public domain.</p>
</section>
</section>
<hr class="docutils" />
<p>Source: <a class="reference external" href="https://github.com/python/peps/blob/main/peps/pep-0528.rst">https://github.com/python/peps/blob/main/peps/pep-0528.rst</a></p>
<p>Last modified: <a class="reference external" href="https://github.com/python/peps/commits/main/peps/pep-0528.rst">2023-09-09 17:39:29 GMT</a></p>
</article>
<nav id="pep-sidebar">
<h2>Contents</h2>
<ul>
<li><a class="reference internal" href="#abstract">Abstract</a></li>
<li><a class="reference internal" href="#specific-changes">Specific Changes</a><ul>
<li><a class="reference internal" href="#add-io-windowsconsoleio">Add _io.WindowsConsoleIO</a></li>
<li><a class="reference internal" href="#add-pyos-windowsconsolereadline">Add _PyOS_WindowsConsoleReadline</a></li>
<li><a class="reference internal" href="#add-legacy-mode">Add legacy mode</a></li>
</ul>
</li>
<li><a class="reference internal" href="#alternative-approaches">Alternative Approaches</a></li>
<li><a class="reference internal" href="#code-that-may-break">Code that may break</a><ul>
<li><a class="reference internal" href="#assuming-stdin-stdout-encoding">Assuming stdin/stdout encoding</a></li>
<li><a class="reference internal" href="#incorrectly-using-the-raw-object">Incorrectly using the raw object</a></li>
<li><a class="reference internal" href="#using-the-raw-object-with-small-buffers">Using the raw object with small buffers</a></li>
</ul>
</li>
<li><a class="reference internal" href="#copyright">Copyright</a></li>
</ul>
<br>
<a id="source" href="https://github.com/python/peps/blob/main/peps/pep-0528.rst">Page Source (GitHub)</a>
</nav>
</section>
<script src="../_static/colour_scheme.js"></script>
<script src="../_static/wrap_tables.js"></script>
<script src="../_static/sticky_banner.js"></script>
</body>
</html>