Python "Black Magic": Encoding & Decoding

鄒強, published 2019-07-25 10:38

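Abstract: I can tell you plainly that this is not Python, yet the Python interpreter can run it. This black magic starts with PEP 263, whose authors envisioned a special comment at the top of a file to specify the encoding of the source. codecs exposes a register function for registering custom encodings. The so-called black magic is not mysterious: follow the existing pattern and define the required interface.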

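First published on my blog; please credit the source when reposting.

Before we begin

This is an introductory article.

The examples in this article ran successfully on Ubuntu 14.04 / Python 2.7.11. The Python 3+ interfaces differ slightly, so readers will need to adapt the code themselves.

Prelude

Let's start with a piece of code:

example.py: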

# -*- coding=yi -*-

從 math 導入 sin, pi

打印 "sin(pi) =", sin(pi)

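What is this?! Is it Python? Can it even run? You are probably asking exactly that.

I can tell you plainly: this is not Python, but it can be run by the Python interpreter. If you like, you can call it "Yython" (易语言, the Yi language, + Python).

How is that possible? You may have noticed the strange comment on the first line. That's right: the whole secret is there.

This black magic starts with PEP 263.

The ancient PEP 263

I believe 99% of Chinese Python developers have had a headache over one problem: character encoding. It is every beginner's nightmare.

Remember that day? You tried to greet Python with a bit of code: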

print "你好"

And it hit you over the head:

SyntaxError: Non-ASCII character '\xe4' in file chi.py on line 1, but no encoding declared

[Utterly confused]

So you searched online for a solution, and soon you had an answer:

# -*- coding=utf-8 -*-

print "你好"

The comment on the first line specifies the encoding used to parse the file.

This feature comes from PEP 263 -- Defining Python Source Code Encodings, written in 2001. It was introduced to solve a widely reported problem:

In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding "unicode-escape". This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries. Programmers can write their 8-bit strings using the favorite encoding, but are bound to the "unicode-escape" encoding for Unicode literals.

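Python 2 parses source files as ASCII by default, which caused considerable trouble for developers outside the English-speaking world some 15 years ago. It seems Guido was a little self-centered here, designing with only the English-speaking world in mind.

The authors of the proposal envisioned a special comment at the top of the file that specifies the encoding of the source. The regular-expression prototype of that comment is: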

^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

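In other words, # -*- coding=utf-8 -*- is not the only way to write it; it is simply the form Emacs recommends. Forms such as # coding=utf-8 and # encoding: utf-8 are all legal, so don't be surprised when someone else's encoding declaration looks different from yours.

The capture group ([-_.a-zA-Z0-9]+) is used as the name of the encoding to look up, and the codec found under that name is then used to decode the file. That is, behind import example there is a conversion roughly like this: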
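To make the accepted spellings concrete, here is a minimal sketch that applies the pattern quoted above to a few declaration lines. It is written for Python 3; the helper name declared_encoding and the sample lines are made up for illustration and are not part of PEP 263 or the standard library.

```python
import re

# The pattern quoted in the article, applied to one line at a time.
CODING_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named in a declaration comment, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


# Hypothetical sample lines: three spellings from the article, one
# Vim-style variant, and one line with no declaration at all.
samples = [
    "# -*- coding=utf-8 -*-",
    "# coding=utf-8",
    "# encoding: utf-8",
    "# vim: set fileencoding=latin-1 :",
    "print('no declaration here')",
]
results = [declared_encoding(s) for s in samples]
```

Only the text captured by the group matters; everything around it (the Emacs -*- markers, the word "encoding" instead of "coding") is decoration as far as the pattern is concerned.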

# Conceptual sketch only; extract_encoding_info is a hypothetical helper
with open("example.py", "r") as f:
    content = f.read()
    encoding = extract_encoding_info(content) # parse the leading comment
    exec(content.decode(encoding))
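For readers on Python 3, the same idea can be tried with the standard library's tokenize.detect_encoding, which reads the declaration for you. This is only a sketch of the concept, not what the import machinery literally does; the source bytes below are an invented two-line file.

```python
import io
import tokenize

# An invented Latin-1 encoded source file, held in memory as raw bytes.
source_bytes = b"# -*- coding=latin-1 -*-\nname = 'caf\xe9'\n"

# detect_encoding reads at most two lines from the readline callable and
# returns the normalized codec name plus the lines it consumed.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# Decode the bytes with the declared encoding, then run the resulting text.
namespace = {}
exec(source_bytes.decode(encoding), namespace)
```

Note that detect_encoding normalizes the name, so the declared latin-1 comes back as iso-8859-1.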

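So the problem comes back to the familiar str.encode and str.decode.

But how can Python be this powerful?! It recognizes almost every encoding! How does it do that? Is it the standard library? Or is it built into the interpreter?

It is all the work of the codecs module.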

codecs

codecs is a fairly obscure module; the encode/decode methods of str are used far more often, but under the hood they are calls into codecs.
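A quick way to see the relationship is to encode the same text through each route and compare the results. The sketch below assumes Python 3, where text is str and encoded data is bytes.

```python
import codecs

text = "你好"

# The everyday route: the str method.
via_method = text.encode("utf-8")

# The same operation through the codecs module.
via_codecs = codecs.encode(text, "utf-8")

# Looking the codec up by an alias returns its CodecInfo record; the
# stateless encode function returns (output, length of input consumed).
info = codecs.lookup("utf8")
via_info, consumed = info.encode(text)
```

All three routes end up in the same codec, which is why the alias utf8 and the canonical name utf-8 behave identically.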

Open the /path/to/your/python/lib/encodings/ directory and you will find many .py files named after encodings, such as utf_8.py and latin_1.py. These are the predefined codecs, each implementing the logic for one encoding. In other words, a codec is just an ordinary module.
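Because a codec is an ordinary module, it can be imported and inspected like any other. The following Python 3 sketch imports encodings.utf_8 and calls its getregentry function, which is the hook the encodings package uses to obtain the codec's CodecInfo.

```python
import importlib

# Import the module that implements UTF-8, exactly like any other module.
module = importlib.import_module("encodings.utf_8")

# getregentry() builds the CodecInfo record for this codec.
info = module.getregentry()

# The stateless decode function returns (text, number of bytes consumed).
decoded, consumed = info.decode(b"\xe4\xbd\xa0")
```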

Besides the built-in encodings, users can also define their own codecs. codecs exposes a register function for registering custom encodings. Its signature is as follows:

codecs.register(search_function)
Register a codec search function. Search functions are expected to take one argument, the encoding name in all lower case letters, and return a CodecInfo object having the following attributes:

name: The name of the encoding;
encode: The stateless encoding function;
decode: The stateless decoding function;
incrementalencoder: An incremental encoder class or factory function;
incrementaldecoder: An incremental decoder class or factory function;
streamwriter: A stream writer class or factory function;
streamreader: A stream reader class or factory function.

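encode and decode are the stateless encoding/decoding functions. Put simply, the string handled by one call has no connection to the string handled by the next. If you want to use the codecs system to parse a syntax tree, it is best not to put the parsing logic here, because there is no guarantee the code arrives in one continuous piece. The incremental* entries are stateful classes that make up for this limitation of encode and decode. The stream* entries are stream-oriented classes whose behavior is usually the same as encode/decode.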
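The difference between stateless and stateful decoding shows up as soon as the input arrives in pieces. Below is a small Python 3 sketch using the built-in UTF-8 codec; the split point is chosen on purpose to cut a character in half.

```python
import codecs

data = "你好".encode("utf-8")       # 6 bytes, 3 per character
first, second = data[:2], data[2:]  # split inside the first character

# Stateless decoding of the first chunk fails: it ends mid-character
# and the call knows nothing about the bytes still to come.
try:
    first.decode("utf-8")
    stateless_ok = True
except UnicodeDecodeError:
    stateless_ok = False

# The incremental decoder keeps the incomplete bytes between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(first), decoder.decode(second, final=True)]
```

The first incremental call returns an empty string because nothing complete is available yet; the second call returns both characters once the missing byte arrives.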

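For how to write these six objects, see /path/to/your/python/lib/encodings/rot_13.py, which implements a simple cipher.

Now it is time to reveal the truth.

The so-called "Yython"

The black magic is not mysterious at all: follow the existing pattern and define the required interface. As an example, only the keywords actually used are handled here:

yi.py: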

# encoding=utf8

import codecs

# Chinese keyword -> Python keyword; only the keywords used in example.py
yi_map = {
    u"從": "from",
    u"導入": "import",
    u"打印": "print"
}


def encode(input):
    # Python source -> "Yython": swap the keywords back, then encode as UTF-8
    for key, value in yi_map.items():
        input = input.replace(value, key)

    return input.encode("utf8")


def decode(input):
    # "Yython" -> Python source: decode UTF-8, then swap in Python keywords.
    # This is plain text replacement, so string literals are rewritten too.
    input = input.decode("utf8")
    for key, value in yi_map.items():
        input = input.replace(key, value)

    return input


class Codec(codecs.Codec):

    def encode(self, input, errors="strict"):
        # The second element is the amount of *input* consumed
        return (encode(input), len(input))

    def decode(self, input, errors="strict"):
        return (decode(input), len(input))


class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return encode(input)


class IncrementalDecoder(codecs.IncrementalDecoder):
    # No buffering: each chunk must contain whole characters and keywords
    def decode(self, input, final=False):
        return decode(input)


class StreamWriter(Codec, codecs.StreamWriter):
    pass


class StreamReader(Codec, codecs.StreamReader):
    pass


def register_entry(encoding):
    # Search function for codecs.register: answer only for the name "yi"
    return codecs.CodecInfo(
        name="yi",
        encode=Codec().encode,
        decode=Codec().decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader
    ) if encoding == "yi" else None

Register it in the interactive interpreter, and you can see the exciting result:

>>> import codecs, yi
>>> codecs.register(yi.register_entry)
>>> import example
sin(pi) = 1.22464679915e-16
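The article targets Python 2.7. As a rough illustration of what the Python 3 adaptation might look like, here is a minimal sketch that registers only the stateless half of the codec and decodes an in-memory byte string rather than importing a file. The names YI_MAP, yi_decode, yi_encode and search are my own, the incremental and stream classes are left out for brevity, and the sample source avoids 打印 because a Python 2 style print statement is not valid Python 3.

```python
import codecs

# Keyword table mirroring the article's yi_map.
YI_MAP = {"從": "from", "導入": "import", "打印": "print"}


def yi_decode(data, errors="strict"):
    """Stateless decoder: UTF-8 bytes -> text with Python keywords."""
    raw = bytes(data)
    text = raw.decode("utf-8", errors)
    for key, value in YI_MAP.items():
        text = text.replace(key, value)
    return text, len(raw)  # (decoded text, input bytes consumed)


def yi_encode(text, errors="strict"):
    """Stateless encoder: the reverse substitution, then UTF-8."""
    out = text
    for key, value in YI_MAP.items():
        out = out.replace(value, key)
    return out.encode("utf-8", errors), len(text)


def search(name):
    """Codec search function: answer only for the name 'yi'."""
    if name != "yi":
        return None
    return codecs.CodecInfo(name="yi", encode=yi_encode, decode=yi_decode)


codecs.register(search)

# An invented two-line "Yython" program, held as UTF-8 bytes.
source = "從 math 導入 sin, pi\nresult = sin(pi)\n".encode("utf-8")
decoded = source.decode("yi")

namespace = {}
exec(decoded, namespace)
```

After registration the custom name works anywhere a codec name is accepted, which is the same mechanism the coding comment relies on.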

Conclusion

Sometimes, digging a little deeper into things you take for granted leads to surprising discoveries.

References

codecs - Codec registry and base classes


When reposting, please cite this article's address: http://www.ezyhdfw.cn/yun/38074.html
