Feature extraction
The detailed walkthrough is written up in the linked post (傳送門(mén)).
Here I am only roughly organizing the APIs I have used and studied.
Loading features from dicts. This is convenient for extracting features when the data comes as dicts. For example, if each dict has a city field taking one of three different city values, that field can be one-hot encoded.
The class used is DictVectorizer.
>>> measurements = [
...     {"city": "Dubai", "temperature": 33.},
...     {"city": "London", "temperature": 12.},
...     {"city": "San Fransisco", "temperature": 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
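A quick sketch of my own, not from the original example: once fitted, the same DictVectorizer can transform new dicts, and a city value it has never seen simply produces all-zero city columns.

# Hypothetical follow-up: "Tokyo" was not seen during fit, so its one-hot
# columns stay at zero and only the temperature column is filled.
new_measurements = [{"city": "Tokyo", "temperature": 20.}]
vec.transform(new_measurements).toarray()   # shape (1, 4): [[0., 0., 0., 20.]]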
The official docs then give another example using pos_window, about part-of-speech features; I haven't worked with PoS tagging myself. At first I thought the docs were saying this approach breaks down here because there are so many features, but on a closer read that doesn't seem to be the point. I'd appreciate it if someone could clarify.
The English below is quoted verbatim from the original documentation.
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]
This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to make the resulting data structure able to fit in memory the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
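One more small sketch of my own (not part of the docs excerpt): DictVectorizer also takes a sparse parameter, so for toy-sized data you can get a dense numpy array directly instead of calling .toarray().

# Assumes pos_window as defined above; sparse=False returns a numpy.ndarray
# instead of a scipy.sparse matrix.
from sklearn.feature_extraction import DictVectorizer
dense_vec = DictVectorizer(sparse=False)
dense_vec.fit_transform(pos_window)   # array([[1., 1., 1., 1., 1., 1.]])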
I'll set this part aside for now and move on.
Feature hashing
The FeatureHasher class provides high-speed, low-memory vectorization using a technique called feature hashing (the "hashing trick"). Since I haven't really touched this area yet, I won't go into detail.
It is based on MurmurHash, which is quite well known; I've come across it before. Because of a limitation in scipy.sparse, the maximum number of features is capped at
$$2^{31}-1$$
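A minimal sketch of how FeatureHasher is called, assuming dict-style input like the DictVectorizer examples above (the feature names and values here are made up for illustration):

from sklearn.feature_extraction import FeatureHasher

# n_features fixes the width of the output vectors; hash collisions are
# possible but unlikely to hurt much when n_features is large enough.
hasher = FeatureHasher(n_features=10)
toy_data = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
X = hasher.transform(toy_data)   # FeatureHasher is stateless, no fit needed
print(X.shape)                   # (2, 10), stored as a scipy.sparse matrix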
Text feature extraction
Vectorization here means converting a collection of text documents into numerical feature vectors. This particular strategy is also called the "Bag of Words" or "Bag of n-grams" representation: it completely ignores the relative positions of words in the document.
The first class to introduce is CountVectorizer.
>>> from sklearn.feature_extraction.text import CountVectorizer
It has quite a few parameters:
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Here's a quick usage example:
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
The result:
>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
As you can see, the features are built per word and the matrix holds raw word counts, much like a one-hot encoding of the vocabulary; on its own this is usually not very practical.
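Since the docs also call this the "Bag of n-grams" model, here is a small sketch of my own using the ngram_range parameter to count bigrams as well as single words (the parameter values are just an illustration):

# Count unigrams and bigrams on the same corpus as above.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
X2 = bigram_vectorizer.fit_transform(corpus)
# The features now include entries such as 'is the' and 'second second',
# which preserves some local word order that plain bag-of-words discards.
print(bigram_vectorizer.get_feature_names())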
Tf-idf (via TfidfTransformer) works better. I won't go over tf-idf itself; the principle is simple.
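For reference, here is my own note on the formula as scikit-learn applies it with smooth_idf=False (the setting that reproduces the numbers below); each row is then L2-normalized:
$$\text{tf-idf}(t,d) = \text{tf}(t,d)\times\text{idf}(t),\qquad \text{idf}(t) = \ln\frac{n}{\text{df}(t)} + 1$$
where n is the total number of documents and df(t) is the number of documents containing term t.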
Here is an example: counts already holds the precomputed word counts per document, with only three terms (the transformer uses smooth_idf=False, which is what yields the numbers shown).
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])
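As a sanity check of my own on the first row: term 1 appears in all 6 documents, so its idf is ln(6/6) + 1 = 1; term 3 appears in 2 of the 6 documents, so its idf is ln(6/2) + 1 ≈ 2.0986. The unnormalized vector is (3×1, 0, 1×2.0986) = (3, 0, 2.0986), and dividing by its L2 norm sqrt(3² + 2.0986²) ≈ 3.661 gives (0.8194, 0, 0.5732), which matches the first row above.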
In an actual project it is typically used as follows; TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single step.
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
Using the hashing trick to scale to large datasets. I haven't used this myself, but it looks impressive.
The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:
- the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
- fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset,
- building the word-mapping requires a full pass over the dataset, hence it is not possible to fit text classifiers in a strictly online manner,
- pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
- it is not easily possible to split the vectorization work into concurrent sub tasks, as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token, hence would have to be shared, potentially harming the concurrent workers' performance to the point of making them slower than the sequential variant.
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
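A sketch of my own for getting tf-idf weighting on top of the hashing trick, since HashingVectorizer is stateless (no vocabulary_, no way to map feature indices back to tokens) and does no IDF weighting by itself; the pipeline below is an illustration, not something from the original post:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# alternate_sign=False keeps the hashed counts non-negative so the IDF
# weighting stays meaningful (older scikit-learn versions used non_negative=True).
hashing_tfidf = make_pipeline(
    HashingVectorizer(n_features=2 ** 18, alternate_sign=False),
    TfidfTransformer(),
)
X = hashing_tfidf.fit_transform(corpus)   # corpus as defined earlier in the post
print(X.shape)                            # (4, 262144), a scipy.sparse matrix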