
Python: a simple scraper for Toutiao hot news (Part 1)

 阿甘ch1wn8cyc3 2019-06-19

Toutiao (今日頭條) is by now a major player in the self-media space. Today we'll use Python to scrape Toutiao's hot news; in principle the approach can crawl indefinitely.

Open Toutiao in a browser, select "Hot" (熱點) in the left-hand column, and in the Network panel of the browser's developer tools you will quickly spot a request named something like '?category=news_hot...'. Inspecting that response shows that all of the news content is stored in its data field, and that the payload is JSON, as the screenshot below shows:

That makes things easy: all we need is this file's request URL, and we can fetch the page from Python with requests.
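As a minimal sketch of that idea, here is how the request could be assembled and issued. This uses only the standard library (the article itself uses the requests package later); the parameter values are the ones captured above, and the endpoint and its anti-scraping measures may well have changed since this was written:

```python
import json
import urllib.parse
import urllib.request

BASE = 'https://www.toutiao.com/api/pc/feed/'

def build_feed_url(max_behot_time='0', as_='479BB4B7254C150', cp='7E0AC8874BB0985'):
    # Assemble the query string seen in the captured request.
    params = {
        'category': 'news_hot',
        'utm_source': 'toutiao',
        'widen': '1',
        'max_behot_time': max_behot_time,
        'max_behot_time_tmp': max_behot_time,
        'tadrequire': 'true',
        'as': as_,
        'cp': cp,
    }
    return BASE + '?' + urllib.parse.urlencode(params)

def fetch_feed(url):
    # Network call: fetch one page of the feed and parse the JSON body.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode('utf-8'))
```

If the request succeeds, the news items sit under the `data` key of the returned object, matching what the developer tools showed.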

Looking at the request URL, as shown below:

the link turns out to be: https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A1B5AC16548E0FA&cp=5C647E601F9AEE1&_signature=F09fYAAASzBjiSc9oUU9MxdPX3

The URL carries nine parameters. From the captured request they break down as follows:

- category: fixed, news_hot (the Hot channel)
- utm_source: fixed, toutiao
- widen: fixed, 1
- max_behot_time: 0 for the first request, then taken from the returned JSON
- max_behot_time_tmp: same value as max_behot_time
- tadrequire: fixed, true
- as, cp: computed from the current timestamp (algorithm below)
- _signature: can be omitted (see below)

The max_behot_time value is obtained from the JSON data of the previous response, as the screenshot below shows:

 

Searching around online for other people's analyses of the as and cp algorithm, I found that both parameters are generated in the JS file home_4abea46.js; the algorithm is the following code:

!function(t) {
    var e = {};
    e.getHoney = function() {
        var t = Math.floor((new Date).getTime() / 1e3)   // Unix timestamp in seconds
          , e = t.toString(16).toUpperCase()             // timestamp as uppercase hex
          , i = md5(t).toString().toUpperCase();         // uppercase MD5 of the timestamp
        if (8 != e.length)                               // fallback constants when the hex timestamp is not 8 digits
            return {
                as: "479BB4B7254C150",
                cp: "7E0AC8874BB0985"
            };
        for (var n = i.slice(0, 5), a = i.slice(-5), s = "", o = 0; 5 > o; o++)   // interleave md5 prefix with hex timestamp
            s += n[o] + e[o];
        for (var r = "", c = 0; 5 > c; c++)              // interleave hex timestamp with md5 suffix
            r += e[c + 3] + a[c];
        return {
            as: "A1" + s + e.slice(-3),
            cp: e.slice(0, 3) + r + "E1"
        }
    }
    ,
    t.ascp = e
}(window, document),

The Python code for computing the as and cp values is as follows (adapted from this blog post: https://www.cnblogs.com/xuchunlin/p/7097391.html):

def get_as_cp():  # computes the as and cp parameters; ported from Toutiao's obfuscated JS file home_4abea46.js
    zz = {}
    now = round(time.time())
    print(now)  # current Unix timestamp
    e = hex(int(now)).upper()[2:]  # hex() converts an integer to its hexadecimal string representation
    print('e:', e)
    a = hashlib.md5()  # hashlib.md5() creates a hash object; hexdigest() returns the hex result
    print('a:', a)
    a.update(str(int(now)).encode('utf-8'))
    i = a.hexdigest().upper()
    print('i:', i)
    if len(e) != 8:
        zz = {'as': '479BB4B7254C150',   # fallback constants, as in the JS
              'cp': '7E0AC8874BB0985'}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ''
    s = ''
    for i in range(5):
        s = s + n[i] + e[i]   # interleave md5 prefix with hex timestamp
    for j in range(5):
        r = r + e[j+3] + a[j]   # interleave hex timestamp with md5 suffix
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    print('zz:', zz)
    return zz
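A quick structural check of what this transform produces. The sketch below is a compact, self-contained re-implementation of the same interleaving for a fixed timestamp (1560900000 is an arbitrary example value whose hex form has the required 8 digits); the assertions hold for any such timestamp, whatever the MD5 digest turns out to be:

```python
import hashlib

def as_cp_for(ts):
    # Same interleaving transform as get_as_cp, for a given Unix timestamp.
    e = hex(int(ts)).upper()[2:]                                       # timestamp as uppercase hex
    i = hashlib.md5(str(int(ts)).encode('utf-8')).hexdigest().upper()  # uppercase md5 digest
    if len(e) != 8:                                                    # fallback constants
        return {'as': '479BB4B7254C150', 'cp': '7E0AC8874BB0985'}
    n, a = i[:5], i[-5:]
    s = ''.join(n[k] + e[k] for k in range(5))       # md5 prefix / hex interleave
    r = ''.join(e[k + 3] + a[k] for k in range(5))   # hex / md5 suffix interleave
    return {'as': 'A1' + s + e[-3:], 'cp': e[0:3] + r + 'E1'}

pair = as_cp_for(1560900000)   # hex(1560900000) -> '5D0971A0', exactly 8 digits
# Both values are always 15 characters: 'as' is 'A1' + 10 interleaved chars
# + the last 3 hex digits; 'cp' is the first 3 hex digits + 10 chars + 'E1'.
assert len(pair['as']) == 15 and pair['as'].startswith('A1')
assert len(pair['cp']) == 15 and pair['cp'].endswith('E1')
assert pair['as'][-3:] == '1A0' and pair['cp'][:3] == '5D0'
```

The fixed prefix 'A1' and suffix 'E1' are why every captured as value starts with A1 and every cp value ends with E1, as in the example URL above.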

With that, the complete link can be assembled. One more note: the request still returns the JSON data even if the _signature parameter is dropped, so the request URL is done. The complete code follows:

import requests
import json
from openpyxl import Workbook
import time
import hashlib
import os
import datetime
start_url = 'https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time='
url = 'https://www.toutiao.com'
headers={
    'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
cookies = {'tt_webid':'6649949084894053895'}  # copy this cookie from your own browser, to reduce the chance of Toutiao blocking the crawler
max_behot_time = '0'   # URL parameter
title = []       # news titles
source_url = []  # news links
s_url = []       # full news links
source = []      # accounts (頭條號) that published the news
media_url = {}   # full links of those accounts
def get_as_cp():  # computes the as and cp parameters; ported from Toutiao's obfuscated JS file home_4abea46.js
    zz = {}
    now = round(time.time())
    print(now)  # current Unix timestamp
    e = hex(int(now)).upper()[2:]  # hex() converts an integer to its hexadecimal string representation
    print('e:', e)
    a = hashlib.md5()  # hashlib.md5() creates a hash object; hexdigest() returns the hex result
    print('a:', a)
    a.update(str(int(now)).encode('utf-8'))
    i = a.hexdigest().upper()
    print('i:', i)
    if len(e) != 8:
        zz = {'as': '479BB4B7254C150',   # fallback constants, as in the JS
              'cp': '7E0AC8874BB0985'}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ''
    s = ''
    for i in range(5):
        s = s + n[i] + e[i]   # interleave md5 prefix with hex timestamp
    for j in range(5):
        r = r + e[j+3] + a[j]   # interleave hex timestamp with md5 suffix
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    print('zz:', zz)
    return zz
def getdata(url, headers, cookies):  # fetch and parse one page of the feed
    r = requests.get(url, headers=headers, cookies=cookies)
    print(url)
    data = json.loads(r.text)
    return data
def savedata(title, s_url, source, media_url):  # save the data to a file
    # save the data to an xlsx file
    wb = Workbook()
    if not os.path.isdir(os.getcwd()+'/result'):   # check whether the output folder exists
        os.makedirs(os.getcwd()+'/result')  # create the output folder
    filename = os.getcwd()+'/result/result-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M')+'.xlsx'  # Excel file for the results, timestamped to the minute
    ws = wb.active
    ws.title = 'data'   # rename the worksheet
    ws['A1'] = 'title'          # header row
    ws['B1'] = 'news link'
    ws['C1'] = 'account'
    ws['D1'] = 'account link'
    for row in range(2, len(title)+2):   # write the data rows
        _ = ws.cell(column=1, row=row, value=title[row-2])
        _ = ws.cell(column=2, row=row, value=s_url[row-2])
        _ = ws.cell(column=3, row=row, value=source[row-2])
        _ = ws.cell(column=4, row=row, value=media_url[source[row-2]])
    wb.save(filename=filename)  # save the file
def main(max_behot_time, title, source_url, s_url, source, media_url):   # main function
    for i in range(3):   # this number is like the number of times you refresh the feed; a refresh normally yields 10 items, but sometimes fewer, so the final count is not necessarily a multiple of 10
        ascp = get_as_cp()    # get the as and cp parameters
        demo = getdata(start_url+max_behot_time+'&max_behot_time_tmp='+max_behot_time+'&tadrequire=true&as='+ascp['as']+'&cp='+ascp['cp'], headers, cookies)
        print(demo)
        # time.sleep(1)
        for j in range(len(demo['data'])):
            # print(demo['data'][j]['title'])
            if demo['data'][j]['title'] not in title:
                title.append(demo['data'][j]['title'])  # news title
                source_url.append(demo['data'][j]['source_url'])  # news link
                source.append(demo['data'][j]['source'])  # publishing account
            if demo['data'][j]['source'] not in media_url:
                media_url[demo['data'][j]['source']] = url+demo['data'][j]['media_url']  # account link
        print(max_behot_time)
        max_behot_time = str(demo['next']['max_behot_time'])  # max_behot_time for the next request, taken from the current response
    for index in range(len(title)):   # build the full links once, after all pages are fetched, so s_url stays aligned with title
        print('title:', title[index])
        if 'https' not in source_url[index]:
            s_url.append(url+source_url[index])
            print('news link:', url+source_url[index])
        else:
            print('news link:', source_url[index])
            s_url.append(source_url[index])
        print('account:', source[index])
    print(len(title))   # number of news items collected
if __name__ == '__main__':
    main(max_behot_time, title, source_url, s_url, source, media_url)
    savedata(title, s_url, source, media_url)
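The pagination in main() works by feeding each response's next.max_behot_time back into the following request. A mocked sketch of that chaining, where fake_feed is a hypothetical stand-in for getdata() that mimics the shape of Toutiao's JSON response:

```python
def fake_feed(max_behot_time):
    # Stand-in for getdata(): one page of fake items plus the cursor
    # for the next page, shaped like Toutiao's JSON response.
    page = int(max_behot_time)
    return {
        'data': [{'title': 'story-%d-%d' % (page, k)} for k in range(3)],
        'next': {'max_behot_time': page + 1},
    }

def crawl(pages):
    titles, cursor = [], '0'
    for _ in range(pages):
        feed = fake_feed(cursor)
        for item in feed['data']:
            if item['title'] not in titles:   # same dedup rule as main()
                titles.append(item['title'])
        cursor = str(feed['next']['max_behot_time'])  # chain the cursor
    return titles

assert len(crawl(3)) == 9   # 3 pages of 3 unique items each
```

Because each response supplies the cursor for the next request, the loop count in main() is the only thing capping the crawl, which is why the scrape can in principle continue indefinitely.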

So roughly a hundred lines of code scrape Toutiao's hot news and save it locally, and the same approach works for the other channels. That wraps up this scraper. Next time, starting from the accounts collected here, we will scrape each account's own data: its follower count plus the view and comment counts of its latest 10 articles. Stay tuned...

Finally, some screenshots of the program running and of the saved spreadsheet:

---------------------------------------------------------

Comments and discussion are welcome.

 
