5 Minutes a Day Series

Crawler Code

To make it easier to browse the whole "5 Minutes a Day to Master OpenStack" (每天 5 分钟玩转 OpenStack) series, I spent a little time writing a small crawler that fetches all of Cloudman's articles on CSDN. All article links are attached below, and the crawler code itself follows.

import math

import requests
from lxml import etree

profile_url = "https://blog.csdn.net/cloudman6"
base_url = "https://blog.csdn.net/cloudman6/article/list/"

sess = requests.Session()

# Read the total article count from the profile sidebar.
profile_html = sess.get(profile_url)
profile = etree.HTML(profile_html.text)
total_num = profile.xpath('//*[@id="asideProfile"]/div/dl/dd/a/span/text()')

# CSDN shows 20 articles per list page; +1 because range() excludes its
# stop value (e.g. 325 articles -> ceil(325 / 20) = 17 -> pages 1..17).
page_num = math.ceil(int(total_num[0]) / 20) + 1

url_and_title = []
for page in range(1, page_num):
    payload = {"orderby": "UpdateTime"}
    content_html = sess.get(base_url + str(page), params=payload)
    html = etree.HTML(content_html.text)
    urls = html.xpath('//*[@id="mainBox"]/main/div/div/h4/a/@href')
    titles = html.xpath('//*[@id="mainBox"]/main/div/div/h4/a/text()')
    # Each <h4><a> yields two text nodes (a type label such as "原创",
    # then the real title), so titles sit at every second index from 1.
    for i in range(len(urls)):
        url_and_title.append("- [{}]({})".format(titles[1 + i * 2].strip(), urls[i]))

# Write the links oldest-first as a Markdown list. An explicit encoding
# keeps Chinese titles intact regardless of the platform default.
with open("article.md", "w", encoding="utf-8") as f:
    f.write("# Cloudman 博客文章\r\n\r\n")
    for line in url_and_title[::-1]:
        f.write(line + "\r\n")
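One practical note: CSDN tends to reject bare requests that carry no browser-like User-Agent. As a minimal sketch, assuming that is the failure mode you run into, the session above can be given such a header before any get() call (the header string below is only an illustrative placeholder, not part of the original script):

# Hedged sketch: attach a browser-like User-Agent to the session used
# above. The exact header value is an illustrative assumption.
sess = requests.Session()
sess.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
})

With that in place, running the script writes article.md, a Markdown file whose heading and "- [title](url)" entries are exactly the ones rendered below.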

Cloudman Blog Articles