Downloading 《盗墓笔记》 with a Python web crawler
A simple web-crawler experiment: use Python to download the novel 《盗墓笔记》 to the local disk.
First, look at the basic information of the pages to be downloaded and pin down the problems that need handling.
For example, pay attention to the page's character encoding by checking its declared charset.
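A quick way to check is to look at both the HTTP header and the HTML meta tag. A minimal sketch, using the chapter index URL from later in this post (the byte-level regex is only a rough check, not a full HTML parse):

import urllib.request
import re

# Minimal sketch: check the declared encoding of the chapter index page.
resp = urllib.request.urlopen("http://www.daomubiji.com/dao-mu-bi-ji-1")
# Charset declared in the HTTP Content-Type header (may be None)
print(resp.headers.get_content_charset())
# Charset declared in the HTML <meta> tag, if any
raw = resp.read()
m = re.search(rb'charset=["\']?([\w-]+)', raw)
print(m.group(1).decode() if m else "no charset declared in the HTML")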
Also note that every paragraph of the body text starts with a '\u3000' (full-width space) character.
In every chapter except the final one, the end of the last paragraph carries some extra information that has to be stripped, as illustrated in the sketch below.
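To illustrate that cleanup on its own, here is a minimal sketch on a made-up paragraph string; the sample text is hypothetical, and the patterns mirror the ones used in the full script below:

import re

# Hypothetical sample paragraph: leading full-width spaces plus a trailing colored note
para = '\u3000\u3000正文内容……<span style="color:#C00">附加信息</span>'
para = para.replace('\u3000', '')  # drop the full-width spaces
para = re.sub(r'<span style="color:#C00">.*?</span>', '', para)  # drop the trailing note
print(para)  # -> 正文内容……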
The Python code is as follows:
import urllib.request
import re


def getContent():
    # Fetch the chapter index page and decode it as UTF-8
    html = urllib.request.urlopen("http://www.daomubiji.com/dao-mu-bi-ji-1").read()
    html = html.decode("UTF-8")
    # Get every chapter's URL and chapter title
    reg = re.compile(r'<article class="excerpt excerpt-c3"><a href="(.*?)">(.*?)</a></article>')
    urls = re.findall(reg, html)
    for url in urls:
        chapter_url = url[0]
        chapter_title = url[1]
        # Fetch one chapter page and pull out its <p> paragraphs
        chapter = urllib.request.urlopen(chapter_url).read()
        chapter = chapter.decode("UTF-8")
        reg = re.compile(r'<p>(.*?)</p>', re.S)
        chapter_content = re.findall(reg, chapter)
        for i in range(len(chapter_content)):
            # Drop the full-width spaces at the start of each paragraph
            chapter_content[i] = chapter_content[i].replace(u'\u3000', u'')
            # Drop the colored <span> note appended to the last paragraph
            pattern1 = re.compile(r'<span style="color:#C00">.*?</span>')
            chapter_content[i] = re.sub(pattern1, '', chapter_content[i])
            # Drop trailing <a ...> links
            pattern2 = re.compile(r'<a.*')
            chapter_content[i] = re.sub(pattern2, '', chapter_content[i])
            # Drop paragraphs that start with "评论" (reader comments)
            pattern3 = re.compile(r'^评论.*')
            chapter_content[i] = re.sub(pattern3, '', chapter_content[i])
        # Discard paragraphs that became empty after cleanup
        chapter_content = [item for item in chapter_content if item != '']
        print("saving %s" % chapter_title)
        # Write the chapter to <chapter title>.txt, one paragraph per line
        with open("{}.txt".format(chapter_title), 'w', encoding='utf-8') as f:
            for item in chapter_content:
                f.write(item)
                f.write('\n')


if __name__ == "__main__":
    getContent()
After the script finishes, the generated per-chapter .txt files will appear in the corresponding folder.
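If you then want the whole book in a single file, a rough sketch along these lines could work; the output name "盗墓笔记.txt" is my own choice, and sorted() orders file names alphabetically, which is not necessarily chapter order:

import glob

# Sketch: concatenate the generated chapter files into one text file.
paths = sorted(p for p in glob.glob("*.txt") if p != "盗墓笔记.txt")
with open("盗墓笔记.txt", "w", encoding="utf-8") as book:
    for path in paths:
        with open(path, encoding="utf-8") as chapter:
            book.write(chapter.read())
            book.write("\n")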