web_crawler_experiment

Downloading 《盗墓笔记》 with a Python crawler

A simple web crawler experiment: using Python to download the novel 《盗墓笔记》 to the local disk.

First, inspect the basic layout of the pages to be downloaded and pin down the issues that need handling.

For example, mind the page's character encoding: check its charset first.
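Before hard-coding a decode call, it helps to see what charset the server and the page actually declare. A minimal sketch, using the same index URL as the crawler below (the header charset may be absent, in which case get_content_charset() returns None):

import urllib.request

resp = urllib.request.urlopen("http://www.daomubiji.com/dao-mu-bi-ji-1")
# Charset declared in the Content-Type response header, if any.
print(resp.headers.get_content_charset())
# The <meta charset=...> declaration usually sits in the first kilobyte of HTML.
head = resp.read(1024).decode("ascii", errors="ignore")
print([line.strip() for line in head.splitlines() if "charset" in line.lower()])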

Also note that every paragraph of the body text starts with the full-width space character '\u3000'.

And in every chapter except the final one, the last paragraph ends with some appended extra information that has to be removed; both cleanup steps are shown in isolation below.
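To illustrate those two cleanup steps on their own, here is a minimal sketch over a made-up paragraph (the sample text and the <span> note are hypothetical stand-ins for what the real pages contain):

import re

# Hypothetical paragraph: full-width spaces at the start, an appended note at the end.
sample = '\u3000\u3000正文内容……<span style="color:#C00">附加信息</span>'

sample = sample.replace('\u3000', '')  # drop the full-width spaces
sample = re.sub(r'<span style="color:#C00">.*?</span>', '', sample)  # drop the appended note
print(sample)  # -> 正文内容……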

The Python code is as follows:

import urllib.request
import re

def getContent():
    # Fetch the index page, which lists every chapter of book one.
    html = urllib.request.urlopen("http://www.daomubiji.com/dao-mu-bi-ji-1").read()
    html = html.decode("UTF-8")
    # Each chapter link sits in an <article> tag; capture its URL and title.
    reg = re.compile(r'<article class="excerpt excerpt-c3"><a href="(.*?)">(.*?)</a></article>')
    urls = re.findall(reg, html)

    for url in urls:
        chapter_url = url[0]
        chapter_title = url[1]
        chapter = urllib.request.urlopen(chapter_url).read()
        chapter = chapter.decode("UTF-8")
        # The body text is a sequence of <p> paragraphs; re.S lets . match newlines.
        reg = re.compile(r'<p>(.*?)</p>', re.S)
        chapter_content = re.findall(reg, chapter)

        for i in range(len(chapter_content)):
            # Strip the full-width space that opens every paragraph.
            chapter_content[i] = chapter_content[i].replace(u'\u3000', u'')
            # Strip the highlighted note appended to the last paragraph.
            pattern1 = re.compile(r'<span style="color:#C00">.*?</span>')
            chapter_content[i] = re.sub(pattern1, '', chapter_content[i])
            # Strip trailing links and the comment prompt.
            pattern2 = re.compile(r'<a.*')
            chapter_content[i] = re.sub(pattern2, '', chapter_content[i])
            pattern3 = re.compile(r'^评论.*')
            chapter_content[i] = re.sub(pattern3, '', chapter_content[i])

        # Discard paragraphs that became empty after cleanup.
        chapter_content = [item for item in chapter_content if item != '']

        print("saving %s" % chapter_title)

        # The with statement closes the file automatically; no explicit close() is needed.
        with open("{}.txt".format(chapter_title), 'w', encoding='utf-8') as f:
            for item in chapter_content:
                f.write(item)
                f.write('\n')

if __name__ == '__main__':
    getContent()

Once the script finishes, the generated per-chapter .txt files appear in the working directory.
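A quick way to confirm the result from Python, run in the same directory as the script:

import glob

# List the generated chapter files.
print(sorted(glob.glob("*.txt")))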

