web_crawler_experiment

Downloading 《盗墓笔记》 with a Python crawler

A simple web crawler experiment: using Python to download the novel 《盗墓笔记》 to the local disk.

First, inspect the basic layout of the pages to be downloaded and pin down the issues that need handling.

For example, mind the page's character encoding: check its charset first.
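Before hard-coding a decode call, it helps to see what charset the server and the page actually declare. A minimal sketch, using the same index URL as the crawler below (the header charset may be absent, in which case get_content_charset() returns None):

import urllib.request

resp = urllib.request.urlopen("http://www.daomubiji.com/dao-mu-bi-ji-1")
# Charset declared in the Content-Type response header, if any.
print(resp.headers.get_content_charset())
# The <meta charset=...> declaration usually sits in the first kilobyte of HTML.
head = resp.read(1024).decode("ascii", errors="ignore")
print([line.strip() for line in head.splitlines() if "charset" in line.lower()])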

Also note that every paragraph of the body text starts with the full-width space character '\u3000'.

And in every chapter except the final one, the last paragraph ends with some appended extra information that has to be removed; both cleanup steps are shown in isolation below.
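To illustrate those two cleanup steps on their own, here is a minimal sketch over a made-up paragraph (the sample text and the <span> note are hypothetical stand-ins for what the real pages contain):

import re

# Hypothetical paragraph: full-width spaces at the start, an appended note at the end.
sample = '\u3000\u3000正文内容……<span style="color:#C00">附加信息</span>'

sample = sample.replace('\u3000', '')  # drop the full-width spaces
sample = re.sub(r'<span style="color:#C00">.*?</span>', '', sample)  # drop the appended note
print(sample)  # -> 正文内容……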

The Python code is as follows:

import urllib.request
import re

def getContent():
    # Fetch the index page, which lists every chapter of book one.
    html = urllib.request.urlopen("http://www.daomubiji.com/dao-mu-bi-ji-1").read()
    html = html.decode("UTF-8")
    # Each chapter link sits in an <article> tag; capture its URL and title.
    reg = re.compile(r'<article class="excerpt excerpt-c3"><a href="(.*?)">(.*?)</a></article>')
    urls = re.findall(reg, html)

    for url in urls:
        chapter_url = url[0]
        chapter_title = url[1]
        chapter = urllib.request.urlopen(chapter_url).read()
        chapter = chapter.decode("UTF-8")
        # The body text is a sequence of <p> paragraphs; re.S lets . match newlines.
        reg = re.compile(r'<p>(.*?)</p>', re.S)
        chapter_content = re.findall(reg, chapter)

        for i in range(len(chapter_content)):
            # Strip the full-width space that opens every paragraph.
            chapter_content[i] = chapter_content[i].replace(u'\u3000', u'')
            # Strip the highlighted note appended to the last paragraph.
            pattern1 = re.compile(r'<span style="color:#C00">.*?</span>')
            chapter_content[i] = re.sub(pattern1, '', chapter_content[i])
            # Strip trailing links and the comment prompt.
            pattern2 = re.compile(r'<a.*')
            chapter_content[i] = re.sub(pattern2, '', chapter_content[i])
            pattern3 = re.compile(r'^评论.*')
            chapter_content[i] = re.sub(pattern3, '', chapter_content[i])

        # Discard paragraphs that became empty after cleanup.
        chapter_content = [item for item in chapter_content if item != '']

        print("saving %s" % chapter_title)

        # The with statement closes the file automatically; no explicit close() is needed.
        with open("{}.txt".format(chapter_title), 'w', encoding='utf-8') as f:
            for item in chapter_content:
                f.write(item)
                f.write('\n')

if __name__ == '__main__':
    getContent()

Once the script finishes, the generated per-chapter .txt files appear in the working directory.
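A quick way to confirm the result from Python, run in the same directory as the script:

import glob

# List the generated chapter files.
print(sorted(glob.glob("*.txt")))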

