How can Scrapy, together with Selenium, be used to crawl Douban Read? This article walks through the analysis and solution in detail, in the hope of giving readers facing the same problem a simpler, more workable approach.

First, create the Scrapy project:
scrapy startproject douban_read
Then generate the spider:
scrapy genspider douban_spider url
Target page: https://read.douban.com/charts
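The category tabs on the charts page link to URLs whose query string identifies the chart; the spider below extracts that query string with a regular expression and splices it into the site's JSON endpoint. That key step can be tried standalone (the sample href is taken from the comments in the spider code):

```python
import re

# A category link as it appears on https://read.douban.com/charts;
# the part between "charts?" and "&dcs" identifies the chart.
type_url = '/charts?type=unfinished_column&index=featured&dcs=charts&dcm=charts-nav'

part_param = re.search(r'charts\?(.*?)&dcs', type_url).group(1)
ajax_url = 'https://read.douban.com/j/index//charts?{}&verbose=1'.format(part_param)
print(ajax_url)
```

Requesting `ajax_url` returns the chart data as JSON, which is easier to parse than the rendered HTML.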
The key points are explained in comments in the code; suggestions for improvement are welcome.
The Scrapy project directory structure is as follows.
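The layout produced by the two commands above follows the standard Scrapy project template:

```
douban_read/
├── scrapy.cfg
└── douban_read/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── douban_spider.py
```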

douban_spider.py — the spider file:
import re
import json

import scrapy

from ..items import DoubanReadItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    # allowed_domains = ['www']
    start_urls = ['https://read.douban.com/charts']

    def parse(self, response):
        # print(response.text)
        # Grab the URL of each book-category tab (skipping the first tab)
        type_urls = response.xpath('//div[@class="rankings-nav"]/a[position()>1]/@href').extract()
        # print(type_urls)
        for type_url in type_urls:
            # e.g. /charts?type=unfinished_column&index=featured&dcs=charts&dcm=charts-nav
            part_param = re.search(r'charts\?(.*?)&dcs', type_url).group(1)
            # e.g. https://read.douban.com/j/index//charts?type=intermediate_finalized&index=science_fiction&verbose=1
            ajax_url = 'https://read.douban.com/j/index//charts?{}&verbose=1'.format(part_param)
            yield scrapy.Request(ajax_url, callback=self.parse_ajax, encoding='utf-8', meta={'request_type': 'ajax'})

    def parse_ajax(self, response):
        # print(response.text)
        # Parse the JSON payload listing the books in this category
        json_data = json.loads(response.text)
        for data in json_data['list']:
            item = DoubanReadItem()
            item['book_id'] = data['works']['id']
            item['book_url'] = data['works']['url']
            item['book_title'] = data['works']['title']
            item['book_author'] = data['works']['author']
            item['book_cover_image'] = data['works']['cover']
            item['book_abstract'] = data['works']['abstract']
            item['book_wordCount'] = data['works']['wordCount']
            item['book_kinds'] = data['works']['kinds']
            # Hand the item over to the item pipeline
            yield item
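The final comment says each item is yielded to the item pipeline. The pipeline itself is not shown above; a minimal sketch of one (the class name and output file name are assumptions, not from the original project) could simply append every item to a JSON-lines file:

```python
import json


class DoubanReadPipeline:
    """Minimal pipeline sketch: write each yielded item as one JSON line.

    Hypothetical example -- it would be enabled in settings.py via
    ITEM_PIPELINES, e.g. {'douban_read.pipelines.DoubanReadPipeline': 300}.
    """

    def open_spider(self, spider):
        # Output file name is an assumption for illustration.
        self.file = open('douban_books.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item instances and plain dicts.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

With `ensure_ascii=False`, the Chinese titles and abstracts are stored readably instead of as `\uXXXX` escapes.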