Learning Python Web Scraping - A First Look at bs4

Posted at 2021-06-22 Study Notes Python

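Introduction

BeautifulSoup parses HTML and lets you locate elements in the parsed document.

It offers two main lookup methods:

1. find: returns the first matching element and stops searching

2. find_all: returns every matching element within the current content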
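To make the difference between the two methods concrete, here is a minimal sketch. The HTML snippet and the class names in it are invented for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for this example.
html = """
<ul>
  <li class="item">first</li>
  <li class="item">second</li>
  <li class="other">third</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li", class_="item")         # a single Tag
all_items = soup.find_all("li", class_="item")      # a list-like collection of Tags

first_text = first_item.get_text()
all_texts = [li.get_text() for li in all_items]

missing = soup.find("li", class_="absent")          # no match: find gives None
none_found = soup.find_all("li", class_="absent")   # no match: find_all gives an empty result

print(first_text)
print(all_texts)
```

Note that find returns None when nothing matches, while find_all returns an empty collection, so only the result of find needs a None check before use.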

Installation

pip install beautifulsoup4

Import

from bs4 import BeautifulSoup

Usage

Step 1: Hand the page source to BeautifulSoup

main_page = BeautifulSoup(resp.text,"html.parser")

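Note: "html.parser" selects Python's built-in HTML parser. If you omit it, BeautifulSoup still returns a result, but it picks a parser on its own and emits a warning.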
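The warning behaviour can be observed with a small sketch like the one below. The input string is a made-up placeholder, and the exact warning text may differ between bs4 versions, so the code only counts the warnings instead of matching their message.

```python
import warnings

from bs4 import BeautifulSoup

html = "<p>hello</p>"  # hypothetical input, for illustration only

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones

    soup_guess = BeautifulSoup(html)  # no parser given: bs4 guesses one and warns
    guessed_count = len(caught)

    soup_explicit = BeautifulSoup(html, "html.parser")  # explicit parser: no new warning
    explicit_count = len(caught) - guessed_count

text = soup_explicit.p.get_text()
print(guessed_count, explicit_count, text)
```

Naming the parser explicitly also keeps the result consistent across machines, because the guessed parser depends on which parser libraries happen to be installed.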

Step 2: Find the content

alist = main_page.find("div", class_ = "TypeList").find_all("a", class_ = "TypeBigPics")

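Note: this line uses both find and find_all. The find call locates the div tag whose class is "TypeList", which narrows the search scope; the find_all call then collects every a tag whose class is "TypeBigPics" inside that div and stores the results in the alist variable.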
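The chained lookup can be tried offline with a sketch like this one. The markup only imitates the structure described in the note; the href values and the extra links are assumptions added to show what gets filtered out. Because find returns None when nothing matches, the sketch checks the container before calling find_all on it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
html = """
<div class="TypeList">
  <a class="TypeBigPics" href="/weimeitupian/1.html">one</a>
  <a class="TypeBigPics" href="/weimeitupian/2.html">two</a>
  <a class="Other" href="/about.html">about</a>
</div>
<a class="TypeBigPics" href="/outside.html">outside the div</a>
"""

main_page = BeautifulSoup(html, "html.parser")

# First narrowing: the div with class TypeList.
container = main_page.find("div", class_="TypeList")

# Second narrowing: only the matching a tags inside that div.
# Guard against None so a layout change does not raise AttributeError.
alist = container.find_all("a", class_="TypeBigPics") if container is not None else []

hrefs = [a.get("href") for a in alist]
print(hrefs)
```

The link outside the div and the link with a different class are both excluded, which is the point of narrowing the scope in two steps.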

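Step 3: Output the results

from urllib.parse import urljoin

for a in alist:
    href = a.get("href")  # read the value of the a tag's href attribute
    href = urljoin(url, href)  # resolve it against the base URL (str.strip removes a character set, not a prefix)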

Note: all I need here is the URL stored in the a tag's href attribute, so reading it with the tag.get("attribute") form is enough.
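The sketch below shows how get behaves, and why trimming a path with str.strip is fragile. The href value is hypothetical (the real link format on the site is an assumption here); it is chosen so that the file name starts with letters that also occur in "/weimeitupian".

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

url = "https://www.youmeitu.com/weimeitupian/"

# Hypothetical link; the real href format on the site is an assumption.
a = BeautifulSoup('<a href="/weimeitupian/ai123.html">x</a>', "html.parser").a

href = a.get("href")      # value of the attribute
missing = a.get("title")  # attribute not present: get returns None instead of raising

# str.strip treats its argument as a set of characters and removes any of them
# from both ends, so it also eats the leading "ai" of the file name.
stripped = href.strip("/weimeitupian")

# urljoin resolves the href against the base URL without touching the file name.
joined = urljoin(url, href)

print(stripped)
print(joined)
```

With an href that has no such overlapping characters, strip happens to give the expected result, which is why the problem is easy to miss.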

Source code for scraping the high-resolution images under 优美图库 > 唯美图片

import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

domain = "https://www.youmeitu.com"
url = "https://www.youmeitu.com/weimeitupian/"
os.makedirs("Youmeiimg", exist_ok=True)  # make sure the output folder exists before writing to it
resp = requests.get(url)
main_page = BeautifulSoup(resp.text, "html.parser")  # hand the index page source to bs4, using the HTML parser
alist = main_page.find("div", class_="TypeList").find_all("a", class_="TypeBigPics")
# First narrowing: find the div tag whose class is TypeList
# Second narrowing: within that div, find all a tags whose class is TypeBigPics
for a in alist:
    href = urljoin(url, a.get("href"))  # read the a tag's href and resolve it into a full URL
    child_resp = requests.get(href)  # request the detail page
    child_page = BeautifulSoup(child_resp.text, "html.parser")  # hand the detail page source to bs4, using the HTML parser
    imgli = child_page.find("div", class_="ImageBody").find_all("img")
    # First narrowing: find the div whose class is ImageBody
    # Second narrowing: find all img tags inside it
    for img in imgli:
        imgsrc = urljoin(domain, img.get("src"))  # read the img tag's src and resolve it into a full URL
        img_resp = requests.get(imgsrc)  # request the image URL
        filename = imgsrc.split("/")[-1]  # split the image URL on "/" and use the last part as the file name
        with open("Youmeiimg/" + filename, mode="wb") as f:  # open in binary write mode inside the Youmeiimg folder
            f.write(img_resp.content)  # write the bytes of the downloaded image to the file
        print(filename + " over!")  # sample output: 1595309242280236.jpg over!
        time.sleep(1)  # pause 1s after each image to avoid being banned by the server
print("Completed!")  # print Completed! once everything is done
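The script takes the file name from the last "/"-separated part of the image URL. If an image URL ever carried a query string, that part would include it. The helper below is a hypothetical alternative, not part of the original script, and the URLs are invented examples.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(img_url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    return posixpath.basename(urlsplit(img_url).path)


# Hypothetical image URLs, for illustration only.
plain_url = "https://www.youmeitu.com/uploads/1595309242280236.jpg"
query_url = "https://www.youmeitu.com/uploads/1595309242280236.jpg?v=2"

plain = filename_from_url(plain_url)
with_query = filename_from_url(query_url)
naive = query_url.split("/")[-1]  # the approach used in the script keeps "?v=2"

print(plain, with_query, naive)
```

For URLs without a query string, both approaches give the same name, so this only matters if the site's image links change shape.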


However far the road, those who keep walking will arrive.

Lee
