Python Web Crawler Quick-Start Handbook, Chapter 1: Python Web Crawler Basics (2024 edition)

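Summary: A web crawler, also known as a web spider or web robot, is an automated program designed to download pages from the World Wide Web and extract useful information or resources from them.
HTML & CSS: the skeleton and skin of a web page. HTML defines a page's structure, while CSS controls its appearance. Knowing both is what lets you tell a crawler where to look for data.
JavaScript: many modern websites load content dynamically with JavaScript. Understanding basic JavaScript and how it affects content loading is essential for crawling dynamic content.
HTTP/HTTPS: the language a crawler uses to talk to websites.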


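A web crawler, also known as a web spider or web robot, is an automated program designed to download pages from the World Wide Web and extract useful information or resources from them. To master web crawling, you first need a few basic concepts: HTML and CSS, JavaScript, and the HTTP/HTTPS protocols.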
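
To make the point about HTML structure concrete, here is a minimal sketch of how a class attribute tells a crawler where the data lives. It uses only the standard library's html.parser so that it runs without installing anything. The sample markup, the class name product-price, and the names ClassTextFinder and find_text_by_class are assumptions made for illustration, and the sketch assumes well-formed markup.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Demo shop</h1>
  <span class="product-price">19.99</span>
  <span class="note">Free shipping</span>
</body></html>
"""

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextFinder(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.matches = []   # one text entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1            # nested tag inside a match
        elif self.css_class in classes:
            self.matches.append("")    # start collecting text for a new match
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.matches[-1] += data


def find_text_by_class(html, css_class):
    """Return the stripped text of each element whose class list contains css_class."""
    finder = ClassTextFinder(css_class)
    finder.feed(html)
    return [text.strip() for text in finder.matches]


print(find_text_by_class(SAMPLE_HTML, "product-price"))
```

A parsing library such as BeautifulSoup does the same job with far less code; the point here is only that the element's tag and class are what identify the data.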

Next, let's look at several cases that show how web crawlers are used in real production work.

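Suppose you are a data analyst who needs to collect tweets on a particular topic from Twitter for sentiment analysis. With Python's Tweepy library you can connect to the Twitter API and fetch the data. The case is practical and close to real production work: social media analysis is widely used in market research and public opinion monitoring.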

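import tweepy

# Initialize the API client with your own credentials
auth = tweepy.OAuthHandler('YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET')
auth.set_access_token('YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

# Fetch up to 100 English tweets on a given topic.
# Tweepy 4.x names this method search_tweets; Tweepy 3.x called it search.
for tweet in tweepy.Cursor(api.search_tweets, q="#YourTopic", lang="en").items(100):
    print(tweet.text)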

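Imagine you are a competitive intelligence analyst at an e-commerce company and need to monitor competitors' product prices. Python's BeautifulSoup library can parse the HTML page and extract the price. This is a very common task in e-commerce competitive analysis.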

import requests
from bs4 import BeautifulSoup

# Request the product page (placeholder URL)
response = requests.get('http://example.com/product')
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the price from the element with class "product-price"
price = soup.find('span', class_='product-price').text
print(f"Product price: {price}")
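
The scraped price is still a string such as "$1,299.00", which cannot be compared from one run to the next. Below is a minimal sketch of turning that text into a number. The helper name parse_price and the sample strings are hypothetical, and the sketch assumes "," is a thousands separator and "." is the decimal point, which is not true for every locale.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first number found in text as a Decimal, or None if there is none.

    Assumes "," separates thousands and "." marks the decimal part.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


print(parse_price("$1,299.00"))
print(parse_price("Sold out"))
```

Decimal is used instead of float so that amounts like 0.10 are stored exactly, which matters when prices are compared or summed.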

import requests
from bs4 import BeautifulSoup

# Request the news page (placeholder URL)
response = requests.get('http://news.example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the headline and link of each news item
for news_item in soup.find_all('div', class_='news-item'):
    title = news_item.find('h2').text
    link = news_item.find('a')['href']
    print(f"Title: {title}, link: {link}")
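
The href values collected this way are often relative, for example "/politics/story-1", and cannot be requested directly. A minimal sketch of resolving them against the page address with the standard library's urljoin follows. The page URL, the sample href values, and the helper name absolute_link are assumptions for illustration.

```python
from urllib.parse import urljoin

# Assumed address of the page the links were scraped from.
PAGE_URL = "http://news.example.com/world/index.html"


def absolute_link(page_url, href):
    """Resolve href against page_url; absolute hrefs are returned unchanged."""
    return urljoin(page_url, href)


for href in ["/politics/story-1", "story-2.html", "https://other.example.org/a"]:
    print(absolute_link(PAGE_URL, href))
```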

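Suppose you are a financial analyst who needs to track price movements of a particular stock. With Python's requests library you can send a GET request to a stock information site and parse the response to get the price. The code below fetches a single snapshot; tracking the price over time means repeating the request on a schedule. This case is useful in financial analysis and market monitoring.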

import requests
from bs4 import BeautifulSoup

# Send a GET request (placeholder URL)
url = "http://example.com/stock/AAPL"
response = requests.get(url)

# Parse the response body and read the price element
soup = BeautifulSoup(response.content, 'html.parser')
price = soup.find('div', class_='stock-price').text
print(f"Apple stock price: {price}")

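Imagine you are building a personal project that collects the titles and links of the latest articles from your favorite tech blog so you can skim them quickly. Python's requests and BeautifulSoup are enough for this task. The same approach is useful when building a content aggregator or a personal library of learning resources.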

import requests
from bs4 import BeautifulSoup

# Request the blog home page (placeholder URL)
response = requests.get('https://techblog.example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Collect the title and link of each article
articles = []
for article in soup.find_all('article'):
    title = article.find('h2').text
    link = article.find('a')['href']
    articles.append({'title': title, 'link': link})

for article in articles:
    print(f"Title: {article['title']}, link: {article['link']}")
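
For an aggregator that runs repeatedly, the same article will show up on every run. One possible way to keep only new entries is to merge each run into the stored list by link, as in the sketch below. The function name merge_articles and the sample data are hypothetical; the dictionaries use the same title and link keys as the listing above.

```python
def merge_articles(known, fresh):
    """Return the known articles followed by fresh ones whose link is not yet known."""
    seen = {article["link"] for article in known}
    merged = list(known)
    for article in fresh:
        if article["link"] not in seen:
            seen.add(article["link"])
            merged.append(article)
    return merged


# Hypothetical results from two runs of the scraper.
first_run = [{"title": "Intro to asyncio", "link": "/posts/asyncio"}]
second_run = [
    {"title": "Intro to asyncio", "link": "/posts/asyncio"},
    {"title": "Typing tips", "link": "/posts/typing"},
]

for article in merge_articles(first_run, second_run):
    print(f"Title: {article['title']}, link: {article['link']}")
```

The link is used as the key because titles can be edited after publication, while the address of a post usually stays the same.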

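Suppose you are a keen traveler who wants to monitor hotel prices for a destination on a travel site so you can book when the price is lowest. Send the request with Python's requests library, then use BeautifulSoup to parse the hotel prices out of the response. This case is especially useful for travelers on a limited budget.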

import requests
from bs4 import BeautifulSoup

# Send a request to the hotel listing page (placeholder URL)
response = requests.get('http://travel.example.com/hotels?destination=paris')
soup = BeautifulSoup(response.text, 'html.parser')

# Collect the name and price of each hotel
hotels = []
for hotel in soup.find_all('div', class_='hotel-item'):
    name = hotel.find('h2').text
    price = hotel.find('span', class_='price').text
    hotels.append({'name': name, 'price': price})

for hotel in hotels:
    print(f"Hotel: {hotel['name']}, price: {hotel['price']}")
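
Since the goal is to book at the lowest price, a natural next step is to pick the cheapest entry from the collected list. The sketch below shows one way to do that. The sample data and the names to_amount and cheapest are hypothetical, and the conversion assumes prices use "." as the decimal point with no other punctuation inside the number.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical sample in the same shape as the hotels list built above.
hotels = [
    {"name": "Hotel A", "price": "€120"},
    {"name": "Hotel B", "price": "€95.50"},
    {"name": "Hotel C", "price": "Sold out"},
]


def to_amount(price_text):
    """Keep only ASCII digits and '.', then convert; None if nothing numeric remains."""
    cleaned = "".join(ch for ch in price_text if ch in "0123456789.")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def cheapest(hotel_list):
    """Return the hotel dict with the lowest parseable price, or None if there is none."""
    priced = [(to_amount(h["price"]), h) for h in hotel_list]
    priced = [(amount, h) for amount, h in priced if amount is not None]
    if not priced:
        return None
    return min(priced, key=lambda pair: pair[0])[1]


best = cheapest(hotels)
print(f"Cheapest: {best['name']} at {best['price']}")
```

Entries whose price cannot be read, such as a sold-out hotel, are skipped rather than treated as zero, so they can never win the comparison by accident.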

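import tweepy

# Initialize the Tweepy API client; wait_on_rate_limit pauses when the rate limit is hit
auth = tweepy.OAuthHandler('YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET')
auth.set_access_token('YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_TOKEN_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search posts; tweet_mode='extended' returns the untruncated text in full_text.
# Tweepy 4.x names this method search_tweets; Tweepy 3.x called it search.
for tweet in tweepy.Cursor(api.search_tweets, q="#YourTopic", lang="en", tweet_mode='extended').items(100):
    print(tweet.full_text)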

import requests
from bs4 import BeautifulSoup

# Request the news site (placeholder URL)
response = requests.get('https://news.example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Select headline elements with a CSS selector and show each title and link
for news_item in soup.select('.news-title'):
    title = news_item.text
    link = news_item.find('a')['href']
    print(f"Title: {title}, link: {link}")

import time
import requests
from bs4 import BeautifulSoup

product_urls = ['http://onlinestore.example.com/product1', 'http://onlinestore.example.com/product2']

for url in product_urls:
    # Send the request
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the product price
    price = soup.find('span', class_='price').text
    print(f"Product price: {price}")

    # Pause between requests to avoid hitting the site too quickly
    time.sleep(10)
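
A fixed sleep after every request also waits after the last one and ignores the time already spent downloading and parsing. One possible refinement is a small throttle that enforces a minimum gap between requests, sketched below. The class name Throttle and its parameters are hypothetical, the interval is shortened to 0.2 seconds so the example finishes quickly, and the network call is replaced by a print so the sketch runs offline.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.2)
for url in ["http://onlinestore.example.com/product1",
            "http://onlinestore.example.com/product2"]:
    throttle.wait()               # returns immediately before the first request
    print(f"would fetch {url}")   # a real crawler would call requests.get(url) here
```

Calling wait() before each request instead of sleeping after it means the loop ends as soon as the last page has been processed.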