import asyncio
import aiohttp
from bs4 import BeautifulSoup
import pandas as pd
from fake_useragent import UserAgent
import aiomysql
# Use a semaphore to cap concurrency
semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight
# Proxy credentials
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"
# Build the proxy URL
PROXY = f"http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"
# Random User-Agent generator
ua = UserAgent()
# Database configuration
DB_CONFIG = {
    'host': 'localhost',
    'port': 3306,
    'user': 'your_username',
    'password': 'your_password',
    'db': 'your_database',
    'charset': 'utf8mb4'
}
# Storage optimization: asynchronous batch write to the database
async def save_to_db(data):
    conn = await aiomysql.connect(**DB_CONFIG)
    async with conn.cursor() as cur:
        await cur.executemany(
            "INSERT INTO finance_data (column1, column2, column3) VALUES (%s, %s, %s)",
            data,
        )
    await conn.commit()
    conn.close()
# Crawl the page for a single stock
async def crawl_stock(stock_code, session):
    async with semaphore:
        url = f"https://finance.sina.com.cn/stock/{stock_code}.html"
        headers = {"User-Agent": ua.random}
        async with session.get(url, headers=headers, proxy=PROXY) as response:
            html = await response.text()
            data = parse(html)
            return data
# Parse the page content
def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Placeholder selector: assumes the data sits in a table with class "example"
    table = soup.find('table', {'class': 'example'})
    data = []
    if table is None:  # nothing matched, return an empty result
        return data
    for row in table.find_all('tr'):
        cols = [ele.text.strip() for ele in row.find_all('td')]
        data.append([ele for ele in cols if ele])
    return data
# Main entry point
async def main(stock_codes):
    async with aiohttp.ClientSession() as session:
        tasks = [crawl_stock(stock_code, session) for stock_code in stock_codes]
        all_data = await asyncio.gather(*tasks)
    # Flatten the per-stock result lists
    flat_data = [item for sublist in all_data for item in sublist]
    # Write everything to the database in one asynchronous batch
    await save_to_db(flat_data)
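
# Optional variant (a sketch, not part of the original script): reuse a single
# aiomysql connection pool instead of opening a new connection on every
# save_to_db() call. The helper names below (save_to_db_pooled, main_with_pool)
# are illustrative; they assume the same DB_CONFIG and finance_data table as above.
async def save_to_db_pooled(pool, data):
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.executemany(
                "INSERT INTO finance_data (column1, column2, column3) VALUES (%s, %s, %s)",
                data,
            )
        await conn.commit()

async def main_with_pool(stock_codes):
    pool = await aiomysql.create_pool(**DB_CONFIG)
    try:
        async with aiohttp.ClientSession() as session:
            tasks = [crawl_stock(code, session) for code in stock_codes]
            all_data = await asyncio.gather(*tasks)
        flat_data = [item for sublist in all_data for item in sublist]
        await save_to_db_pooled(pool, flat_data)
    finally:
        pool.close()
        await pool.wait_closed()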
# Sample stock codes
stock_codes = [
    '000001',
    '000002',
    # ... more stock codes
]
# Run the crawler
asyncio.run(main(stock_codes))


