Automatically Crawling LeetCode Problems and Saving Them as Markdown


Copy, paste, run.

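Preface

When working through LeetCode problems, I like to write up my solutions locally. Copying every problem statement by hand is tedious, though, and the copied text is not in Markdown. So I wondered whether I could crawl the content and convert it automatically.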

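Main Text

I'm not very familiar with web crawlers (mainly, I'm not comfortable with capturing network traffic), so rather than writing everything from scratch I searched around and found a blog post that covers exactly this. It looked good, so I went with the copy-and-paste approach. What I borrowed is mainly the code for converting the format and for fetching the content.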

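How It Works

LeetCode's requests all use GraphQL. Compared with a RESTful API, it is more flexible and can reduce the number of requests, since a single query can fetch several pieces of data. A query can also ask for only a subset of the fields.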
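To make the "ask for only a subset of the fields" point concrete, here is a minimal sketch that builds a GraphQL request body selecting just a handful of fields. The helper name `build_question_payload` and its default field list are assumptions for illustration; only the field names and the `questionData` operation come from the request LeetCode's own page sends.

```python
import json


def build_question_payload(title_slug, fields=("questionId", "title", "difficulty")):
    """Build a GraphQL request body that asks only for the given scalar fields."""
    selection = "\n    ".join(fields)
    query = (
        "query questionData($titleSlug: String!) {\n"
        "  question(titleSlug: $titleSlug) {\n"
        f"    {selection}\n"
        "  }\n"
        "}"
    )
    return {
        "operationName": "questionData",
        "variables": {"titleSlug": title_slug},
        "query": query,
    }


payload = build_question_payload("merge-two-sorted-lists")
print(json.dumps(payload, indent=2))
```

Anything left out of the selection (for example `codeSnippets`) is simply not part of the request, which is what keeps the response small.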

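For this example, press F12 to open the developer tools, switch to the Network tab, and search for graphql requests. After a bit of filtering you will find the relevant request (shown here as a curl command):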

curl 'https://leetcode-cn.com/graphql/' \
-H 'authority: leetcode-cn.com' \
-H 'x-timezone: undefined' \
-H 'x-operation-name: questionData' \
-H 'accept-language: zh-CN' \
-H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
-H 'content-type: application/json' \
-H 'accept: */*' \
-H 'x-csrftoken: balabalabala' \
-H 'dnt: 1' \
-H 'x-definition-name: question' \
-H 'origin: https://leetcode-cn.com' \
-H 'sec-fetch-site: same-origin' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-dest: empty' \
-H 'referer: https://leetcode-cn.com/problems/best-time-to-buy-and-sell-stock-iii/' \
-H 'cookie: gbalabalabala' \
--data-binary $'{"operationName":"questionData","variables":{"titleSlug":"best-time-to-buy-and-sell-stock-iii"},"query":"query questionData($titleSlug: String\u0021) {\\n question(titleSlug: $titleSlug) {\\n questionId\\n questionFrontendId\\n boundTopicId\\n title\\n titleSlug\\n content\\n translatedTitle\\n translatedContent\\n isPaidOnly\\n difficulty\\n likes\\n dislikes\\n isLiked\\n similarQuestions\\n contributors {\\n username\\n profileUrl\\n avatarUrl\\n __typename\\n }\\n langToValidPlayground\\n topicTags {\\n name\\n slug\\n translatedName\\n __typename\\n }\\n companyTagStats\\n codeSnippets {\\n lang\\n langSlug\\n code\\n __typename\\n }\\n stats\\n hints\\n solution {\\n id\\n canSeeDetail\\n __typename\\n }\\n status\\n sampleTestCase\\n metaData\\n judgerAvailable\\n judgeType\\n mysqlSchemas\\n enableRunCode\\n envInfo\\n book {\\n id\\n bookName\\n pressName\\n source\\n shortDescription\\n fullDescription\\n bookImgUrl\\n pressImgUrl\\n productUrl\\n __typename\\n }\\n isSubscribed\\n isDailyQuestion\\n dailyRecordStatus\\n editorType\\n ugcQuestionId\\n style\\n __typename\\n }\\n}\\n"}' \
--compressed

As you can see, it requests quite a lot of fields, but it is hard on the eyes in this form, so let's reformat it.

{
  operationName: "questionData"
  query: "query questionData($titleSlug: String!) {
    question(titleSlug: $titleSlug) {
      questionId
      questionFrontendId
      boundTopicId
      title
      titleSlug
      content
      translatedTitle
      translatedContent
      isPaidOnly
      difficulty
      likes
      dislikes
      isLiked
      similarQuestions
      contributors {
        username
        profileUrl
        avatarUrl
        __typename
      }
      langToValidPlayground
      topicTags {
        name
        slug
        translatedName
        __typename
      }
      companyTagStats
      codeSnippets {
        lang
        langSlug
        code
        __typename
      }
      stats
      hints
      solution {
        id
        canSeeDetail
        __typename
      }
      status
      sampleTestCase
      metaData
      judgerAvailable
      judgeType
      mysqlSchemas
      enableRunCode
      envInfo
      book {
        id
        bookName
        pressName
        source
        shortDescription
        fullDescription
        bookImgUrl
        pressImgUrl
        productUrl
        __typename
      }
      isSubscribed
      isDailyQuestion
      dailyRecordStatus
      editorType
      ugcQuestionId
      __typename
    }
  }
  "
  variables: {titleSlug: "merge-two-sorted-lists"}
}
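
The readable view above can also be produced programmatically. The sketch below assumes the request body has been copied out of the browser as a JSON string (shortened here to two fields for illustration). Parsing it with `json.loads` turns the escaped `\n` sequences inside `query` back into real line breaks.

```python
import json

# A shortened stand-in for the captured request body (assumption: the real
# body has the same three top-level keys, just with a much longer query).
raw_body = (
    '{"operationName":"questionData",'
    '"variables":{"titleSlug":"merge-two-sorted-lists"},'
    '"query":"query questionData($titleSlug: String!) {\\n'
    '  question(titleSlug: $titleSlug) {\\n    title\\n    content\\n  }\\n}\\n"}'
)

body = json.loads(raw_body)
print("operationName:", body["operationName"])
print("variables:", body["variables"])
print(body["query"])
```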

Translated into Python, it looks roughly like this:

import json

import requests


def get_all(slug):
    """Fetch the details of one problem, identified by its URL slug."""
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
    session = requests.Session()
    url = "https://leetcode-cn.com/graphql"
    # GraphQL request body: operation name, variables, and the query itself.
    params = {
        'operationName': "getQuestionDetail",
        'variables': {
            'titleSlug': slug
        },
        'query':
        '''query getQuestionDetail($titleSlug: String!) {
            question(titleSlug: $titleSlug) {
                questionId
                questionFrontendId
                title
                titleSlug
                content
                translatedTitle
                translatedContent
                difficulty
                topicTags {
                    name
                    slug
                    translatedName
                    __typename
                }
                codeSnippets {
                    lang
                    langSlug
                    code
                    __typename
                }
                __typename
            }
        }'''
    }
    # Serialize the body as UTF-8 encoded JSON.
    json_data = json.dumps(params).encode('utf8')
    headers = {
        'User-Agent': user_agent,
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Referer': 'https://leetcode-cn.com/problems/' + slug
    }
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    resp.encoding = 'utf8'
    content = resp.json()
    # Detailed question information
    # print(content)
    question = content['data']['question']
    return question

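The request parameters just need to be serialized as JSON. The response is then parsed as JSON as well, and the fields can be read directly from under data → question.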
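
As a sketch of that parsing step, the helper below pulls `data.question` out of an already-decoded response and fails loudly when it is missing. The function name `extract_question` and the sample response are made up for illustration. Treating a null `question` as an unknown slug is an assumption about how the API behaves, not something the captured traffic confirms.

```python
def extract_question(response):
    """Return the `question` object from a decoded GraphQL response dict."""
    # GraphQL servers conventionally report failures in a top-level "errors" list.
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    question = (response.get("data") or {}).get("question")
    if question is None:
        raise LookupError("no question in response (wrong titleSlug?)")
    return question


# Hypothetical response, trimmed to a few of the requested fields.
sample = {
    "data": {
        "question": {
            "questionFrontendId": "21",
            "title": "Merge Two Sorted Lists",
            "difficulty": "Easy",
            "topicTags": [{"name": "Linked List", "slug": "linked-list"}],
        }
    }
}

question = extract_question(sample)
print(question["questionFrontendId"], question["title"], question["difficulty"])
```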

Code

Just copy, paste, and run. Oh, and remember to change the url.

Here is a GitHub Gist
If the GitHub Gist won't load, see the copy on Gitee
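
The script itself lives behind the links above, so the following is only a rough sketch of the last step: turning the fetched `question` dict into the text of a Markdown file. It uses the standard library's `html.parser` and handles just a few tags (`p`, `strong`, `em`, `code`, `pre`, `li`). The class and function names are hypothetical, and the linked script may well use a dedicated HTML-to-Markdown library instead.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # Markdown code fence, built here to keep this listing readable


class SimpleMarkdown(HTMLParser):
    """Very small HTML-to-Markdown converter for problem statements."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_pre = False

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
            self.parts.append(f"\n{FENCE}\n")
        elif tag == "li":
            self.parts.append("- ")
        elif tag in self.INLINE and not self.in_pre:
            self.parts.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False
            self.parts.append(f"\n{FENCE}\n")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n")
        elif tag in self.INLINE and not self.in_pre:
            self.parts.append(self.INLINE[tag])

    def handle_data(self, data):
        self.parts.append(data)


def question_to_markdown(question):
    """Render a title heading followed by the converted problem statement."""
    parser = SimpleMarkdown()
    parser.feed(question["content"])
    parser.close()
    body = "".join(parser.parts).strip()
    return f"# {question['questionFrontendId']}. {question['title']}\n\n{body}\n"


# Hypothetical input in the shape returned by get_all(slug).
demo = {
    "questionFrontendId": "21",
    "title": "Merge Two Sorted Lists",
    "content": "<p>Merge two <strong>sorted</strong> lists.</p>"
               "<ul><li>Return the <code>head</code>.</li></ul>",
}
markdown = question_to_markdown(demo)
print(markdown)
```

The returned string could then be written to a file named after the slug, opened with `encoding="utf8"` so that translated titles and content survive intact.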

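References

Fetching LeetCode Problem Information with a Crawler and Converting It to Markdown
Crawling LeetCode Problems: How to Send a GraphQL Query to Fetch Data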


Automatically Crawling LeetCode Problems and Saving Them as Markdown
https://www.yikakia.com/自动爬取-Leetcode-题目,并保存为-Markdown-格式/
Author
Yika
Published
January 9, 2021
License