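Last updated: 1 day ago
Copy, paste, run
Preface
When working through LeetCode problems I like to write my solutions locally, but copying every problem statement by hand is tedious, and the copied text is not in Markdown anyway. So I wondered whether I could scrape the content and convert the format automatically.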
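Main text
I am not very familiar with web scraping (mainly with capturing and inspecting network traffic), so instead of writing it from scratch I searched around and found a blog post that does exactly this. It looked good, so I went straight for the "CV method" (Ctrl+C, Ctrl+V). What I borrowed was mainly the code for fetching the content and for converting the format.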
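How it works
LeetCode's requests all use GraphQL. Compared with a RESTful API it is more flexible and can reduce the number of requests, since a single query can fetch several pieces of data. A query can also ask for only the fields it needs.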
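To make the "ask for only the fields it needs" point concrete, here is a minimal sketch that builds a GraphQL request body for a chosen subset of fields. The helper name `build_question_query` and the operation name `questionSubset` are made up for illustration; the field names come from the captured request in this post, and whether the server accepts an arbitrary operation name is an assumption.

```python
import json


def build_question_query(title_slug, fields):
    """Build a GraphQL request body that asks only for the given fields.

    Hypothetical helper for illustration; not part of any LeetCode SDK.
    """
    selection = "\n    ".join(fields)
    query = (
        "query questionSubset($titleSlug: String!) {\n"
        "  question(titleSlug: $titleSlug) {\n"
        f"    {selection}\n"
        "  }\n"
        "}"
    )
    return {
        "operationName": "questionSubset",  # assumed name, must match the query
        "variables": {"titleSlug": title_slug},
        "query": query,
    }


# Ask for three scalar fields instead of the full list the web page requests.
body = build_question_query(
    "best-time-to-buy-and-sell-stock-iii",
    ["questionFrontendId", "translatedTitle", "difficulty"],
)
print(json.dumps(body, indent=2))
```

The same endpoint and the same `question` field serve both the full and the trimmed request; only the selection set changes.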
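For this example, press F12 to open the developer tools, switch to the Network tab, and search for graphql requests. After a bit of filtering you will find the right one: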
```bash
curl 'https://leetcode-cn.com/graphql/' \
  -H 'authority: leetcode-cn.com' \
  -H 'x-timezone: undefined' \
  -H 'x-operation-name: questionData' \
  -H 'accept-language: zh-CN' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
  -H 'content-type: application/json' \
  -H 'accept: */*' \
  -H 'x-csrftoken: balabalabala' \
  -H 'dnt: 1' \
  -H 'x-definition-name: question' \
  -H 'origin: https://leetcode-cn.com' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: https://leetcode-cn.com/problems/best-time-to-buy-and-sell-stock-iii/' \
  -H 'cookie: gbalabalabala' \
  --data-binary $'{"operationName":"questionData","variables":{"titleSlug":"best-time-to-buy-and-sell-stock-iii"},"query":"query questionData($titleSlug: String\u0021) {\\n question(titleSlug: $titleSlug) {\\n questionId\\n questionFrontendId\\n boundTopicId\\n title\\n titleSlug\\n content\\n translatedTitle\\n translatedContent\\n isPaidOnly\\n difficulty\\n likes\\n dislikes\\n isLiked\\n similarQuestions\\n contributors {\\n username\\n profileUrl\\n avatarUrl\\n __typename\\n }\\n langToValidPlayground\\n topicTags {\\n name\\n slug\\n translatedName\\n __typename\\n }\\n companyTagStats\\n codeSnippets {\\n lang\\n langSlug\\n code\\n __typename\\n }\\n stats\\n hints\\n solution {\\n id\\n canSeeDetail\\n __typename\\n }\\n status\\n sampleTestCase\\n metaData\\n judgerAvailable\\n judgeType\\n mysqlSchemas\\n enableRunCode\\n envInfo\\n book {\\n id\\n bookName\\n pressName\\n source\\n shortDescription\\n fullDescription\\n bookImgUrl\\n pressImgUrl\\n productUrl\\n __typename\\n }\\n isSubscribed\\n isDailyQuestion\\n dailyRecordStatus\\n editorType\\n ugcQuestionId\\n style\\n __typename\\n }\\n}\\n"}' \
  --compressed
```
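As you can see, the request asks for quite a lot of fields, but in this form it is hard on the eyes, so let's convert it to a more readable layout.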
```
{
  operationName: "questionData"
  query: "query questionData($titleSlug: String!) {
    question(titleSlug: $titleSlug) {
      questionId
      questionFrontendId
      boundTopicId
      title
      titleSlug
      content
      translatedTitle
      translatedContent
      isPaidOnly
      difficulty
      likes
      dislikes
      isLiked
      similarQuestions
      contributors {
        username
        profileUrl
        avatarUrl
        __typename
      }
      langToValidPlayground
      topicTags {
        name
        slug
        translatedName
        __typename
      }
      companyTagStats
      codeSnippets {
        lang
        langSlug
        code
        __typename
      }
      stats
      hints
      solution {
        id
        canSeeDetail
        __typename
      }
      status
      sampleTestCase
      metaData
      judgerAvailable
      judgeType
      mysqlSchemas
      enableRunCode
      envInfo
      book {
        id
        bookName
        pressName
        source
        shortDescription
        fullDescription
        bookImgUrl
        pressImgUrl
        productUrl
        __typename
      }
      isSubscribed
      isDailyQuestion
      dailyRecordStatus
      editorType
      ugcQuestionId
      __typename
    }
  }
  "
  variables: {titleSlug: "merge-two-sorted-lists"}
}
```
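One way to get from the escaped one-line body in the curl command to a readable query is to parse the body as JSON and print its `query` field. The sketch below uses a shortened, hand-trimmed body rather than the full captured one, and assumes the shell's `$'...'` quoting has already been resolved (so `\u0021` is `!` and `\\n` is the JSON escape `\n`).

```python
import json

# Shortened stand-in for the captured request body (assumption: trimmed by
# hand to three fields; the real body lists many more).
raw_body = r'{"operationName":"questionData","variables":{"titleSlug":"merge-two-sorted-lists"},"query":"query questionData($titleSlug: String!) {\n  question(titleSlug: $titleSlug) {\n    questionId\n    title\n    __typename\n  }\n}\n"}'

# json.loads turns the "\n" escapes inside the query into real line breaks.
payload = json.loads(raw_body)

print("operationName:", payload["operationName"])
print("variables:", payload["variables"])
print(payload["query"])
```

Printing the decoded `query` shows one field per line, which is much easier to trim down to the fields actually needed.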
Converted into Python code, it looks roughly like this:
```python
import json

import requests


def get_all(slug):
    """Fetch the question data for the problem identified by `slug`."""
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
    session = requests.Session()
    url = "https://leetcode-cn.com/graphql"
    # GraphQL request body: operation name, variables, and the query itself.
    # Only a subset of the fields the web page asks for is requested here.
    params = {
        'operationName': "getQuestionDetail",
        'variables': {
            'titleSlug': slug
        },
        'query': '''query getQuestionDetail($titleSlug: String!) {
            question(titleSlug: $titleSlug) {
                questionId
                questionFrontendId
                title
                titleSlug
                content
                translatedTitle
                translatedContent
                difficulty
                topicTags {
                    name
                    slug
                    translatedName
                    __typename
                }
                codeSnippets {
                    lang
                    langSlug
                    code
                    __typename
                }
                __typename
            }
        }'''
    }
    # Serialize the body to JSON bytes before sending.
    json_data = json.dumps(params).encode('utf8')
    headers = {
        'User-Agent': user_agent,
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Referer': 'https://leetcode-cn.com/problems/' + slug
    }
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    resp.encoding = 'utf8'
    # The payload of interest sits under data -> question in the response.
    content = resp.json()
    question = content['data']['question']
    return question
```
Serializing these parameters to JSON is all it takes. Then parse the response as JSON and read the fields directly under data → question.
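As a rough illustration of what can be done with the fields under data → question, the sketch below renders them as a Markdown document. The function name `question_to_markdown`, the output layout, and the sample response are all made up for this example; the actual script in the Gist may convert things differently. The HTML problem statement is passed through unchanged, on the assumption that the Markdown renderer accepts inline HTML.

```python
def question_to_markdown(question):
    """Render a question dict as a Markdown string (illustrative layout)."""
    # Prefer the translated fields when present, else fall back to the originals.
    title = question.get("translatedTitle") or question["title"]
    tags = ", ".join(
        tag.get("translatedName") or tag["name"] for tag in question["topicTags"]
    )
    lines = [
        f"# {question['questionFrontendId']}. {title}",
        "",
        f"- Difficulty: {question['difficulty']}",
        f"- Tags: {tags}",
        "",
        # The statement is HTML and is left as-is in this sketch.
        question.get("translatedContent") or question["content"],
    ]
    return "\n".join(lines)


# Hand-written sample mimicking the shape of content['data']['question'];
# the values are placeholders, not a real server response.
sample_response = {
    "data": {
        "question": {
            "questionFrontendId": "21",
            "title": "Merge Two Sorted Lists",
            "translatedTitle": None,
            "difficulty": "Easy",
            "content": "<p>Merge two sorted linked lists.</p>",
            "translatedContent": None,
            "topicTags": [
                {"name": "Linked List", "translatedName": None},
                {"name": "Recursion", "translatedName": None},
            ],
        }
    }
}

markdown = question_to_markdown(sample_response["data"]["question"])
print(markdown)
```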
Code
Just copy, paste, and run. Oh, and remember to change the URL.
Here is a GitHub Gist.
If the GitHub Gist will not load, see the copy on Gitee.
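The script itself lives in the linked Gist and is not reproduced here, so exactly which URL has to be changed is an assumption. If it is the problem page URL, a small helper like the hypothetical `slug_from_url` below could derive the `titleSlug` variable from it, using the `/problems/<slug>/` path seen in the captured Referer header.

```python
from urllib.parse import urlparse


def slug_from_url(problem_url):
    """Extract the title slug from a problem page URL (hypothetical helper)."""
    # Split the path into non-empty segments, e.g. ["problems", "<slug>"].
    parts = [p for p in urlparse(problem_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "problems":
        raise ValueError(f"not a problem URL: {problem_url}")
    return parts[1]


slug = slug_from_url(
    "https://leetcode-cn.com/problems/best-time-to-buy-and-sell-stock-iii/"
)
print(slug)
```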
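References
Scraping LeetCode problem information and converting it to Markdown (爬虫获取力扣题目信息并转为Markdown)
Scraping LeetCode problems: how to send a GraphQL query to fetch data (爬取LeetCode题目——如何发送GraphQL Query获取数据)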