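Integrating with goquery
Overview
Crawlers written in Go often use goquery. This article shows how to combine goquery with req to build a crawler that is more robust and easier to troubleshoot.
Code Example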
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/imroc/req/v3"
)

var globalClient *req.Client

func init() {
	globalClient = req.C().
		// Enable dump at the request level for each request and only
		// temporarily store the dump content in memory, so we can call
		// resp.Dump() to get the dump content when needed in the response
		// middleware.
		EnableDumpEachRequest().
		OnAfterResponse(func(client *req.Client, resp *req.Response) error {
			if resp.Err != nil { // Ignore when there is an underlying error, e.g. network error.
				return nil
			}
			// Treat non-successful responses as errors, record raw dump content in error message.
			if !resp.IsSuccessState() { // Status code is not between 200 and 299.
				resp.Err = fmt.Errorf("bad response, raw content:\n%s", resp.Dump())
			}
			return nil
		})
}

func crawl(url string, callback func(doc *goquery.Document) error) error {
	// Send request.
	resp, err := globalClient.R().Get(url)
	if err != nil {
		return err
	}

	// Pass resp.Body to goquery.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil { // Append raw dump content to the error message if goquery fails to parse, to help troubleshooting.
		return fmt.Errorf("failed to parse html: %s, raw content:\n%s", err.Error(), resp.Dump())
	}

	err = callback(doc)
	if err != nil {
		err = fmt.Errorf("%s, raw content:\n%s", err.Error(), resp.Dump())
	}
	return err
}

func main() {
	// Crawl the weekly GitHub trending page and print it out.
	err := crawl("https://github.com/trending?since=weekly", func(doc *goquery.Document) error {
		buf := new(bytes.Buffer)
		doc.Find(".Box .Box-row").Each(func(i int, s *goquery.Selection) {
			href, ok := s.Find("h1 a").First().Attr("href")
			if !ok || href == "" {
				return
			}
			repo := strings.TrimPrefix(href, "/")
			starsTotal := s.Find("div.f6 a").First().Text()
			starsWeek := s.Find("div.f6 span").Last().Text()
			starsTotal = strings.TrimSpace(starsTotal)
			starsWeek = strings.TrimSpace(starsWeek)
			fmt.Fprintf(buf, "No.%d %s\t%s stars total\t%s\n", i+1, repo, starsTotal, starsWeek)
		})
		if buf.Len() == 0 {
			return errors.New("failed to parse trending")
		}
		fmt.Println("GitHub Trending:")
		fmt.Println(buf.String())
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
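- EnableDumpEachRequest is convenient syntactic sugar. Internally it uses a Request middleware to enable dump for each request individually and keeps the dump content in memory, so it can be retrieved with Response.Dump() when needed.
- OnAfterResponse adds a Response middleware that handles failures in one place: every non-successful response (status code not between 200 and 299) is treated as an error, and the raw dump content is recorded in the error and returned to the caller.
- Although the body is read automatically by default, resp.Body is restored automatically, so it can be passed straight to goquery for parsing.
- If goquery fails to parse the HTML, the usual cause is a failure on GitHub's side or a misbehaving proxy. The raw dump content is included in the error to help track down the problem.
- If no trending data is parsed, the GitHub page layout may have changed. The dump content is recorded in the error to help track down the problem.
- This example crawls only one page, but because the middleware handles failures uniformly for all requests, it is easy to extend to other pages while focusing only on the business logic.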
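The first three points can be illustrated with a rough standard-library analogue. This is only a sketch under stated assumptions: it is not how req implements EnableDumpEachRequest or OnAfterResponse, and the names fetch and run are hypothetical. It keeps a dump of the exchange in memory, turns non-2xx responses into errors that carry the dump, and shows that the body stays readable after dumping. A local httptest server stands in for the real site.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
)

// fetch is a hypothetical helper (not part of req). It keeps a dump of the
// exchange in memory and returns an error carrying that dump when the
// status code is outside 200 to 299.
func fetch(c *http.Client, url string) (*http.Response, error) {
	resp, err := c.Get(url)
	if err != nil {
		return nil, err // Underlying error, e.g. a network failure.
	}
	var dump bytes.Buffer
	if b, err := httputil.DumpRequestOut(resp.Request, false); err == nil {
		dump.Write(b)
	}
	// DumpResponse drains the body and swaps in an in-memory copy, so
	// resp.Body remains readable afterwards.
	b, err := httputil.DumpResponse(resp, true)
	if err != nil {
		resp.Body.Close()
		return nil, err
	}
	dump.Write(b)
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		resp.Body.Close()
		return nil, fmt.Errorf("bad response, raw content:\n%s", dump.String())
	}
	return resp, nil
}

// run starts a local test server and exercises both outcomes. It returns the
// body read after the dump and the error message of the failing request.
func run() (body string, failMsg string) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/" {
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		io.WriteString(w, "<html><title>ok</title></html>")
	}))
	defer srv.Close()

	resp, err := fetch(srv.Client(), srv.URL+"/")
	if err == nil {
		defer resp.Body.Close()
		b, _ := io.ReadAll(resp.Body) // Body is still intact after dumping.
		body = string(b)
	}
	if _, err := fetch(srv.Client(), srv.URL+"/missing"); err != nil {
		failMsg = err.Error()
	}
	return body, failMsg
}

func main() {
	body, failMsg := run()
	fmt.Println("body after dump:", body)
	fmt.Println(failMsg)
}
```

With req, all of this is covered by the two calls on the client, which is why the crawl function can stay focused on parsing.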
Output
$ go run .
GitHub Trending:
No.1 toeverything/AFFiNE 7,334 stars total 3,269 stars this week
No.2 dragonflydb/dragonfly 10,613 stars total 1,365 stars this week
No.3 novuhq/novu 7,488 stars total 2,432 stars this week
No.4 withastro/astro 16,101 stars total 2,688 stars this week
No.5 craftzdog/dotfiles-public 3,240 stars total 582 stars this week
No.6 punk-security/dnsReaper 788 stars total 466 stars this week
No.7 MatrixTM/MHDDoS 6,018 stars total 974 stars this week
No.8 moyix/fauxpilot 5,090 stars total 1,137 stars this week
No.9 duckdb/duckdb 5,926 stars total 222 stars this week
No.10 jina-ai/discoart 2,496 stars total 489 stars this week
No.11 pesser/stable-diffusion 607 stars total 173 stars this week
No.12 termux/termux-app 14,817 stars total 269 stars this week
No.13 TheAlgorithms/Python 142,155 stars total 1,076 stars this week
No.14 iptv-org/iptv 54,544 stars total 896 stars this week
No.15 ethereum/solidity 17,896 stars total 142 stars this week
No.16 facebook/folly 22,958 stars total 123 stars this week
No.17 pointfreeco/swift-composable-architecture 6,713 stars total 83 stars this week
No.18 actions/runner-images 6,410 stars total 69 stars this week
No.19 raysan5/raylib 10,367 stars total 135 stars this week
No.20 gofiber/fiber 21,691 stars total 365 stars this week
No.21 TeamNewPipe/NewPipe 20,417 stars total 349 stars this week
No.22 coder/coder 1,723 stars total 266 stars this week
No.23 MiCode/Xiaomi_Kernel_OpenSource 6,706 stars total 36 stars this week
No.24 utmapp/UTM 14,608 stars total 406 stars this week
No.25 bitwarden/server 10,260 stars total 53 stars this week
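Testing the Failure Case
Try changing the URL to trigger a content-parsing failure, for example to https://www.baidu.com, then run the program again to see the result: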
$ go run .
2022/08/16 20:55:16 failed to parse trending, raw content:
GET / HTTP/1.1
Host: www.baidu.com
User-Agent: req/v3 (https://github.com/imroc/req)
Accept-Encoding: gzip
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: no-cache
Connection: keep-alive
Content-Length: 227
Content-Type: text/html
Date: Tue, 16 Aug 2022 12:55:16 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=0021315124C1DCBD6D6542551E4524D3; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1660654516; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=0021315124C1DCBD79E08683C3E42600:FG=1; max-age=31536000; expires=Wed, 16-Aug-23 12:55:16 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Strict-Transport-Security: max-age=0
Traceid: 1660654516037233921014200821962745403164
X-Frame-Options: sameorigin
X-Ua-Compatible: IE=Edge,chrome=1
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
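As the output above shows, the whole dump becomes part of the error message. If that is too noisy for some callers, one possible variation is to keep the short message and the dump separate. The sketch below is an assumption for illustration, not something the example or req provides: dumpError, withDump and errParse are hypothetical names, and the dump string is a shortened stand-in.

```go
package main

import (
	"errors"
	"fmt"
)

// errParse is a hypothetical sentinel mirroring the "failed to parse trending"
// error returned by the callback in main.
var errParse = errors.New("failed to parse trending")

// dumpError is a hypothetical wrapper that keeps the short message and the
// raw dump separate, so callers decide when to print the dump.
type dumpError struct {
	err  error
	dump string
}

// Error returns only the short message of the wrapped error.
func (e *dumpError) Error() string { return e.err.Error() }

// Unwrap lets errors.Is and errors.As see the wrapped error.
func (e *dumpError) Unwrap() error { return e.err }

// Dump returns the raw request and response content.
func (e *dumpError) Dump() string { return e.dump }

// withDump attaches a dump to err. It returns nil when err is nil.
func withDump(err error, dump string) error {
	if err == nil {
		return nil
	}
	return &dumpError{err: err, dump: dump}
}

func main() {
	// Shortened stand-in for what resp.Dump() would return.
	err := withDump(errParse, "GET / HTTP/1.1\r\nHost: www.baidu.com\r\n")

	fmt.Println(err) // Short message only.

	var de *dumpError
	if errors.As(err, &de) {
		fmt.Println(de.Dump()) // Full dump on demand.
	}
	fmt.Println(errors.Is(err, errParse))
}
```

In the crawl function, this would mean returning something like withDump(err, resp.Dump()) in place of the fmt.Errorf calls. The trade-off is that log.Fatal(err) would no longer print the dump unless the caller asks for it.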