Integrate with goquery
Overview
goquery is often used to write crawlers in Go. This article shows how combining goquery with req makes a crawler more robust and easier to troubleshoot.
Code Example
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/imroc/req/v3"
)

var globalClient *req.Client

func init() {
	globalClient = req.C().
		// Enable dump at the request level for each request, and only
		// temporarily store the dump content in memory, so we can call
		// resp.Dump() to get the dump content when needed in response
		// middleware.
		EnableDumpEachRequest().
		OnAfterResponse(func(client *req.Client, resp *req.Response) error {
			if resp.Err != nil { // Ignore when there is an underlying error, e.g. network error.
				return nil
			}
			// Treat non-successful responses as errors, record raw dump content in error message.
			if !resp.IsSuccessResult() { // Status code is not between 200 and 299.
				resp.Err = fmt.Errorf("bad response, raw content:\n%s", resp.Dump())
			}
			return nil
		})
}

func crawl(url string, callback func(doc *goquery.Document) error) error {
	// Send request.
	resp, err := globalClient.R().Get(url)
	if err != nil {
		return err
	}

	// Pass resp.Body to goquery.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil { // Append raw dump content to error message if goquery parse failed to help troubleshoot.
		return fmt.Errorf("failed to parse html: %s, raw content:\n%s", err.Error(), resp.Dump())
	}
	err = callback(doc)
	if err != nil {
		err = fmt.Errorf("%s, raw content:\n%s", err.Error(), resp.Dump())
	}
	return err
}

func main() {
	// Crawl the weekly github trending page and print it out.
	err := crawl("https://github.com/trending?since=weekly", func(doc *goquery.Document) error {
		buf := new(bytes.Buffer)
		doc.Find(".Box .Box-row").Each(func(i int, s *goquery.Selection) {
			// Skip rows that have no repository link.
			href, ok := s.Find("h1 a").First().Attr("href")
			if !ok || href == "" {
				return
			}
			repo := strings.TrimPrefix(href, "/")
			starsTotal := s.Find("div.f6 a").First().Text()
			starsWeek := s.Find("div.f6 span").Last().Text()
			starsTotal = strings.TrimSpace(starsTotal)
			starsWeek = strings.TrimSpace(starsWeek)
			buf.WriteString(fmt.Sprintf("No.%d %s\t%s stars total\t%s\n", i+1, repo, starsTotal, starsWeek))
		})
		// An empty buffer means no row matched the selectors.
		if buf.Len() == 0 {
			return errors.New("failed to parse trending")
		}
		fmt.Println("GitHub Trending:")
		fmt.Println(buf.String())
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
- EnableDumpEachRequest is convenient syntactic sugar. Internally it uses request middleware to enable dump separately for each request and keeps the dump content in memory temporarily, so you can call Response.Dump() to retrieve it when needed.
- OnAfterResponse adds response middleware that handles failures uniformly: it treats unsuccessful responses (status codes not between 200 and 299) as errors, records the raw dump content in the error, and returns it to the caller.
- Although the body is read automatically by default, resp.Body is restored automatically and can be passed directly to goquery for parsing.
- If goquery fails to parse the HTML, the cause is usually a failure on GitHub's side or a problem with a proxy. In that case the raw dump content is appended to the error message for troubleshooting.
- If no trending data is parsed, the layout of the GitHub UI may have changed. The raw dump content is recorded in the error for troubleshooting.
- This example crawls only one page. Because the middleware handles errors uniformly for all requests, it is easy to extend the crawler to other pages while focusing only on business logic.
Run and Result
$ go run .
GitHub Trending:
No.1 toeverything/AFFiNE 7,334 stars total 3,269 stars this week
No.2 dragonflydb/dragonfly 10,613 stars total 1,365 stars this week
No.3 novuhq/novu 7,488 stars total 2,432 stars this week
No.4 withastro/astro 16,101 stars total 2,688 stars this week
No.5 craftzdog/dotfiles-public 3,240 stars total 582 stars this week
No.6 punk-security/dnsReaper 788 stars total 466 stars this week
No.7 MatrixTM/MHDDoS 6,018 stars total 974 stars this week
No.8 moyix/fauxpilot 5,090 stars total 1,137 stars this week
No.9 duckdb/duckdb 5,926 stars total 222 stars this week
No.10 jina-ai/discoart 2,496 stars total 489 stars this week
No.11 pesser/stable-diffusion 607 stars total 173 stars this week
No.12 termux/termux-app 14,817 stars total 269 stars this week
No.13 TheAlgorithms/Python 142,155 stars total 1,076 stars this week
No.14 iptv-org/iptv 54,544 stars total 896 stars this week
No.15 ethereum/solidity 17,896 stars total 142 stars this week
No.16 facebook/folly 22,958 stars total 123 stars this week
No.17 pointfreeco/swift-composable-architecture 6,713 stars total 83 stars this week
No.18 actions/runner-images 6,410 stars total 69 stars this week
No.19 raysan5/raylib 10,367 stars total 135 stars this week
No.20 gofiber/fiber 21,691 stars total 365 stars this week
No.21 TeamNewPipe/NewPipe 20,417 stars total 349 stars this week
No.22 coder/coder 1,723 stars total 266 stars this week
No.23 MiCode/Xiaomi_Kernel_OpenSource 6,706 stars total 36 stars this week
No.24 utmapp/UTM 14,608 stars total 406 stars this week
No.25 bitwarden/server 10,260 stars total 53 stars this week
Test for Exceptions
To trigger a content parsing error, change the URL to a page that does not match the selectors, such as https://www.baidu.com, and then run the program again to see the effect:
$ go run .
2022/08/16 20:55:16 failed to parse trending, raw content:
GET / HTTP/1.1
Host: www.baidu.com
User-Agent: req/v3 (https://github.com/imroc/req)
Accept-Encoding: gzip
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: no-cache
Connection: keep-alive
Content-Length: 227
Content-Type: text/html
Date: Tue, 16 Aug 2022 12:55:16 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=0021315124C1DCBD6D6542551E4524D3; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1660654516; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=0021315124C1DCBD79E08683C3E42600:FG=1; max-age=31536000; expires=Wed, 16-Aug-23 12:55:16 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Strict-Transport-Security: max-age=0
Traceid: 1660654516037233921014200821962745403164
X-Frame-Options: sameorigin
X-Ua-Compatible: IE=Edge,chrome=1
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>