jexjws/readme.md

typecho-robots.txt

Some robots.txt patterns I use, based on "Create and submit a robots.txt file | Google Search Central" and on the robots.txt files of several other websites.

0.

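For today's highly capable crawlers, the restrictions in section 1 may be more of a burden than a help, so this variant allows everything.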

User-agent: *
Disallow: 

# The Sitemap line expects a fully qualified URL, not a bare path
Sitemap: <full sitemap URL>

1. Use a sitemap and crawl only posts

Without URL rewriting

User-agent: *
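# Allowlist approach: disallow everything by default, so crawlers fetch only
# the URLs that are explicitly allowed.
# Upside: less load on both the crawler and the site (files that do not matter
# to bots, such as theme files, are never crawled).
# Downside: it is easy to forget to allow a resource (at first I forgot to
# allow the images inside posts).
Disallow: /
# Post content
Allow: /index.php/archives/
# Images and other resources used in posts
Allow: /usr/uploads/
Allow: /<sitemap path>

# The Sitemap line expects a fully qualified URL, not a bare path
Sitemap: <full sitemap URL>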

With URL rewriting

User-agent: *
Disallow: /
Allow: /archives/
Allow: /usr/uploads/
Allow: /<sitemap path>

# The Sitemap line expects a fully qualified URL, not a bare path
Sitemap: <full sitemap URL>

Add as needed

Allow: /category/
Allow: /tag/
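
The allowlist rules above depend on how crawlers resolve a conflict between `Disallow: /` and a more specific `Allow:` line. Under RFC 9309 and Google's documentation, the longest matching rule wins, and `Allow` wins a tie. The following is a minimal sketch of that precedence for checking a rule set before deploying it. It assumes plain prefix matching only (the `*` and `$` wildcards are not handled), and the `is_allowed` helper, the `RULES` list, and the sample paths are illustrative names, not part of Typecho or of any library.

```python
# Minimal sketch of robots.txt precedence: the longest matching rule wins,
# and "allow" wins a tie. Prefix matching only; "*" and "$" are not handled.
# RULES mirrors the "with URL rewriting" variant above (hypothetical helper).
RULES = [
    ("disallow", "/"),
    ("allow", "/archives/"),
    ("allow", "/usr/uploads/"),
]


def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under `rules`."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = kind == "allow"
        # A longer prefix is more specific; on equal length, "allow" wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    # Sample paths are assumptions about a typical Typecho layout.
    for sample in [
        "/archives/1/",
        "/usr/uploads/2024/01/a.png",
        "/usr/themes/default/style.css",
        "/",
    ]:
        print(sample, is_allowed(sample))
```

Note that Python's built-in `urllib.robotparser` applies rules in file order rather than by longest match, so it is likely to report `/archives/` as blocked for a file that lists `Disallow: /` first. That is why this sketch implements the precedence rule directly.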
