729 lines
13 KiB
Markdown
729 lines
13 KiB
Markdown
---
|
||
description: 生成或优化 robots.txt 文件
|
||
argument-hint: [domain]
|
||
---
|
||
|
||
为 Next.js 项目生成 SEO 友好的 robots.txt 文件。包含基本的爬虫规则、Sitemap 引用和常见目录的禁止规则。
|
||
|
||
## 功能
|
||
|
||
- ✅ 生成标准的 robots.txt 文件
|
||
- ✅ 自动包含 Sitemap 引用
|
||
- ✅ 禁止敏感目录和文件
|
||
- ✅ 允许所有主要爬虫
|
||
- ✅ 支持爬虫延迟设置
|
||
- ✅ 适配 Next.js 项目结构
|
||
|
||
## 参数
|
||
|
||
- `$1` 或 `$ARGUMENTS`: 网站域名(可选)
|
||
- 用于生成 Sitemap URL
|
||
- 示例:`https://yourdomain.com`
|
||
- 如果不提供,Sitemap 行会被注释或省略
|
||
|
||
## 使用示例
|
||
|
||
### 示例 1:基本用法
|
||
|
||
```bash
|
||
/robots-txt
|
||
```
|
||
|
||
输出:
|
||
```markdown
|
||
# robots.txt 文件
|
||
|
||
为你的 Next.js 项目生成的 robots.txt 文件:
|
||
|
||
```txt
|
||
# Allow all crawlers
|
||
User-agent: *
|
||
Allow: /
|
||
|
||
# Disallow common Next.js and private directories
|
||
Disallow: /api/
|
||
Disallow: /_next/
|
||
Disallow: /static/
|
||
Disallow: /.well-known/
|
||
|
||
# Disallow admin and private areas (if applicable)
|
||
# Disallow: /admin/
|
||
# Disallow: /private/
|
||
# Disallow: /dashboard/
|
||
|
||
# Sitemap (uncomment and update with your domain)
|
||
# Sitemap: https://yourdomain.com/sitemap.xml
|
||
```
|
||
|
||
## 文件位置
|
||
|
||
将此文件放置在项目根目录的 `public/` 文件夹中:
|
||
|
||
**路径:** `public/robots.txt`
|
||
|
||
复制上面的内容到 `public/robots.txt` 文件。
|
||
|
||
## 验证
|
||
|
||
创建后,访问 `https://yourdomain.com/robots.txt` 验证文件可访问。
|
||
|
||
## Next.js 自动生成
|
||
|
||
如果你使用 Next.js 13.3+,也可以使用 `generateRobotTxt` 函数自动生成:
|
||
|
||
**`app/robots.ts` (推荐):**
|
||
|
||
```typescript
|
||
import { MetadataRoute } from 'next'
|
||
|
||
export default function robots(): MetadataRoute.Robots {
|
||
return {
|
||
rules: [
|
||
{
|
||
userAgent: '*',
|
||
allow: '/',
|
||
disallow: ['/api/', '/_next/', '/static/'],
|
||
},
|
||
],
|
||
sitemap: 'https://yourdomain.com/sitemap.xml',
|
||
}
|
||
}
|
||
```
|
||
|
||
这会自动生成 `/robots.txt` 路由。
|
||
```
|
||
|
||
### 示例 2:指定域名
|
||
|
||
```bash
|
||
/robots-txt https://yourdomain.com
|
||
```
|
||
|
||
输出:
|
||
```markdown
|
||
# robots.txt 文件(包含 Sitemap)
|
||
|
||
```txt
|
||
# Allow all crawlers
|
||
User-agent: *
|
||
Allow: /
|
||
|
||
# Disallow common Next.js directories
|
||
Disallow: /api/
|
||
Disallow: /_next/
|
||
Disallow: /static/
|
||
|
||
# Disallow private areas
|
||
Disallow: /admin/
|
||
Disallow: /_next/image/
|
||
Disallow: /private/
|
||
|
||
# Crawl delay (optional)
|
||
# Crawl-delay: 1
|
||
|
||
# Sitemap
|
||
Sitemap: https://yourdomain.com/sitemap.xml
|
||
```
|
||
|
||
## 部署
|
||
|
||
将上述内容保存到 `public/robots.txt`,部署后可通过以下 URL 访问:
|
||
https://yourdomain.com/robots.txt
|
||
|
||
## 注意事项
|
||
|
||
- robots.txt 文件必须位于网站根目录
|
||
- 文件名必须小写:`robots.txt`
|
||
- 文件更新后可能需要几天才能被搜索引擎重新抓取
|
||
- 使用 Google Search Console 的 robots.txt 测试工具验证
|
||
```
|
||
|
||
### 示例 3:博客项目
|
||
|
||
```bash
|
||
/robots-txt https://blog.yourdomain.com
|
||
```
|
||
|
||
输出:
|
||
```markdown
|
||
# robots.txt for Blog
|
||
|
||
```txt
|
||
# Allow all crawlers
|
||
User-agent: *
|
||
Allow: /
|
||
|
||
# Disallow admin and management areas
|
||
Disallow: /admin/
|
||
Disallow: /api/
|
||
Disallow: /_next/
|
||
Disallow: /wp-admin/
|
||
Disallow: /wp-includes/
|
||
|
||
# Disallow search and filter pages (avoid duplicate content)
|
||
Disallow: /search?
|
||
Disallow: /filter?
|
||
Disallow: /tag/*/page/
|
||
Disallow: /category/*/page/
|
||
|
||
# Allow specific feeds
|
||
Allow: /feed/
|
||
Allow: /feed/rss/
|
||
|
||
# Disallow other feeds
|
||
Disallow: */feed/
|
||
Disallow: */feed/rss/
|
||
Disallow: */feed/atom/
|
||
|
||
# Sitemap
|
||
Sitemap: https://blog.yourdomain.com/sitemap.xml
|
||
```
|
||
|
||
## 博客特定规则说明
|
||
|
||
**禁止搜索和分页页面:**
|
||
- 避免重复内容问题
|
||
- 节省爬虫预算
|
||
|
||
**允许主要 feed:**
|
||
- 允许主 feed 被抓取
|
||
- 禁止其他 feed 变体
|
||
|
||
**防止爬虫陷入无限循环:**
|
||
- 禁止带参数的搜索 URL
|
||
- 禁止深层分页
|
||
```
|
||
|
||
### 示例 4:电商项目
|
||
|
||
```bash
|
||
/robots-txt https://store.yourdomain.com
|
||
```
|
||
|
||
输出:
|
||
```markdown
|
||
# robots.txt for E-commerce Store
|
||
|
||
```txt
|
||
# Allow all crawlers
|
||
User-agent: *
|
||
Allow: /
|
||
|
||
# Disallow admin and backend
|
||
Disallow: /admin/
|
||
Disallow: /api/
|
||
Disallow: /_next/
|
||
Disallow: /checkout/
|
||
Disallow: /cart/
|
||
Disallow: /my-account/
|
||
Disallow: /wishlist/
|
||
|
||
# Disallow search and filter pages
|
||
Disallow: /search?
|
||
Disallow: /catalogsearch/
|
||
Disallow: /filter?
|
||
|
||
# Disallow sorted and filtered URLs
|
||
Disallow: /*?sort=
|
||
Disallow: /*?order=
|
||
Disallow: /*?filter=
|
||
Disallow: /*?price=
|
||
|
||
# Disallow session IDs and tracking parameters
|
||
Disallow: /*?PHPSESSID=
|
||
Disallow: /*?utm_source=
|
||
Disallow: /*?utm_medium=
|
||
Disallow: /*?utm_campaign=
|
||
|
||
# Allow product images
|
||
Allow: /media/catalog/
|
||
Allow: /products/images/
|
||
|
||
# Disallow temporary files
|
||
Disallow: /tmp/
|
||
Disallow: /var/
|
||
Disallow: /cache/
|
||
|
||
# Specific crawler rules
|
||
# Block aggressive crawlers (optional)
|
||
User-agent: SemrushBot
|
||
Disallow: /
|
||
|
||
User-agent: AhrefsBot
|
||
Disallow: /
|
||
|
||
# Sitemap
|
||
Sitemap: https://store.yourdomain.com/sitemap.xml
|
||
Sitemap: https://store.yourdomain.com/sitemap-product.xml
|
||
Sitemap: https://store.yourdomain.com/sitemap-category.xml
|
||
```
|
||
|
||
## 电商特定规则说明
|
||
|
||
**禁止结账流程页面:**
|
||
- /checkout/, /cart/ 等页面不需要被索引
|
||
|
||
**禁止搜索和过滤 URL:**
|
||
- 避免重复内容
|
||
- 节省爬虫预算
|
||
|
||
**允许产品图片:**
|
||
- 图片搜索流量有价值
|
||
|
||
**阻止 SEO 工具爬虫:**
|
||
- SemrushBot, AhrefsBot 等可能消耗服务器资源
|
||
|
||
**多个 Sitemap:**
|
||
- 电商网站通常有多个 sitemap(产品、分类、静态页面)
|
||
```
|
||
|
||
### 示例 5:多语言项目
|
||
|
||
```bash
|
||
/robots-txt https://yourdomain.com
|
||
```
|
||
|
||
输出:
|
||
```markdown
|
||
# robots.txt for Multi-language Site
|
||
|
||
```txt
|
||
# Allow all crawlers
|
||
User-agent: *
|
||
Allow: /
|
||
|
||
# Disallow common directories
|
||
Disallow: /api/
|
||
Disallow: /_next/
|
||
Disallow: /static/
|
||
|
||
# Sitemap for all languages
|
||
Sitemap: https://yourdomain.com/sitemap.xml
|
||
Sitemap: https://yourdomain.com/en/sitemap.xml
|
||
Sitemap: https://yourdomain.com/zh/sitemap.xml
|
||
Sitemap: https://yourdomain.com/es/sitemap.xml
|
||
```
|
||
|
||
## 多语言 Sitemap 策略
|
||
|
||
**选项 1:主 sitemap 引用子 sitemap**
|
||
```xml
|
||
<!-- sitemap.xml -->
|
||
<sitemapindex>
|
||
<sitemap>
|
||
<loc>https://yourdomain.com/en/sitemap.xml</loc>
|
||
</sitemap>
|
||
<sitemap>
|
||
<loc>https://yourdomain.com/zh/sitemap.xml</loc>
|
||
</sitemap>
|
||
</sitemapindex>
|
||
```
|
||
|
||
**选项 2:每个语言的 sitemap 单独引用**
|
||
- 在 robots.txt 中列出所有语言的 sitemap
|
||
- 确保每个语言的 sitemap URL 正确
|
||
```
|
||
|
||
## 高级配置
|
||
|
||
### Crawl Delay(爬虫延迟)
|
||
|
||
```txt
|
||
User-agent: *
|
||
Crawl-delay: 1
|
||
|
||
# 或针对特定爬虫
|
||
User-agent: Googlebot
|
||
Crawl-delay: 0
|
||
|
||
User-agent: Bingbot
|
||
Crawl-delay: 1
|
||
```
|
||
|
||
**说明:**
|
||
- 设置爬虫请求之间的延迟(秒)
|
||
- Google 通常忽略 crawl-delay
|
||
- 对小网站可能有用
|
||
|
||
### 阻止特定爬虫
|
||
|
||
```txt
|
||
# 阻止 SEO 工具爬虫
|
||
User-agent: SemrushBot
|
||
Disallow: /
|
||
|
||
User-agent: AhrefsBot
|
||
Disallow: /
|
||
|
||
User-agent: MJ12bot
|
||
Disallow: /
|
||
|
||
# 阻止存档网站
|
||
User-agent: archive.org_bot
|
||
Disallow: /
|
||
```
|
||
|
||
**注意:** 只阻止消耗大量资源的爬虫,不要阻止主要搜索引擎。
|
||
|
||
### 允许特定资源
|
||
|
||
```txt
|
||
User-agent: *
|
||
Allow: /images/
|
||
Allow: /css/
|
||
Allow: /js/
|
||
|
||
Disallow: /api/
|
||
Disallow: /admin/
|
||
```
|
||
|
||
**说明:** 在禁止规则之前使用 Allow 规则。
|
||
|
||
### 针对不同爬虫的不同规则
|
||
|
||
```txt
|
||
# Googlebot
|
||
User-agent: Googlebot
|
||
Allow: /
|
||
Disallow: /private/
|
||
|
||
# Bingbot
|
||
User-agent: Bingbot
|
||
Allow: /
|
||
Disallow: /private/
|
||
|
||
# 其他所有爬虫
|
||
User-agent: *
|
||
Allow: /
|
||
Disallow: /private/
|
||
Disallow: /admin/
|
||
```
|
||
|
||
## robots.txt 语法
|
||
|
||
### 基本指令
|
||
|
||
**User-agent:**
|
||
```
|
||
User-agent: * # 所有爬虫
|
||
User-agent: Googlebot # 特定爬虫
|
||
```
|
||
|
||
**Disallow:**
|
||
```
|
||
Disallow: / # 禁止整个网站
|
||
Disallow: /admin/ # 禁止目录
|
||
Disallow: /file.html # 禁止文件
|
||
Disallow: /*.pdf # 禁止所有 PDF
|
||
```
|
||
|
||
**Allow:**
|
||
```
|
||
Allow: / # 允许整个网站
|
||
Allow: /images/ # 允许目录
|
||
```
|
||
|
||
**Sitemap:**
|
||
```
|
||
Sitemap: https://yourdomain.com/sitemap.xml
|
||
Sitemap: https://yourdomain.com/sitemap-2.xml
|
||
```
|
||
|
||
**Crawl-delay:**
|
||
```
|
||
Crawl-delay: 1 # 1 秒延迟
|
||
```
|
||
|
||
### 通配符和模式
|
||
|
||
**`*` 匹配任意字符:**
|
||
```
|
||
Disallow: /*? # 禁止所有带参数的 URL
|
||
Disallow: /*.pdf # 禁止所有 PDF 文件
|
||
Disallow: /private* # 禁止以 /private 开头的所有路径
|
||
```
|
||
|
||
**`$` 结束标记:**
|
||
```
|
||
Disallow: /*.pdf$ # 禁止以 .pdf 结尾的 URL
|
||
Disallow: /print$ # 禁止以 /print 结尾的 URL
|
||
```
|
||
|
||
## 最佳实践
|
||
|
||
### 1. 保持简单
|
||
|
||
```txt
|
||
User-agent: *
|
||
Allow: /
|
||
Disallow: /admin/
|
||
```
|
||
|
||
**大多数网站只需如此简单。**
|
||
|
||
### 2. 不要依赖 robots.txt 保护敏感内容
|
||
|
||
- robots.txt 是公开的,任何人都可以查看
|
||
- 使用身份验证保护敏感内容
|
||
- robots.txt 只是建议,不是强制
|
||
|
||
### 3. 禁止重复内容
|
||
|
||
```txt
|
||
# 禁止搜索结果、过滤、排序页面
|
||
Disallow: /search?
|
||
Disallow: /filter?
|
||
Disallow: /*?sort=
|
||
Disallow: /*?order=
|
||
```
|
||
|
||
### 4. 明确允许重要资源
|
||
|
||
```txt
|
||
User-agent: *
|
||
Allow: /images/
|
||
Allow: /css/
|
||
Allow: /js/
|
||
|
||
Disallow: /api/
|
||
```
|
||
|
||
### 5. 测试和验证
|
||
|
||
使用以下工具测试:
|
||
- Google Search Console - robots.txt 测试工具
|
||
- Bing Webmaster Tools - robots.txt 测试工具
|
||
- 在线验证工具
|
||
|
||
## 常见错误
|
||
|
||
### 错误 1:禁止整个网站
|
||
|
||
```txt
|
||
# 错误
|
||
User-agent: *
|
||
Disallow: /
|
||
```
|
||
|
||
**修复:**
|
||
```txt
|
||
# 正确
|
||
User-agent: *
|
||
Allow: /
|
||
Disallow: /admin/
|
||
```
|
||
|
||
### 错误 2:错误的路径
|
||
|
||
```txt
|
||
# 错误 - 缺少尾部斜杠
|
||
Disallow: /admin
|
||
|
||
# 正确 - 目录应该有尾部斜杠
|
||
Disallow: /admin/
|
||
```
|
||
|
||
### 错误 3:阻止搜索引擎
|
||
|
||
```txt
|
||
# 错误 - 阻止主要搜索引擎
|
||
User-agent: Googlebot
|
||
Disallow: /
|
||
```
|
||
|
||
**除非有充分理由,否则不要阻止主要搜索引擎。**
|
||
|
||
### 错误 4:过度复杂的规则
|
||
|
||
```txt
|
||
# 错误 - 过于复杂
|
||
User-agent: *
|
||
Allow: /images/jpeg/
|
||
Allow: /images/png/
|
||
Disallow: /images/gif/
|
||
Disallow: /images/webp/
|
||
Disallow: /images/svg/
|
||
# ... 更多规则
|
||
```
|
||
|
||
**保持简单:**
|
||
```txt
|
||
User-agent: *
|
||
Allow: /images/
|
||
Disallow: /private/
|
||
```
|
||
|
||
## Next.js 集成
|
||
|
||
### 选项 1:静态文件(简单)
|
||
|
||
将 `robots.txt` 放在 `public/` 目录:
|
||
```
|
||
public/
|
||
robots.txt
|
||
sitemap.xml
|
||
```
|
||
|
||
### 选项 2:自动生成(推荐)
|
||
|
||
**`app/robots.ts` 或 `app/robots.js`:**
|
||
|
||
```typescript
|
||
import { MetadataRoute } from 'next'
|
||
|
||
export default function robots(): MetadataRoute.Robots {
|
||
const baseUrl = 'https://yourdomain.com'
|
||
|
||
return {
|
||
rules: [
|
||
{
|
||
userAgent: '*',
|
||
allow: '/',
|
||
disallow: [
|
||
'/api/',
|
||
'/_next/',
|
||
'/static/',
|
||
'/admin/',
|
||
],
|
||
},
|
||
{
|
||
userAgent: ['SemrushBot', 'AhrefsBot'],
|
||
disallow: '/',
|
||
},
|
||
],
|
||
sitemap: `${baseUrl}/sitemap.xml`,
|
||
}
|
||
}
|
||
```
|
||
|
||
**环境变量配置:**
|
||
|
||
```typescript
|
||
import { MetadataRoute } from 'next'
|
||
|
||
export default function robots(): MetadataRoute.Robots {
|
||
const baseUrl = process.env.NEXT_PUBLIC_BASE_URL || 'https://yourdomain.com'
|
||
|
||
return {
|
||
rules: [
|
||
{
|
||
userAgent: '*',
|
||
allow: '/',
|
||
disallow: ['/api/', '/_next/'],
|
||
},
|
||
],
|
||
sitemap: `${baseUrl}/sitemap.xml`,
|
||
}
|
||
}
|
||
```
|
||
|
||
**动态配置:**
|
||
|
||
```typescript
|
||
import { MetadataRoute } from 'next'
|
||
|
||
export default function robots(): MetadataRoute.Robots {
|
||
return {
|
||
rules: [
|
||
{
|
||
userAgent: '*',
|
||
allow: '/',
|
||
disallow: process.env.NODE_ENV === 'production'
|
||
? ['/admin/', '/private/']
|
||
: ['/'],
|
||
},
|
||
],
|
||
sitemap: 'https://yourdomain.com/sitemap.xml',
|
||
}
|
||
}
|
||
```
|
||
|
||
## 验证和测试
|
||
|
||
### 1. 本地验证
|
||
|
||
```bash
|
||
# 开发环境
|
||
curl http://localhost:3000/robots.txt
|
||
|
||
# 生产环境
|
||
curl https://yourdomain.com/robots.txt
|
||
```
|
||
|
||
### 2. Google Search Console
|
||
|
||
1. 打开 Google Search Console
|
||
2. 选择你的网站
|
||
3. 导航到"设置" → "robots.txt 测试工具"
|
||
4. 输入你的 robots.txt 内容
|
||
5. 测试特定 URL 是否被允许或禁止
|
||
|
||
### 3. 在线验证工具
|
||
|
||
- Google robots.txt 测试工具
|
||
- Bing robots.txt 测试工具
|
||
- Screaming Frog SEO Spider
|
||
|
||
### 4. 手动检查清单
|
||
|
||
- [ ] 文件位于根目录:`/robots.txt`
|
||
- [ ] 文件名小写:`robots.txt`
|
||
- [ ] 文件可访问:返回 200 状态码
|
||
- [ ] 语法正确:没有错误
|
||
- [ ] 规则合理:不会意外阻止重要内容
|
||
- [ ] Sitemap 引用正确:URL 完整且有效
|
||
|
||
## 常见问题
|
||
|
||
### Q: robots.txt 更新后多久生效?
|
||
|
||
A: 通常需要几天到一周。搜索引擎不会立即重新抓取。
|
||
|
||
### Q: 如何隐藏敏感内容?
|
||
|
||
A: 不要使用 robots.txt。使用:
|
||
- 身份验证(登录)
|
||
- noindex meta 标签
|
||
- HTTP 身份验证
|
||
- IP 限制
|
||
|
||
### Q: 禁止目录后,该目录的图片还能被索引吗?
|
||
|
||
A: 不能。禁止目录会禁止该目录下的所有资源。
|
||
|
||
### Q: 如何阻止特定参数的 URL?
|
||
|
||
A: 使用通配符:
|
||
```
|
||
Disallow: /*?parameter=
|
||
Disallow: /*?utm_source=
|
||
```
|
||
|
||
### Q: 需要为每个爬虫单独设置规则吗?
|
||
|
||
A: 通常不需要。大多数网站对所有爬虫使用相同规则。
|
||
|
||
### Q: robots.txt 与 meta robots 标签有什么区别?
|
||
|
||
A:
|
||
- **robots.txt**: 控制整个网站或目录的爬取
|
||
- **meta robots**: 控制单个页面的索引和跟随
|
||
|
||
通常结合使用以获得最佳效果。
|
||
|
||
## 相关命令
|
||
|
||
- `/llm-txt` - 生成 llm.txt 文件
|
||
- `/seo-audit` - 检查 robots.txt 配置
|
||
- `/seo-check` - 验证 robots.txt 可访问性
|
||
|
||
## 注意事项
|
||
|
||
- robots.txt 文件必须位于网站根目录
|
||
- 文件名必须小写:`robots.txt`
|
||
- robots.txt 是公开的,任何人都可以查看
|
||
- 不使用 robots.txt 保护敏感内容
|
||
- 定期测试和验证规则
|
||
- 更新后可能需要几天才能生效
|
||
- 使用 Google Search Console 监控爬取错误
|