Robots.txt and sitemap pages using Next.js and a Headless CMS

A brief introduction to CMSs and SEO-related topics

Search Engine Optimization (SEO) is one of those frontend things that can always get tricky. You can have really good HTML practices, the fastest load times, meta tags and social media images, and all of that will help a lot to improve the positioning of your site. However, there are two special pages that every site that wants to be well indexed and crawled by search crawlers must have: robots.txt and sitemap.xml.

In this post, we'll go through the details of what these pages are and how to build them in a Next.js project fetching data from a Headless Content Management System (CMS). But first of all, what is a CMS?

CMS Definition

"A CMS, short for content management system, is a software application that allows users to build and manage a website without having to code it from scratch or know how to code at all. [...] With a CMS, you can create, manage, modify, and publish content in a user-friendly interface." (source)

Okay, we know what a CMS is, but what about a Headless CMS?

You can think of headless as "detached or decoupled from the website that serves the content, mainly consumed via API". To summarize, a Headless CMS is a Content Management System that is decoupled from the main application and serves its content via API. If you want to go deeper into the definition, you can visit the official explanation from one of the biggest headless CMSs around today.

There are many CMSs out there: Storyblok, Drupal, WordPress, Contentful, Strapi and Sanity, among others. Today I'll use Contentful because it's the Headless CMS I have used the most (: but the example should apply pretty much the same way to any of them.

DISCLAIMER: This is not a post about Contentful or the basics of any Headless CMS. If you're not familiar with these, I encourage you to take a look at the most popular options available and pick the one that best suits your needs.

Let's start talking about the main topics of this post.

Robots.txt

"Robots.txt is a text file webmasters create to instruct web robots or crawlers (typically search engine robots) how to crawl pages on their website." (source)

We can achieve this by telling search engines which robots can crawl our site and which pages they can crawl. The basic properties for this are the following (a small example file follows this list):

  • User-agent: defines which crawlers the following rules apply to.

  • Disallow rules: pages that cannot be crawled.

  • Allow rules (Googlebot and some other crawlers): pages that can be crawled, even inside an otherwise disallowed path.

  • Crawl-delay: how many seconds a crawler should wait before loading and crawling page content.

  • Sitemap: where the sitemap file is located.
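
To make these directives more concrete, here is a small, hypothetical robots.txt (the paths and values are just for illustration):

# Hypothetical example, adjust the paths to your own site
User-agent: *
Crawl-delay: 10
Disallow: /admin
Allow: /admin/public-page
Sitemap: https://my.site.com/sitemap.xml

Here every crawler is asked to wait 10 seconds between requests, to stay out of /admin except for one public page, and is pointed to the sitemap location.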

The Next.js implementation for this page is pretty straightforward and it does not need any CMS, but I didn't want to create a post just to paste this code fragment so here we are, putting everything together c:

// pages/robots.txt.tsx
import { Component } from 'react'
import { NextPageContext } from 'next'

const isAllowed = process.env.NEXT_PUBLIC_ALLOW_CRAWLING // 'true' or 'false'
const siteUrl = process.env.ORIGIN_URL // https://my.site.com

const allow = `User-agent: *
Disallow: /500
Disallow: /404
Disallow: /403
Allow: /
Sitemap: ${siteUrl}/sitemap.xml
`

const disallow = `User-agent: *
Disallow: /
`

export default class RobotsTxt extends Component {
  static async getInitialProps({ res }: NextPageContext): Promise<void> {
    // Pick the right configuration and serve it as plain text
    const robotsFile = isAllowed === 'true' ? allow : disallow
    res?.writeHead(200, {
      'Content-Type': 'text/plain',
    })
    res?.end(robotsFile)
  }
}

If you're not using TypeScript you can just remove the types from the code. What we're doing is applying a configuration based on whether the site should allow crawling or not, which is defined using an environment (from now on "env") variable. The code was written with multiple environments in mind (where you don't want your development or staging envs to be crawled). If your site only has one environment, you can ignore this and just use the "allow" configuration. The same principle applies to the ORIGIN_URL variable.
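
As a reference, the env values could look something like this (the exact variable values are just an assumption, adapt them to your setup and your hosting provider's env settings):

# production
NEXT_PUBLIC_ALLOW_CRAWLING=true
ORIGIN_URL=https://my.site.com

# staging / development
NEXT_PUBLIC_ALLOW_CRAWLING=false
ORIGIN_URL=https://staging.my.site.com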

The disallow variable tells every search engine robot that it cannot crawl any page of the site.

On the other hand, the allow variable defines that every user agent (or robot) can crawl the whole site except the 403, 404 and 500 pages. This is mainly because we don't care about robots crawling those, since they don't have relevant content (unless your error pages are flashy, funny and have interesting information).
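
As a side note, if you prefer function components over class components, a rough sketch of the same idea using getServerSideProps (same env variables as before) could look like this:

// pages/robots.txt.tsx — alternative sketch using getServerSideProps
import { GetServerSideProps } from 'next'

const siteUrl = process.env.ORIGIN_URL

// Same idea as the allow/disallow strings above, shortened here
const allow = `User-agent: *
Allow: /
Sitemap: ${siteUrl}/sitemap.xml
`
const disallow = `User-agent: *
Disallow: /
`

// The component renders nothing; the response is written in getServerSideProps
const RobotsTxtPage = () => null

export const getServerSideProps: GetServerSideProps = async ({ res }) => {
  const robotsFile = process.env.NEXT_PUBLIC_ALLOW_CRAWLING === 'true' ? allow : disallow
  res.setHeader('Content-Type', 'text/plain')
  res.write(robotsFile)
  res.end()
  // Next.js still expects a props object even though the response is already sent
  return { props: {} }
}

export default RobotsTxtPage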

Sitemap.xml

Now the real challenging section (kind of) (:

First of all, what is a sitemap? Based on my official, patented, personal and not-stolen-at-all description:

"Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL [...] so that search engines can more intelligently crawl the site." (source).

So basically it defines the pages our site has and some additional metadata. The format for this page is the following:

<?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://my.site.com</loc>
      <changefreq>daily</changefreq>
      <lastmod>2023-01-03</lastmod>
      <priority>0.8</priority>
    </url>
    ...
  </urlset>

For each page, we have to define a <url> block with some metadata (there are more properties, but these are the most common; see the small type sketch after this list):

  1. loc: the page's URL.

  2. changefreq: how frequently the page is likely to change, so crawlers know how often to re-check it.

  3. lastmod: the date the page was last modified.

  4. priority: the priority of this URL relative to the rest of the site's pages (from 0.0 to 1.0).
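
If it helps to visualize it, these fields can be described with a small TypeScript type (the type name is just for illustration, it's not used later in the post):

// Illustrative only: the metadata we'll output for each <url> block
type SitemapUrlEntry = {
  loc: string // page's URL
  changefreq?: 'always' | 'hourly' | 'daily' | 'weekly' | 'monthly' | 'yearly' | 'never'
  lastmod?: string // e.g. '2023-01-03' or a full ISO timestamp
  priority?: number // 0.0 to 1.0
}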

Now the question is, how can we generate this kind of page using data from our favorite Headless CMS?

When using this kind of CMS, pages normally live as entries with fields filled with content (titles, page slugs, banners, sections, headers, footers, etc). Then we fetch these pages and build the UI using some frontend library or framework like Next.js.

First things first, we need to define the sitemap page component in our application. Note that I'm importing getPages and createSitemap from a helper file here (the path is up to you); we'll define both functions below:

// pages/sitemap.xml.tsx
import { Component } from 'react'
import { NextPageContext } from 'next'

// adjust this path to wherever you keep the helpers defined below
import { createSitemap, getPages } from 'services/sitemap'

export default class Sitemap extends Component {
  static async getInitialProps({ res }: NextPageContext): Promise<void> {
    // Fetch the pages from the CMS and serve the generated XML
    const pages = await getPages()
    res?.writeHead(200, { 'Content-Type': 'text/xml' })
    res?.write(createSitemap(pages))
    res?.end()
  }
}

Note that we're using 2 functions here:

  1. An async function called getPages that fetches the page data. This will help us retrieve the pages from our CMS (in this case, Contentful) as an array.

  2. A function called createSitemap that receives the pages data as a parameter and returns the sitemap XML as a string.

Let's dive into the second one first (it's also the easiest one). Before that, I'll define a type for the Contentful pages (you can skip this if you're using vanilla JS):

export type ContentfulPage = {
  title: string
  slug: string
  header: ContentfulHeader
  blocks?: ContentfulBlock[]
  footer: ContentfulFooter
  updatedAt?: string
}

This is basically how our page is built in Contentful. It has a title, a slug, a header component, some block components that make up the page itself (like banners, sections, cards, etc.), a footer, and the updatedAt date of the page.

Now the function itself:

const createSitemap = (pages: ContentfulPage[]) => {
  return `<?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    ${generateLinks(pages)}
  </urlset>`
}

You can see that we're using another function called generateLinks that receives our pages data. Let's take a look into it:

const generateLinks = (pages: ContentfulPage[]) => {
  const pageItems = pages.map((page) => {
    // The home page usually has '/' as its slug, so avoid generating a double slash
    const slugPath = page.slug === '/' ? '' : `/${page.slug}`
    const url = `${process.env.ORIGIN_URL}${slugPath}`
    return `
        <url>
          <loc>${url}</loc>
          <changefreq>daily</changefreq>
          <lastmod>${page.updatedAt}</lastmod>
          <priority>0.8</priority>
        </url>
      `
  })
  return pageItems.join('')
}

Here we're using the pages data to build a string with the required format for each item in our sitemap; finally, we return all the items joined in a single string. This is the string that ends up inserted inside our <urlset> tag.

Now that we have gone through the createSitemap function, let's move on to getPages. For the sake of simplicity, I'm using an already defined Contentful client and importing it from the services/contentful file; take a look at the Contentful JS SDK documentation if you want to go deeper into how to initialize a client and consume data from the CMS.
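
In case you haven't set it up yet, that services/contentful file can be as simple as this (the env variable names here are just an assumption, use whatever you have configured):

// services/contentful.ts — minimal client setup sketch
import { createClient } from 'contentful'

export const client = createClient({
  // Both values come from your Contentful space settings;
  // the env variable names are illustrative
  space: process.env.CONTENTFUL_SPACE_ID ?? '',
  accessToken: process.env.CONTENTFUL_DELIVERY_TOKEN ?? '',
})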

import { client } from 'services/contentful'

export const getPages = async (): Promise<ContentfulPage[]> => {
  // Fetch every entry whose content type is 'page'
  const collection = await client.getEntries({
    content_type: 'page',
  })
  const pages = collection?.items?.length ? collection.items : null

  if (!pages) return []

  // Map each raw Contentful entry to our ContentfulPage type
  return pages.map((page) => ({
    title: page.fields.title,
    slug: page.fields.slug,
    header: page.fields.header,
    blocks: page.fields.blocks,
    footer: page.fields.footer,
    updatedAt: page.sys.updatedAt,
  }))
}

What we're basically doing is using the client to fetch the entries that have the 'page' content type via the getEntries method. If there are any, we store the items in the pages variable and map each page to the ContentfulPage type we defined previously, so we can use them in the createSitemap function. Otherwise, we return an empty array.

Contentful gives us the data in the following format:

{
  "fields": {
    "title": "string",
    "slug": "string",
    "header": { ... },
    ...
  },
  "sys": {
    "updatedAt": "2023-01-04T00:50:34.525Z",
    ...
  }
}

Here, fields contains the entry fields themselves, like the title or slug, while sys is the object where some metadata is defined, like the updatedAt date.

And that's it! With this, we have fetched the pages data from our CMS and created a sitemap.xml page for our headless site. The approach is roughly the same for any other Headless CMS: the basic idea is that pages live as entries in the CMS, and we consume their data via API to build the pages with our favorite language and technology; in this case, JS and Next.js.

This was my first post, so I hope you all liked it (: any feedback will be well received.

I'd also like to know your thoughts: is this how you thought these pages could be built? Have you used a headless CMS before? What would you have done differently? (:

Last but not least, thanks for reading!