Don't index page 1, page 2, ..., page n #970

ploeh · 2024-12-01T13:55:22Z

As a standard blog site, ploeh blog has a set of pages that a user may navigate using next and previous buttons. I don't think that the average user would use that feature much, but according to Google Analytics, Pages is ranked number 31 on the site.

The top page on the site (ranked 1) is, perhaps not surprisingly, the 'home page' at https://blog.ploeh.dk.

All that said, I sometimes need to find stuff on the site, and while I often search the source code (i.e. the HTML files), I also occasionally use a site-specific web search. I've noticed that web search results often list, say, page 18 or page 58, simply because the crawler found a particular keyword on that page at that time.

These pages are 'aggregation pages', and articles move around on these pages as they get pushed further into the past. Therefore these search results aren't useful.

What's the best way to tell search engines to not index these pages? robots.txt?

I'm not up to date with modern SEO techniques, so would appreciate input if robots.txt isn't the best option.
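For reference, here is a minimal sketch of what a robots.txt rule for these pages might look like, checked with Python's standard `urllib.robotparser`. The `/pageN/` URL shape, the `Disallow: /page` rule, and the example article URL are assumptions for illustration, not taken from the site. As far as I know, robots.txt governs crawling rather than indexing, so a blocked URL can still show up in results if other pages link to it; a `noindex` robots meta tag is the usual alternative, and it only works if the page stays crawlable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: assumes the aggregation pages live at /page2/, /page3/, ...
# Note that this is a plain prefix match, so it would also block any other
# path that happens to start with "/page".
ROBOTS_TXT = """\
User-agent: *
Disallow: /page
"""


def may_crawl(url: str) -> bool:
    """Return True if a generic crawler may fetch the URL under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    for url in (
        "https://blog.ploeh.dk/",
        "https://blog.ploeh.dk/page18/",
        "https://blog.ploeh.dk/2024/12/01/example-article/",  # made-up article URL
    ):
        print(url, may_crawl(url))
```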

arthurgubaidullin · 2024-12-05T11:06:31Z

Hi @ploeh,

I don't think it's a good idea to disable indexing in your case, as there's no sitemap available for the search engine. Instead, I suggest adding canonical URLs to the meta information of the relevant pages. That should ideally be enough to address the issue.

Here's a helpful link.

Best of luck with your article writing!

michalfi · 2024-12-05T13:00:44Z

Maybe a wild idea, but a different angle of approach: as your blog has a pretty sustained rate of new content, why don't you change paging to be time-invariant?

E.g. having a page per month. Your homepage could always show posts from the current month plus the last one (to ensure there's always at least a month of content), then a link to "October 2024", etc.

Or, maybe even simpler, just count the pages from the end?
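The "count from the end" idea could be sketched roughly as follows. This is only an illustration, not how Jekyll's paginator works; `stable_pages` and the sample post names are made up. Page 1 holds the oldest posts, so a full page never changes once it fills up, and only the highest-numbered page grows.

```python
def stable_pages(posts_oldest_first, page_size):
    """Split posts into pages numbered from the oldest end.

    Page 1 always contains the oldest `page_size` posts, page 2 the next
    batch, and so on. Publishing a new post only touches the last page.
    """
    starts = range(0, len(posts_oldest_first), page_size)
    return {
        number: posts_oldest_first[start:start + page_size]
        for number, start in enumerate(starts, start=1)
    }


if __name__ == "__main__":
    posts = ["a", "b", "c", "d", "e"]
    before = stable_pages(posts, page_size=2)
    after = stable_pages(posts + ["f", "g"], page_size=2)
    # Full pages keep their content after new posts are published.
    print(before[1] == after[1], before[2] == after[2])
```

The home page would then show the newest posts separately, and a search hit on, say, page 18 would keep pointing at the same articles.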

ploeh · 2024-12-05T17:45:09Z

> I don't think it's a good idea to disable indexing in your case, as there's no sitemap available for the search engine.

Would adding one help? After all, the archive contains almost everything of interest on the site, apart from the About page and perhaps a few other pages.

I don't think it'd be hard to get Jekyll to generate a similar sitemap file. Be that as it may, Google's documentation seems to indicate that it's not really required:

"If your site's pages are properly linked, Google can usually discover most of your site. Proper linking means that all pages that you deem important can be reached through some form of navigation, be that your site's menu or links that you placed on pages."

It goes on to talk a bit more about large sites, where it can be difficult to ensure that all pages are being linked to, but that's hardly an issue here, as the archive links to everything of interest, again apart from a few special pages. Those, however, are linked from the 'top menu' on each page.

And to be clear, the Archive page is automatically generated by Jekyll.
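To make the sitemap idea concrete, here is a rough sketch of a sitemap that lists articles but leaves out the aggregation pages. It is plain Python rather than Jekyll, and the `/pageN/` path pattern and the example URLs are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed URL shape of the aggregation pages: /page2/, /page18/, ...
AGGREGATION_PATH = re.compile(r"^/page\d+/?$")


def build_sitemap(urls):
    """Return sitemap XML for the given URLs, skipping aggregation pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        if AGGREGATION_PATH.match(urlparse(url).path):
            continue  # leave 'page N' out of the sitemap
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        "https://blog.ploeh.dk/",
        "https://blog.ploeh.dk/page18/",
        "https://blog.ploeh.dk/2024/12/01/example-article/",  # made-up URL
    ]))
```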

> Instead, I suggest adding canonical URLs to the meta information of the relevant pages. That should ideally be enough to address the issue.

I apologize for being dense, but even after perusing the link you provided, I don't understand how that helps. It's not that I have alternative URLs pointing to the same page... Could I perhaps ask you to elaborate a bit?

ploeh · 2024-12-05T17:48:38Z

> Maybe a wild idea, but a different angle of approach: as your blog has a pretty sustained rate of new content, why don't you change paging to be time-invariant?
>
> E.g. having a page per month. Your homepage could always show posts from the current month plus the last one (to ensure there's always at least a month of content), then a link to "October 2024", etc.
>
> Or, maybe even simpler, just count the pages from the end?

Both of these would address the issue, I suppose. Still, Page 18 or Page 44 aren't really useful pages as far as I can tell, even if they were stable.

The more I think about it, the more I'm considering entirely getting rid of all of those extra pages...

arthurgubaidullin · 2024-12-05T20:14:07Z

@ploeh I was wrong and the site has a sitemap. I'm sorry about that.

I'm afraid that if you start disallowing indexing, it might break something; I've heard of things like that.

I think the least invasive option is to add canonical links. If that doesn't work, then you can think further.

> I apologize for being dense, but even after perusing the link you provided, I don't understand how that helps. It's not that I have alternative URLs pointing to the same page... Could I perhaps ask you to elaborate a bit?

The idea of canonical links is to tell the search engine where the original source is. It will then display links to the original in the results.

If there are no canonical links, Google chooses which URL to show by itself. In your case, it chooses wrong.

Of course, search engines can be wrong.
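As a small illustration of the canonical link described above: it is a `<link rel="canonical">` element in the page's `<head>` that points at the preferred URL for the content. The sketch below only renders that tag; the `canonical_link` helper and the article URL are hypothetical, and in Jekyll the equivalent would live in a layout template rather than in Python.

```python
from html import escape


def canonical_link(url: str) -> str:
    """Render the <link> element that declares a page's canonical URL."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


if __name__ == "__main__":
    # Made-up article URL, used only to show the shape of the output.
    print(canonical_link("https://blog.ploeh.dk/2024/12/01/example-article/"))
```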

ploeh added the enhancement and help wanted labels on Dec 1, 2024

arthurgubaidullin mentioned this issue Dec 5, 2024

Adds canonical URLs to all post pages #971

Merged
