Don't index page 1, page 2, ..., page n #970

Open
ploeh opened this issue Dec 1, 2024 · 5 comments

Comments

@ploeh
Owner

ploeh commented Dec 1, 2024

As a standard blog site, ploeh blog has a set of pages that a user may navigate using next and previous buttons. I don't think that the average user would use that feature much, but according to Google Analytics, Pages is ranked number 31 on the site.

The top page on the site (ranked 1) is, perhaps not surprisingly, the 'home page' at https://blog.ploeh.dk.

All that said, I sometimes need to find stuff on the site, and while I often search the source code (i.e. the HTML files), I also occasionally use a site-specific web search. I've noticed that web search results often list, say, page 18 or page 58, simply because the crawler found a particular keyword on that page at that time.

These pages are 'aggregation pages', and articles move around on these pages as they get pushed further into the past. Therefore these search results aren't useful.

What's the best way to tell search engines to not index these pages? robots.txt?

I'm not up to date with modern SEO techniques, so would appreciate input if robots.txt isn't the best option.
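To make the robots.txt option concrete, here is a minimal sketch that checks which URLs a rule would block, using Python's standard `urllib.robotparser`. The `/page` prefix and the example article URL are assumptions for illustration; the real URL scheme of the aggregation pages is not stated in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The "/page" prefix is an assumed URL scheme for
# the aggregation pages, not one confirmed for the actual site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /page
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `robots_txt` allows `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://blog.ploeh.dk/",
        "https://blog.ploeh.dk/page18/",  # hypothetical aggregation page
        "https://blog.ploeh.dk/2024/12/01/example-article/",  # hypothetical
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Note that robots.txt governs crawling rather than indexing as such. Google's documentation describes a `noindex` robots meta tag as the mechanism for keeping a page out of its index, so that may be worth weighing against a `Disallow` rule.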

@arthurgubaidullin
Contributor

Hi @ploeh,

I don't think it's a good idea to disable indexing in your case, as there's no sitemap available for the search engine. Instead, I suggest adding canonical URLs to the meta information of the relevant pages. That should ideally be enough to address the issue.

Here's a helpful link.

Best of luck with your article writing!

@michalfi

michalfi commented Dec 5, 2024

Maybe a wild idea, but a different angle of approach: As your blog has a pretty sustained rate of new content, why don't you change paging to be time-invariant?

E.g. having a page per month. Your homepage could always show posts from the current month plus the last one (to ensure there's always at least a month of content), then a link to "October 2024", etc.

Or, maybe even simpler, just count the pages from the end?
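As a rough illustration of counting pages from the end, the sketch below numbers pages starting from the oldest post, so that a full page never changes once it is filled. The function name and the sample post titles are hypothetical, and this is not how Jekyll's paginator actually works.

```python
def pages_from_oldest(posts_oldest_first, page_size):
    """Split posts into pages numbered from the oldest post.

    Page 1 always holds the oldest `page_size` posts, so a full page keeps
    its contents forever; only the highest-numbered page grows.
    """
    starts = range(0, len(posts_oldest_first), page_size)
    return {
        number: posts_oldest_first[start:start + page_size]
        for number, start in enumerate(starts, start=1)
    }


if __name__ == "__main__":
    posts = ["a", "b", "c", "d", "e"]  # hypothetical posts, oldest first
    before = pages_from_oldest(posts, page_size=2)
    after = pages_from_oldest(posts + ["f"], page_size=2)
    # Pages that were already full are identical before and after the new post.
    print(before[1] == after[1], before[2] == after[2])
```

With this numbering, a search result pointing at a given page number would keep pointing at the same articles, which is the stability that newest-first numbering lacks.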

@ploeh
Owner Author

ploeh commented Dec 5, 2024

> I don't think it's a good idea to disable indexing in your case, as there's no sitemap available for the search engine.

Would adding one help? After all, the archive contains almost everything of interest on the site, apart from the About page and perhaps a few other pages.

I don't think it'd be hard to get Jekyll to generate a similar sitemap file. Be that as it may, Google's documentation seems to indicate that it's not really required:

"If your site's pages are properly linked, Google can usually discover most of your site. Proper linking means that all pages that you deem important can be reached through some form of navigation, be that your site's menu or links that you placed on pages."

It goes on to talk a bit more about large sites, where it can be difficult to ensure that all pages are being linked to, but that's hardly an issue here, as the archive links to everything of interest, again apart from a few special pages. Those, however, are linked from the 'top menu' on each page.

And to be clear, the Archive page is automatically generated by Jekyll.

> Instead, I suggest adding canonical URLs to the meta information of the relevant pages. That should ideally be enough to address the issue.

I apologize for being dense, but even after perusing the link you provided, I don't understand how that helps. It's not that I have alternative URLs pointing to the same page... Could I perhaps ask you to elaborate a bit?

@ploeh
Owner Author

ploeh commented Dec 5, 2024

> Maybe a wild idea, but a different angle of approach: As your blog has a pretty sustained rate of new content, why don't you change paging to be time-invariant?
>
> E.g. having a page per month. Your homepage could always show posts from the current month plus the last one (to ensure there's always at least a month of content), then a link to "October 2024", etc.
>
> Or, maybe even simpler, just count the pages from the end?

Both of these would address the issue, I suppose. Still, Page 18 or Page 44 aren't really useful pages as far as I can tell, even if they were stable.

The more I think about it, the more I'm considering entirely getting rid of all of those extra pages...

@arthurgubaidullin
Contributor

@ploeh I was wrong and the site has a sitemap. I'm sorry about that.

I'm afraid that if you start disallowing indexing, it might break something; I've heard of things like that happening.

I think the least invasive option is to add canonical links. If that doesn't work, then you can think further.

> I apologize for being dense, but even after perusing the link you provided, I don't understand how that helps. It's not that I have alternative URLs pointing to the same page... Could I perhaps ask you to elaborate a bit?

The idea of canonical links is that the search engine knows where the original source is. It will then display links to the original in the results.

If there are no canonical links, Google chooses which link to show on its own. In your case, it chooses wrongly.

Of course, search engines can be wrong.
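For readers unfamiliar with the mechanism, the following sketch shows what a canonical link looks like in a page's `<head>` and how a crawler might read it, using Python's standard `html.parser`. The sample page and its URL are hypothetical, and whether canonical links would actually fix the aggregation-page results is not settled in this thread.

```python
from html.parser import HTMLParser

# Hypothetical page declaring its own canonical URL.
SAMPLE_PAGE = """\
<html>
  <head>
    <title>Example article</title>
    <link rel="canonical" href="https://blog.ploeh.dk/2024/12/01/example-article/">
  </head>
  <body>...</body>
</html>
"""


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in `html`, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


if __name__ == "__main__":
    print(find_canonical(SAMPLE_PAGE))
```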
