How to Find All Existing and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Discover every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list and then deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these constraints mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org saw it, there's a good chance Google did, too.
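
If you'd rather pull this list programmatically, Archive.org also exposes a CDX API. Here's a minimal sketch in Python using the requests library; the domain is a placeholder, and the filters you want may differ:

import requests

# Query the Wayback Machine CDX API for archived URLs on a domain.
# "example.com" is a placeholder; swap in the site you're auditing.
response = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",       # include subdomains
        "fl": "original",            # return only the original URL field
        "collapse": "urlkey",        # merge repeated captures of the same page
        "filter": "statuscode:200",  # skip captures of redirects and errors
        "output": "text",
    },
    timeout=120,
)
response.raise_for_status()

urls = sorted(set(response.text.splitlines()))
print(f"Retrieved {len(urls)} unique archived URLs")

Collapsing on urlkey merges repeated captures of the same page, which keeps the list manageable and sidesteps the missing export button entirely.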

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a large website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets; a rough sketch of that approach follows below.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
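
For reference, here's a rough sketch of querying the Moz Links API from Python. The credentials, request fields, and response field names below are assumptions based on the v2 links endpoint; verify them against Moz's current API documentation before relying on this:

import requests

# Rough sketch of pulling link target URLs from the Moz Links API (v2).
# ACCESS_ID / SECRET_KEY and the response field names are assumptions;
# check Moz's current API docs before depending on them.
ACCESS_ID = "your-access-id"
SECRET_KEY = "your-secret-key"

response = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),
    json={
        "target": "example.com/",       # placeholder domain
        "target_scope": "root_domain",  # links pointing anywhere on the site
        "limit": 50,
    },
    timeout=60,
)
response.raise_for_status()

# Each returned link record should include the page on your site it points to.
pages = {link["target"]["page"] for link in response.json().get("results", [])}
print(sorted(pages))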

Google Search Console
Google Search Console offers several valuable tools for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
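
Here's a minimal sketch of paging through the Search Analytics API in Python. It assumes you've already completed an OAuth flow and saved authorized credentials; the property URL and date range are placeholders:

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Sketch: pull every page with search impressions via the Search Analytics API.
# Assumes "token.json" holds previously authorized OAuth credentials for the
# property; the siteUrl and dates are placeholders.
creds = Credentials.from_authorized_user_file("token.json")
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    result = service.searchanalytics().query(
        siteUrl="https://example.com/",
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,   # the API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = result.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(pages)} pages with search impressions")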

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for gathering URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
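
If you'd rather skip the UI export altogether, the GA4 Data API can return page paths directly. A minimal sketch using the google-analytics-data Python client; the property ID is a placeholder, and authentication is assumed to be configured via a service account:

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest
)

# Sketch: pull page paths from the GA4 Data API. "properties/123456789" is a
# placeholder property ID; credentials are assumed to come from the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths from GA4")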

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path queried by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be massive, so many sites only keep the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; a minimal parsing sketch follows this list.
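
As a starting point, here's a short Python sketch that extracts unique request paths from access logs in the common Apache/Nginx combined format. The log directory is a placeholder, and real-world logs may need sturdier parsing:

import gzip
import re
from pathlib import Path

# Pull unique request paths out of access logs in the "combined" format.
# "/var/log/nginx" is a placeholder for wherever your logs live.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
for log_file in Path("/var/log/nginx").glob("access.log*"):
    opener = gzip.open if log_file.suffix == ".gz" else open
    with opener(log_file, "rt", errors="replace") as fh:
        for line in fh:
            match = REQUEST_RE.search(line)
            if match:
                # Strip query strings so /page?x=1 and /page collapse together.
                paths.add(match.group(1).split("?")[0])

print(f"{len(paths)} unique paths seen in the logs")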
Combine, and good luck
Once you've collected URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
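
For larger sites, a short pandas script can handle the combining and deduplication. A sketch, assuming you've saved each source's export as a one-column CSV (the filenames are placeholders):

import pandas as pd

# Combine the URL lists gathered above and deduplicate them.
# The CSV filenames are placeholders for whatever exports you saved.
sources = ["archive_org.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]
frames = [pd.read_csv(name, names=["url"], header=None) for name in sources]
urls = pd.concat(frames, ignore_index=True)

# Normalize formatting so near-duplicates collapse: trim whitespace and
# trailing slashes, then drop exact duplicates.
urls["url"] = urls["url"].str.strip().str.replace(r"/$", "", regex=True)
urls = urls.drop_duplicates().sort_values("url")

urls.to_csv("all_urls.csv", index=False)
print(f"{len(urls)} unique URLs written to all_urls.csv")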

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
