Problems Indexing an External Web Site with the SharePoint Search Engine

(c) Sean Bordner

From time to time I have to troubleshoot a SharePoint Search Engine that is not properly indexing some external web site.  I thought I’d share the steps I typically take, in case you are trying to figure out what might be causing the same problem.  I find it helpful to isolate where the problem is before doing anything else.  Here’s how I approach it.

The first thing I do is make sure the SharePoint Search Engine is working, period.  Is it at least indexing the SharePoint site properly?  If not, it’s pretty obvious the problem is with the search engine configuration, or at least something on the SharePoint side of the street, including the network it lives on.  However, if it has no problems indexing the SharePoint site, I move on and see whether it is indexing (or will index) any other external web sites properly.
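
If it helps to see that triage order written down, here is a minimal Python sketch.  The function name and the return strings are mine, purely for illustration.  The middle case (SharePoint content indexes fine but no external site does) is not spelled out above; treating it as something on the SharePoint side that affects outbound crawling is my assumption.

```python
def locate_problem(indexes_sharepoint_site: bool,
                   indexes_other_external_sites: bool) -> str:
    """Suggest where to look first when one external site is not being indexed.

    Hypothetical helper: the two flags are answers you gather by hand
    from the crawl logs, not values read from any SharePoint API.
    """
    if not indexes_sharepoint_site:
        # Nothing is being indexed, so the site in question is not the culprit.
        return "search engine configuration or the SharePoint-side network"
    if not indexes_other_external_sites:
        # Assumption: local content works but no external site does.
        return "how SharePoint reaches external sites (proxy, firewall, crawl rules)"
    # Everything else crawls fine, so focus on the one site that fails.
    return "the external web site itself"
```

Calling locate_problem(True, True) points you at the external web site, which is the case the rest of this post deals with.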

Here is where it starts getting good.  If it’s indexing other external web sites just fine, but not the one in question, then I turn my attention to the web site it won’t index.  Many things might be causing the crawl to fail, with JavaScript and Flash navigation being two top suspects.  In fact, I find it helpful to see what the search engine spiders are seeing, and this can be done very easily.  I’ve seen the start page of an external site look perfectly ok from my browser, but contain nothing but bad links from the perspective of a search spider.  This can happen with application-based redirects, which tend to make things difficult for any search engine attempting to crawl.  There are plenty of free tools available to see what a search engine is seeing; one of them is at:  http://www.seochat.com/seo-tools/spider-simulator/
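
To give a rough idea of what a ‘spider view’ amounts to, here is a small Python sketch that pulls the links out of a page’s raw HTML without running any JavaScript.  This is not how the SharePoint crawler or the simulator above is implemented; it is only an approximation, and the names LinkCollector and spider_view and the three bucket labels are my own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, as a crawler that ignores scripts would."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def spider_view(html: str, base_url: str) -> dict:
    """Sort the links in `html` into internal, external and not-crawlable."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    view = {"internal": [], "external": [], "not_crawlable": []}
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            # e.g. javascript: links, which only work inside a browser
            view["not_crawlable"].append(link)
        elif parsed.netloc == host:
            view["internal"].append(link)
        else:
            view["external"].append(link)
    return view
```

Feed it the HTML of your start page along with the page’s URL.  Navigation that lives only in an onclick handler has no href at all, so it never shows up in the result, which is roughly why script-driven menus look empty to a spider.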

The ‘spider view’ of the page you have configured your SharePoint Search Engine content source to start from is the best place to start.  Paste the URL into the spider simulator and let it rip.  If it can’t crawl the page, you have found the problem.  If it can crawl the page, carefully check the results.  Examine the internal links as well as the external links.  This is when you will find out if what your browser sees is the same as what the spider sees.  Copy a link out, paste it into your browser address bar, and see if it loads.  If you get a 404, the problem is that the links on the start page rely on some type of black arts handled by the application, which prevents search engines from crawling properly.
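
If the start page has more links than you care to paste into a browser one at a time, the same check can be scripted.  Below is a minimal Python sketch using only the standard library; link_status and broken_links are names I made up, and the User-Agent value is just a placeholder, not the string the SharePoint crawler sends.

```python
import http.client
import urllib.error
import urllib.request


def link_status(url: str, timeout: float = 10.0) -> str:
    """Return the HTTP status code for `url` as text, or a short error note."""
    try:
        request = urllib.request.Request(
            url, headers={"User-Agent": "link-check-sketch"}  # placeholder value
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # 404, 401, 403 and friends arrive here as exceptions.
        return str(err.code)
    except (OSError, ValueError, http.client.HTTPException) as err:
        # DNS failures, timeouts, malformed URLs and similar.
        return f"unreachable: {err}"


def broken_links(urls, check=link_status):
    """Return (url, status) pairs for every link that did not come back as 200."""
    results = [(url, check(url)) for url in urls]
    return [(url, status) for url, status in results if status != "200"]
```

Pass broken_links the list of links you copied out of the spider view.  Note that urlopen follows redirects, so the status you get back is for the final page; a 404 here matches the 404 you would see in the browser, and a 401 or 403 hints that security on the site is getting in the way.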

By now I have usually pinpointed the problem and can begin taking steps to resolve it.  Resolution steps include pointing to a different start page from within my content source, rewriting the start page with clean HTML, or ensuring the security on the external site is not preventing a crawl.  You get the point, but the important thing is that we quickly isolated where the problem is before we started fixing it.

About Sean Bordner

CEO, Solution Architect, Co-Author of SharePoint for Nonprofits, Contributing Author at NothingButSharePoint.com, MCT, MCTS, MCSD, MCP, MCAD