Implementing Sitecore Search in real-world projects involves many small lessons that you only discover through hands-on experience. After working across multiple crawlers, sources, schemas, and front-end integrations, I’ve compiled practical takeaways that can save you hours of debugging and rework.

Whether you’re just starting with Sitecore Search or fine-tuning an existing implementation, these insights will help you build a cleaner, more reliable, and better-performing search experience.

1. Ensure All Required Fields Are Present Before Indexing Documents

One of the most common issues during indexing is missing required fields, and the most critical one is id.

For API-based crawling, if your structured or semi-structured documents don’t include an id, indexing will silently skip those entries or fail. This leads to confusing situations where the crawler runs “successfully,” yet no documents appear in the collection.

Tip:
Before ingesting, validate that every document contains:

  • id (mandatory)
  • name
  • url
  • type
  • Any custom attributes your filters or sorting depend on

For API-based crawling, if any record in the response does not include an id, generate one dynamically during extraction and assign it as the document’s id. Prefer a deterministic value, such as a hash of the record’s URL, over a random one: re-crawls then update the same document instead of creating duplicates.
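
For illustration, here is a minimal sketch of a JavaScript document extractor that guarantees an id. The response shape (an items array with title and path fields) is a hypothetical example; adapt it to your API’s actual payload:

function extract(request, response) {
  const records = JSON.parse(response.body).items;

  // Simple deterministic string hash (djb2). Hashing the URL means
  // re-crawls produce the same id and update the existing document.
  function hashId(str) {
    let h = 5381;
    for (let i = 0; i < str.length; i++) {
      h = ((h << 5) + h + str.charCodeAt(i)) >>> 0;
    }
    return h.toString(16);
  }

  return records.map(function (r) {
    return {
      id: r.id || hashId(r.path), // never emit a document without an id
      name: r.title,
      url: r.path,
      type: 'resource',
    };
  });
}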

2. If Sitecore Content Is Missing from the Sitemap, Use the Edge API Instead

Many Sitecore implementations depend on sitemaps for crawling, but not everything you want indexed will appear there.

For example, non-page content items such as resource cards are often excluded from public sitemaps.

When the sitemap doesn’t represent the full content tree, it’s better to use the Sitecore Experience Edge Delivery API as your crawl source. It provides structured JSON, allowing you to extract precisely the fields you need for indexing.

This approach is cleaner, faster, and far more reliable than trying to retrofit everything into the sitemap.
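
As a rough sketch, you can verify what Edge returns before wiring up the crawler. The endpoint and the sc_apikey header are standard for Experience Edge; the content path and field names below are placeholders:

// Run inside an async function or a module with top-level await
const query = `
  query {
    item(path: "/sitecore/content/MySite/Home/Resources", language: "en") {
      children(first: 50) {
        results {
          id
          name
          url { path }
          field(name: "Title") { value }
        }
      }
    }
  }`;

const response = await fetch('https://edge.sitecorecloud.io/api/graphql/v1', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    sc_apikey: process.env.SITECORE_EDGE_API_KEY, // your Edge delivery token
  },
  body: JSON.stringify({ query }),
});
console.log((await response.json()).data.item.children.results);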

3. Use the Cheerio Sandbox to Test Extraction Logic

The Cheerio sandbox is one of the most useful features for building and debugging crawler extraction logic.

It lets you:

  • Test your CSS selectors
  • Validate your extraction rules
  • Preview exactly what the crawler will extract
  • Debug failures before running a full crawl

Using the sandbox early prevents:

  • Broken selectors
  • Missing metadata
  • Incorrect field mappings
  • Duplicate or empty values

If something doesn’t show up in the sandbox, it won’t show up in the search index.
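
As a sketch, a typical document extractor you might paste into the sandbox looks like this (the selectors are examples; verify your own the same way):

function extract(request, response) {
  $ = response.body; // the sandbox exposes the parsed page as a Cheerio object

  return [{
    name: $('h1').first().text().trim(),
    description: $('meta[name="description"]').attr('content'),
    url: request.url,
  }];
}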

4. Use Search-Specific Metadata Tags for Better Indexing

Adding search metadata helps the crawler understand the role and type of each page.

Example:

<meta property="search:pagetype" content="article" />

These tags are especially helpful for large sites with mixed content types such as articles, resources, products, FAQs, or documentation.

Make metadata part of your content authoring guidelines from the start.
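
A document extractor can then map the tag straight to a document attribute. A minimal sketch, with the selector matching the example above:

function extract(request, response) {
  $ = response.body;
  return [{
    type: $('meta[property="search:pagetype"]').attr('content') || 'page',
  }];
}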

5. Vercel-Protected URLs Require Extra Care

If your DEV/TEST/UAT Next.js site on Vercel is protected by password authentication or IP restrictions, crawlers will not be able to access it.

Key point:
For protected Next.js apps, enable Vercel’s Protection Bypass for Automation and send the generated secret (VERCEL_AUTOMATION_BYPASS_SECRET) with each crawl request to let crawlers through.

Make sure to plan for this early—especially in staging environments—so indexing doesn’t break unexpectedly.
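
A quick way to verify the bypass before pointing the crawler at the site (the deployment URL is a placeholder; if your crawler configuration supports custom request headers, send the same header there):

// Run inside an async function or a module with top-level await
const res = await fetch('https://my-app-git-develop-myteam.vercel.app/', {
  headers: {
    // Secret generated when you enable Protection Bypass for Automation
    'x-vercel-protection-bypass': process.env.VERCEL_AUTOMATION_BYPASS_SECRET,
    'x-vercel-set-bypass-cookie': 'true', // optional: persists access via cookie
  },
});
console.log(res.status); // expect 200 instead of a 401 challenge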

6. Duplicate Sources Are Useful—But Use Them Carefully

The Duplicate Source option is convenient when you need multiple sources with similar configurations.

However, be aware that copied-over settings such as triggers, extractors, and schedules may not fit the new source’s purpose and can cause unexpected behavior. Double-check all the settings for the newly duplicated source before running it.

7. Sources Cannot Be Deleted—Only Archived

Sitecore Search does not allow permanent deletion of sources. You can only archive them.

Over time, this can clutter the UI, especially when experimenting during early project phases.
This makes naming conventions even more important (see next point).

8. Establish a Clear Naming Convention for All Sources

With multiple teams and environments, clean names save a lot of time.

Recommended naming pattern:

{Project}-{Environment}-{Type}-{Description}

Examples:

  • MarketingPortal-Prod-HTML-Articles
  • Corporate-Staging-API-Resources
  • DeveloperPortal-Dev-HTML-Docs

Clear naming conventions help significantly with:

  • Debugging
  • Onboarding new developers
  • Avoiding accidental edits
  • Managing many sources in large multi-tenant setups

9. Clean Up Sorting Options When Related Entities Are Removed

If you delete or rename fields in your schema or remove entities entirely, make sure to update your sorting options too.

Otherwise, you’ll run into:

  • Broken sort dropdowns
  • API errors
  • Failed query responses
  • UI crashes in SDK-based search components

Treat sorting options as part of your schema maintenance checklist.
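
A small defensive sketch, assuming your front end keeps its own list of sort options: filter that list against the attributes your schema currently exposes, so a removed field never reaches the dropdown.

const SORT_OPTIONS = [
  { label: 'Newest first', attribute: 'published_date' },
  { label: 'Title A-Z', attribute: 'title' },
];

// schemaAttributes: the attribute names your search schema currently exposes
function availableSortOptions(schemaAttributes) {
  const known = new Set(schemaAttributes);
  return SORT_OPTIONS.filter((opt) => known.has(opt.attribute));
}

// Example: published_date was removed from the schema
console.log(availableSortOptions(['title', 'type', 'url']));
// -> [{ label: 'Title A-Z', attribute: 'title' }]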

10. The API Crawler Does Not Support OAuth

API crawlers currently do not support OAuth. As a result, your indexing strategy must be built around publicly accessible or non-protected endpoints.
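
One possible workaround, sketched here as a Next.js API route, is a thin read-only proxy that handles OAuth server-side and exposes a plain JSON feed the API crawler can reach. Every endpoint and environment variable below is hypothetical:

export default async function handler(req, res) {
  // Exchange client credentials for a token, server-side only
  const tokenRes = await fetch(process.env.OAUTH_TOKEN_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      grant_type: 'client_credentials',
      client_id: process.env.OAUTH_CLIENT_ID,
      client_secret: process.env.OAUTH_CLIENT_SECRET,
    }),
  });
  const { access_token } = await tokenRes.json();

  // Relay the protected data as plain JSON for the crawler
  const dataRes = await fetch(process.env.PROTECTED_API_URL, {
    headers: { Authorization: `Bearer ${access_token}` },
  });
  res.status(200).json(await dataRes.json());
}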

Conclusion

Sitecore Search is a powerful platform, but small details make a huge difference in the overall success of indexing, relevance, and performance. These real-world lessons, from crawler configuration to metadata to naming conventions, can help you avoid common pitfalls and create a smoother search experience for both developers and users.