Knowing the Sitecore content: Opening up Solr's maximum returned results

18. April 2016 06:10 by Mark Servais in Sitecore

So I got a gentle reminder from a recent Sitecore project that Solr returns a maximum of 500 results by default, per the ContentSearch.SearchMaxResults setting.

As you can guess, I needed more than 500 results to be retrieved from the index.

So it was as easy as modifying the number and I was in a happier place.

But being me, always looking at the periphery, what if I wanted all results for everything? In older versions of Sitecore you were able to use '0' as the value and receive the whole collection from the index for your search.

Well, that was in older versions of Sitecore, so that didn't work. It actually gives me 0 results. Really useful, by the way, having a search forced to return zero results.

Regardless, through a brief Slack chat conversation with Mark Cassidy, I learned that the new unlimited-results value is actually an empty string (""). That did seem to bring everything back, so great - curiosity extinguished - right?

Yes, but that flame was sparked again by a comment from Richard Seal: "... just a caveat when doing that. That will send int.MaxValue to Solr as the max rows." For those that don't have that memorized - that value is 2,147,483,647 for an int32, and for an int64 it is 9,223,372,036,854,775,807. Yes, I had to look those up as well.

In our case, int.MaxValue is the int32 value. The int64 number was impressive to me, so I threw it in to both educate and confuse.
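For anyone who would rather compute those two ceilings than look them up, here is a minimal Python sketch. The constant names are just labels chosen for this example; nothing here comes from Sitecore or Solr.

```python
# Largest values a signed 32-bit and a signed 64-bit integer can hold.
# One bit is used for the sign, so the maximum is 2^(bits - 1) - 1.
INT32_MAX = 2**31 - 1  # what int.MaxValue is in .NET
INT64_MAX = 2**63 - 1  # what long.MaxValue is in .NET

print(f"{INT32_MAX:,}")  # 2,147,483,647
print(f"{INT64_MAX:,}")  # 9,223,372,036,854,775,807
```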

Disclaimer: I have not used a decompilation tool to determine the exact interfaces from Sitecore to Solr for this blog, so I do not know how Sitecore addresses things like modification of FETCH-SIZE, etc. with Solr. Those are tasks I do when I need to know, and for this - I'm on a don't-need-to-know basis.

So I could be asking for up to 2 billion records back. That can be a bit taxing.

In my original scenario, I increased SearchMaxResults to 15000. Why? Because the feed I was generating from the index had 3,987 (or something like that) pieces of content. Well over the 500.

But then why 15000? Simply for growth of content. New pages, articles, and events will both accumulate and drop out of content. It will take some time for them to hit that ceiling, and when they do - we can change the number.
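That kind of "current count plus room to grow" sizing can be sketched in a few lines of Python. This is only an illustration: the function name, the growth factor of 3 and the rounding step of 5000 are assumptions picked for the example, not how the 15000 was actually derived.

```python
import math


def suggested_max_results(current_count, growth_factor=3.0, round_to=5000):
    """Hypothetical helper: pad the current item count for growth,
    then round up to a tidy number for the SearchMaxResults setting."""
    if current_count < 0 or growth_factor < 1 or round_to <= 0:
        raise ValueError("count must be >= 0, growth >= 1, rounding step > 0")
    padded = current_count * growth_factor
    # Round up to the next multiple of round_to.
    return math.ceil(padded / round_to) * round_to


print(suggested_max_results(3987))  # 15000
```

The point of doing it this way is that the ceiling is tied to the content you actually have, so it can be recalculated when the content count changes instead of jumping straight to "unlimited".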

Going after 2 billion results would still have surfaced only the 3,987 pieces of content, but it seems that asking this way might add some overhead for those 3,987 records.

Since Solr and Lucene are roommates of sorts (Solr being built on top of Lucene), it is a bit interesting. As the amount of data in the index increases, I can assume that the algorithm will need an increasing amount of resources to parse records. This is a pretty safe assumption, I think. If the algorithm that returns result sets works the same way as the algorithm for numeric transformations described below, then the number of steps should increase as well.

Ref: Schindler, U., Diepenbroek, M., 2008. Generic XML-based Framework for Metadata Portals. Computers & Geosciences 34 (12), 1947-1955. doi:10.1016/j.cageo.2008.02.023

Quote: Because Apache Lucene is a full-text search engine and not a conventional database, it cannot handle numerical ranges (e.g., field value is inside user defined bounds, even dates are numerical values). We have developed an extension to Apache Lucene that stores the numerical values in a special string-encoded format with variable precision (all numerical values like doubles, longs, floats, and ints are converted to lexicographic sortable string representations and stored with different precisions (for a more detailed description of how the values are stored, see NumericUtils). A range is then divided recursively into multiple intervals for searching: The center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched more exactly. This reduces the number of terms dramatically.

For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the lowest precision. Overall, a range could consist of a theoretical maximum of 7*255*2 + 255 = 3825 distinct terms (when there is a term for every distinct value of an 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used because it would always be possible to reduce the full 256 values to one term with degraded precision). In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records and a uniform value distribution).
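The arithmetic in that quote is easy to check. The Python sketch below reads 7*255*2 + 255 as: the 7 higher-precision levels each contribute up to 255 terms on both range boundaries, and the lowest-precision level contributes up to 255 terms for the center of the range. That reading is an interpretation of the quote, and the function and parameter names are made up for this example.

```python
def max_range_terms(precision_levels=8, values_per_level=256):
    """Theoretical maximum number of terms for one range, following the
    7*255*2 + 255 formula from the quote (names here are hypothetical)."""
    # 256 values at one level can always be collapsed into a single term
    # at the next lower precision, so at most 255 are needed per level.
    usable = values_per_level - 1
    # All levels except the lowest are used for the two range boundaries.
    boundary_levels = precision_levels - 1
    boundary_terms = boundary_levels * usable * 2
    # The lowest-precision level covers the center of the range.
    center_terms = usable
    return boundary_terms + center_terms


print(max_range_terms())  # 3825
```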

Lots of words, isn't it? I think it is just easier (at least for my math skills) to base calculations on the content you have and the number of results you expect to return. Breaking up indexes, potentially by ontologies and related categorizations of relationships, will also help limit the size of the result set being returned.

Best said by an old man I once had the pleasure of having a conversation with - "Take only what you really need".

