vBulletin Mods: The Official vBulletin Modifications Site
https://www.vbulletin.org/forum/showthread.php?t=127868 |
2 Attachment(s)
Sphinx Implementation for vBulletin:
Version 0.1
Hooray! Just sharing as usual, let the discussions begin (in b4 TECK "MINE IS BETTER").
Only tested with Sphinx-0.9.8-rc2 (r1234; Mar 29, 2008).
If you are upgrading from my old tutorial, back up your search.php (you know, just in case you need the old hacked-up version again) and restore the original from the zip/tar. No more file modifications!
http://sphinxsearch.com/downloads.html
Tested on 3.6.10. It should work on 3.7 if you modify /*insert query*/ on Line 522 (I removed the 'prefixchoice' field because it doesn't exist in 3.6).
No support for tags/thread prefix yet, because I don't have access to a 3.7 installation at the moment.
Similar threads is also being worked on.
This is an alpha release for some feedback; hopefully it will be production ready soon :p
I assume you already have Sphinx up and running... see the attached sphinx.conf.example for a minimalistic setup.
Installation notes are inside search_sphinx.php.
Well yeah, enjoy. And PM me if you need help.
The old post is here: http://www.vbulletin.org/forum/showp...&postcount=387
The Good:
The Bad:
The Ugly:
*The Infamous Post Sorting Quirk
The question here is: when you "Search Entire Posts" and "Show Results as Threads", do you want your threads sorted by:
Our Sphinx setup does not have first post and last post dateline stored in its post index (and it would be pretty much useless too), so the first two options are not available.
vBulletin offers a function called "sort_search_items()" (search.php:633 in 3.7) which could, in theory, be used to sort the threads by last post dateline. It does not fix the problem though.
Let's assume we set maxresults to 5. We are searching threads for "funny". We have 7 threads created today (thread | matching post):
1. Thread "Cows", Created 08:00, Last Post 17:00 | "Funny Cows", Created 09:00
2. Thread "Cats", Created 09:00, Last Post 14:00 | "Funny Cats", Created 14:00
3. Thread "Dogs", Created 10:00, Last Post 12:00 | "Funny Dogs", Created 11:00
4. Thread "Mice", Created 11:00, Last Post 15:00 | "Funny Mice", Created 13:00
5. Thread "Rats", Created 12:00, Last Post 13:00 | "Funny Rats", Created 12:00
6. Thread "Eels", Created 13:00, Last Post 19:00 | "Funny Eels", Created 18:00
7. Thread "Fish", Created 14:00, Last Post 18:00 | "Funny Fish", Created 17:00
Do we want to show threads 6, 7, 2, 4, 5 (Sphinx)? Or do we want to show threads 6, 7, 1, 4, 2 (vB)?
vBulletin finds all 7 posts, orders them by last post descending, and grabs the top 5. Sphinx finds the newest 5 matching posts and then returns the associated threads.
Reordering search results with "sort_search_items()" does not fix the problem because there might be older threads with very recent replies that Sphinx won't even consider.
Let's consider an 8th thread:
8. Thread "Bees", Created 2002, Last Post 20:00 | "Funny Bees", Created 2002
vBulletin will list this one on top, Sphinx will not consider it. So even re-sorting the search items will not make this thread appear. |
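The two orderings can be reproduced with a small sketch. It is written in Python purely for illustration (the hack itself is PHP) and uses the thread data from the example above. The names Thread, sphinx_order, vb_order and resort_by_last_post are made up for this sketch and are not part of vBulletin or Sphinx. Times are stored as whole hours, and -1 is an assumed stand-in for "sometime in 2002".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thread:
    number: int        # position in the list from the post
    title: str
    last_post: int     # hour of the newest post in the thread
    match_posted: int  # hour of the matching "Funny ..." post

THREADS = [
    Thread(1, "Cows", 17, 9),
    Thread(2, "Cats", 14, 14),
    Thread(3, "Dogs", 12, 11),
    Thread(4, "Mice", 15, 13),
    Thread(5, "Rats", 13, 12),
    Thread(6, "Eels", 19, 18),
    Thread(7, "Fish", 18, 17),
]
# Thread 8: old thread, old matching post (-1 = "in 2002"), fresh reply at 20:00.
BEES = Thread(8, "Bees", 20, -1)

def sphinx_order(threads, maxresults):
    """Take the newest matching posts first, then map them to their threads."""
    newest = sorted(threads, key=lambda t: t.match_posted, reverse=True)
    return [t.number for t in newest[:maxresults]]

def vb_order(threads, maxresults):
    """Consider every matching thread, then sort by last post descending."""
    by_last = sorted(threads, key=lambda t: t.last_post, reverse=True)
    return [t.number for t in by_last[:maxresults]]

def resort_by_last_post(threads, numbers):
    """Re-sort an already truncated result set by last post."""
    lookup = {t.number: t for t in threads}
    picked = sorted((lookup[n] for n in numbers),
                    key=lambda t: t.last_post, reverse=True)
    return [t.number for t in picked]

print(sphinx_order(THREADS, 5))   # [6, 7, 2, 4, 5]
print(vb_order(THREADS, 5))       # [6, 7, 1, 4, 2]
# Re-sorting the Sphinx result cannot bring back a thread that was never fetched:
print(resort_by_last_post(THREADS, sphinx_order(THREADS, 5)))
print(vb_order(THREADS + [BEES], 5), sphinx_order(THREADS + [BEES], 5))
```

Re-sorting only reorders the five threads Sphinx already returned, so "Cows" (and later "Bees") stays missing no matter how the truncated set is sorted.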
Nice find! I'll play around with it once I get some time.
|
Obviously the only options you will have on the advanced search page are:
Key Words:
Search In: Thread Titles/Posts
Sort Results by: Relevancy, Date Asc, Date Desc
Search in Forums:
And I guess searching by username will still be the built in way. (As in, without a search term, just list his posts.) Gonna try to hack that up, when I make it work I'll release it I hope :)
But the fact you can index 4k posts/second is absolutely insane, and that was with 800 users online... :D |
Hmm, yes, that looks interesting, bookmarked for later. :)
|
Also means I can remove that 400mb fulltext index from post table making MySQL even faster.
The right tool for the job. :) Filtering by forumid already works, so does sorting by date. And it still says 0.000003 seconds. Incredible. |
Hmm good timing. I got on here today to see if there were any other resources out there for searching and vbulletin and this showed up in the results.
We've had soooo much trouble keeping our search up. We're using the fulltext search right now with the search on its own server on tables reduced in size. Huge pain and it still doesn't return some results. Keep us updated please, this looks cool. |
Awesome!
If I get some time tonight (probably not!) I will download Sphinx and give it a look. What kind of data do you have to test this with? We're looking at about 9 million records on our live post table (millions more archived). I'm very curious how well this would hold up to that amount of data. |
Can I get a peek at your sphinx.conf?
|
wow, you are fast! thanks. I'm tossing it 24 million posts to see what it does :)
|
*waits for post index to build*
So far so good. It ripped through 1,652,726 thread titles in about 2 minutes, on a machine replicating a very active forum, and one running a test upgrade from 3.5.5 to 3.6.1 :) So far, I'm happy! I think with a little work this could be amazing. The api is a little unfriendly when it comes to errors and whatnot, but with some polishing and figuring out the targeting of searches and by name, we're good to go. Orban you are a hero among men!
Just FYI:
thread table:
collected 1658976 docs, 48.1 MB
sorted 5.1 Mhits, 100.0% done
total 1658976 docs, 48070959 bytes
total 148.426 sec, 323872.56 bytes/sec, 11177.16 docs/sec
post table:
collected 8860446 docs, 1416.9 MB
sorted 140.2 Mhits, 100.0% done
total 8860446 docs, 1416892676 bytes
total 3168.862 sec, 447129.84 bytes/sec, 2796.10 docs/sec
That is with a minimum word length of 4 and no stopwords. |
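The rate figures in the indexer summary are just the totals divided by the elapsed time. A quick check in Python, using only the numbers from the output above (the function name rates is made up for this sketch):

```python
def rates(docs, total_bytes, seconds):
    """Return (docs per second, bytes per second) for one indexer run."""
    return docs / seconds, total_bytes / seconds

# Totals copied from the indexer output in the post.
thread_docs, thread_bytes = rates(1658976, 48070959, 148.426)
post_docs, post_bytes = rates(8860446, 1416892676, 3168.862)

print(f"thread index: {thread_docs:.2f} docs/sec, {thread_bytes:.2f} bytes/sec")
print(f"post index:   {post_docs:.2f} docs/sec, {post_bytes:.2f} bytes/sec")
```

The results agree with the reported rates to within rounding of the elapsed time. The post index is slower per document simply because post bodies are far larger than thread titles.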
Quote:
I'm headed out fishing, but I'm going to play with your updated changes later :) |
Weird....
|
Yeah, and I recreated it a few times (with stopwords, diff min word length, etc). Not exactly sure why yet.
|
Maybe some posts are too short? Like no words longer than 4 characters?
But then again that'd never be 2/3 of the posts. I really have no idea :( |
Sphinx 0.9.7 will feature an arbitrary number of group id's, so it would be possible to handle "search this thread" and search by user in Sphinx.
Meanwhile, it's easy to hack Sphinx to support 3 groupid columns instead of one by some copy-pasting. Naturally, the index size is larger with additional group id's, 5GB for 6mln post database. We've been running it for some months already with great success. |
Mind sharing the patch and maybe your implementation in vB? Or at least outlining it?
Would be nice...! |
orban, what kind of changes do i need to make to my search.php to have it search both the main and delta index?
Prior to setting up the delta index on my end, I noticed that I could search for words in post bodies and not return results, but if I look in my query.log I would see many many results. |
mmmmm
you don't have to modify search.php, create a fake index that contains the two other indices. |
hm ok, i think my config just might be a bit goofy. On my dev board, I created a new post after creating all 4 indexes. Anyway, my test post had a made up word in it, and after I posted I reran the delta updates, saw them pick up one doc, but I don't get any results returned if I use the "search" tool with sphinx.
I'm going to double check my config now. |
weird....
make sure the indexes get created (check the data files) |
Quote:
|
sphinx
|
Hm. this is very strange. I have verified that my config is the same as yours (minus the names of the indexes), and have emptied my sphinx_counter table, nuked all my indexes, and rebuilt.
[[email protected] var]# /home/httpd/sphinx/bin/search -c /home/httpd/sphinx/etc/sphinx.conf purple
Sphinx 0.9.6
Copyright (c) 2001-2006, Andrew Aksyonoff
- loaded 591 stopwords from '/home/httpd/sphinx/etc/sphinx.stopwords'
index 'vbpostidx': query 'purple ': returned 0 matches of 0 total in 0.000 sec
- loaded 591 stopwords from '/home/httpd/sphinx/etc/sphinx.stopwords'
index 'vbpostdeltaidx': query 'purple ': returned 0 matches of 0 total in 0.000 sec
- loaded 591 stopwords from '/home/httpd/sphinx/etc/sphinx.stopwords'
index 'vbthreadidx': query 'purple ': returned 0 matches of 0 total in 0.000 sec
- loaded 591 stopwords from '/home/httpd/sphinx/etc/sphinx.stopwords'
index 'vbthreaddeltaidx': query 'purple ': returned 0 matches of 0 total in 0.000 sec
I broke something, but I don't know what :)
Ah, I found the problem I think. For whatever reason, on my initial index, despite having used --rotate, it is leaving *new* index files in my var dir:
[[email protected] var]# ls -la *new*
-rw-r--r-- 1 root root 1356935444 Oct 2 13:39 vbpost.new.spd
-rw-r--r-- 1 root root 10644727 Oct 2 13:39 vbpost.new.spi
-rw-r--r-- 1 root root 54322284 Oct 2 13:42 vbthread.new.spd
-rw-r--r-- 1 root root 879893 Oct 2 13:42 vbthread.new.spi
Sphinx won't search against these, but I'm not sure why they didn't roll over. |
Yeah I don't have .new. ones, just .old. ones.
Permissions? |
I believe it is a bug in Sphinx. If you start searchd with no preexisting indexes and then index with --rotate, it won't rotate. The solution is to stop searchd, nuke everything, index, then start searchd.
It took me a while to figure it out, I'm not sure why it isn't smart enough to see that there aren't preexisting indexes when searchd tries to rotate. |
Oh :(
Yeah I did the first index without searchd started. Report it so it can be fixed :) |
I was just curious is your search_sphinx.php posted a few threads back the most current one? or have you made more adjustments?
|
Just an FYI, make sure you limit access to that search on your dev boxes if you don't potentially want people searching for info in your private forums :)
I guess now we just get to wait patiently for 0.9.7 to come out... |
What you mean? Other users with ssh access?
Yeah or kmike can share his patch http://www.vbulletin.org/forum/showp...2&postcount=21 :( |
Quote:
|
search can be called by anyone with server access on the command line
so he gets access to all your indexes and thus to all your posts. So if you give a designer ssh access to upload stuff, he can basically read your private forums. |
Quote:
|
Neither here, I'm the only one with access.
|
Quote:
|
They aren't, but all posts/threads are filtered again on the results page.
|
1 Attachment(s)
Attached is the patch for Sphinx 0.9.5 which adds two more group columns.
You'll have to have something like this in your sphinx.conf:
The part with IF(post.userid=0) is needed because Sphinx doesn't like zero column values (you'll have them if a board has some posts by guests or deleted users), so we replace them with an arbitrarily high number (99999999) which is guaranteed not to occur in the real data.
sphinxapi.php supports two more grouping functions: SetGroups2(array) and SetGroups3(array). So search.php will have to call $sphinx->SetGroups2($userids) when searching by user(s), where $userids is an array containing their userids. Similarly, $sphinx->SetGroups3(array($searchthreadid)) will be called when searching in a thread. |
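If zero userids are stored as 99999999 in the index, the search side presumably has to apply the same substitution before building the user filter, otherwise guest posts could never be matched. A minimal sketch in Python for illustration only: the helper names are hypothetical, and the assumption that the query side needs the mapping is mine, not something stated in the patch.

```python
GUEST_GROUP_ID = 99999999  # sentinel value taken from the post above

def to_group_id(userid):
    """Map a vBulletin userid to the value stored in the Sphinx group column."""
    return GUEST_GROUP_ID if userid == 0 else userid

def user_filter(userids):
    """Build a de-duplicated group list for a search-by-user request (hypothetical helper)."""
    return sorted({to_group_id(u) for u in userids})

print(user_filter([0, 42, 7, 42]))  # [7, 42, 99999999]
```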
Thank you. Gonna try this out :)
|
Quote:
|
Thanks Orban (and others) for this solution.
0.9.6 is out, it fixes the following issues: - added support for empty indexes (solves the previous issues we had with indexes) - added support for multiple sql_query_pre/post/post_index - fixed timestamp ranges filter in "match any" mode - fixed configure issues with --without-mysql and --with-pgsql options - fixed building on Solaris 9 |
Yes, but the patch for more than one group won't work for this...
I'm trying to get a snapshot of 0.9.7.... |
Unfortunately, 0.9.7-dev is still too buggy to be used in production.
|
Quote:
|
Quote:
*edit* Ah, found it. You're mistaken - 4096 is the limit on the number of groupid's listed in one request. A groupid is an unsigned 32-bit integer AFAIK, so the limit of about 4 billion distinct values should be enough for everybody (the famous last words)
Quote:
|
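The two limits mentioned above are easy to mix up: one caps how many group ids a single request may list, the other caps the value of each id. A small Python sketch of a pre-flight check, for illustration only. The function check_groups is hypothetical and not part of sphinxapi.php; the 4096 figure comes from the post and the 32-bit range from its "AFAIK".

```python
MAX_GROUPS_PER_REQUEST = 4096   # limit on ids per request, as quoted in the post
MAX_GROUP_ID = 2**32 - 1        # largest unsigned 32-bit value

def check_groups(group_ids):
    """Return the ids as a list, or raise ValueError if either limit is exceeded."""
    ids = list(group_ids)
    if len(ids) > MAX_GROUPS_PER_REQUEST:
        raise ValueError(
            f"{len(ids)} group ids in one request, limit is {MAX_GROUPS_PER_REQUEST}")
    for gid in ids:
        if not 0 <= gid <= MAX_GROUP_ID:
            raise ValueError(f"group id {gid} does not fit in an unsigned 32-bit integer")
    return ids

print(check_groups([394]))          # a single forum id passes
print(len(check_groups(range(4096))))  # exactly 4096 ids still passes
```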
Go ahead and post it. :)
Thanks Orban. |
Indeed, conf, patch and search would be fantastic :)
|
Cool, I'll give this a go this morning and see what happens..
Edit: http://dragy.de/public/sphinx.api.diff the file is giving a 404 back :( |
I'm getting a 404 on http://dragy.de/public/sphinx.api.diff, and am having some issues getting the src patch to apply, has anyone else managed to get it to apply?
orban, is there a reason you've removed the "Sort results by", "Find threads with", and "Find posts from" options from your search_forums template? They are still "doable" with multiple groups in sphinx, right? Ideally I'm looking to replicate the existing vb search, minus the "find as posts and threads" option because I just think that is confusing. |
I'm dumb, I didn't realize you didn't "make clean" prior to creating your diff, and didn't notice it was breaking on the lack of a Makefile as I was building off of a pristine src dir.
|
Yeah the patch is fine, if you've run configure before. I hadn't as I was using a fresh tarball so it won't apply cleanly. If you were to "make distclean" prior to generating your diff, it would apply cleanly for someone who had just untar'd the 0.9.6 source :)
I am rebuilding my indexes now, this is exciting! I think with date ranges this would probably be good enough to go live with! |
Fixed the diffs now, yeah the configure was the problem. Sorry about this.
date ranges: I added them... (changed template search_forums, search.php and includes/sphinx.php; it's all edited in my howto post already)... I didn't realise this was built in because it's not used in api/test.php or "search". (It is in sphinxapi.php, though.)
Now I've got a few users wanting "Show as threads" / "Show as posts" back. What did vB think when they added that >.< I mean, what does the search show when you are searching in thread titles and select "display as posts"? The first post in the thread? And when searching for posts and choosing "display as threads"? All the threads that the found posts are in?
The latter is impossible to run on large forums because, let's say you get 150.000 posts back, then you'd have to sort 150.000 threadids... I think those were the queries I had in my slow log with hundreds of thousands of threadids in them... that were killing the server.... smart vB. |
Quote:
|
Could anybody create this as a standard plugin?
And... are others seeing the same extraordinary benefits? |
It is not possible to make this a plugin unless they add a ton of hooks to search.php.
Not to speak of general *n*x knowledge you need to install this anyway. Owwwww I forgot a step Copy the sphinxapi.php to..hmm..some folder. :) |
gorman: there is simply no comparison at all between MySQL embedded fulltext search and Sphinx-based search, both in terms of speed and relevance.
BTW, that's what I meant when I was replying to you at vb.com forums, about custom search solution. |
Yeah I have yet to find a better forum solution. And at this point with Threads: 269,003, Posts: 1,588,154, Members: 175,576 I am not going to try and move it. We average like 4,000 users online at once during peaks and a lot of that is search traffic.
So cheers orban for finding and sharing this search solution. I was about to implement a google search wrap in the forum template. (ghetto style) |
I wish google offered a service to crawl your website in a closed environment for $ and then a search form. Like that google search appliance (?) but as an online service.
Oh well :) |
Shrug yep.. Gee I knew people would complain about three letter searches... :)
|
Haha ;)
|
Any thoughts on adding the option to display results as threads/posts?
|
Well let's have a look:
Search posts and display as threads:
Let's say somebody searches for "book" and gets 150.000 posts back. Those 150.000 posts are in 40.000 threads. If you find any way to fetch all 150.000 threadids, sort them and make a unique list of them, then let me know, but I really have no idea how to do that. I also think that this is a major problem of the vB search... (there are queries with several tens of thousands of threadids in them).
Search threads and display as posts:
I assume that "posts" means "first posts in a thread"? You can probably add "firstpostid" as a new group for the thread index and then grab those...
Curse vB for adding those options :( |
Quote:
|
I don't understand anyway what exactly the problem is....
If you are searching in thread titles, then the search returns a list of threads. If you are searching in posts, then the search returns a list of posts. |
Hmm got an interesting issue going on. When doing a search from forumdisplay i get this..
Warning: assert(): Assertion failed in /includes/sphinxapi.php on line 249 Query failed: searchd error: invalid or truncated request. |
Looking at the html source of the forumdisplay it looks like its getting set..
<input type="hidden" name="forumchoice[]" value="73" /> |
Yeah.. exactly what I have.
|
Add a "if ($vbulletin->userinfo['userid'] == 1) echo $forumchoice;" somewhere....to check if the value gets set...dunno... :(
|
Yeah its getting set.. hrm... What version of php do you use?
137 Warning: assert(): Assertion failed in /includes/sphinxapi.php on line 249 Query failed: searchd error: invalid or truncated request. |
Ok going to try this out now..
Ok that seemed to clean up the assertion issue. Last issue it seems is.. Query failed: searchd error: invalid group5 count 272485 (should be in 0..4096 range). Hmm ok this seems to be related to how the groupid is counted in the searchd.cpp hrm.. |
Quote:
Do only searches that HIT searchd get logged in query.log? My searches from the command line are working fine, but I can't seem to get them to hit searchd anymore via my test site.
Ok, I added that last bit of code but it doesn't seem to be fixed for me. Here's the output of a search targeted to a specific forum:
SphinxClient Object
(
    [_host] => db2
    [_port] => 3312
    [_offset] => 0
    [_limit] => 250
    [_mode] => 0
    [_weights] => Array
        (
            [0] => 100
            [1] => 1
        )
    [_groups] => Array
        (
            [0] => 394
        )
    [_groups2] => Array
        (
        )
    [_groups3] => Array
        (
        )
    [_groups4] => Array
        (
        )
    [_groups5] => Array
        (
        )
    [_sort] => 1
    [_min_id] => 0
    [_max_id] => 4294967295
    [_min_ts] => 0
    [_max_ts] => 4294967295
    [_min_gid] => 0
    [_max_gid] => 4294967295
    [_error] => searchd error: invalid or truncated request
    [_warning] =>
)
Query failed: searchd error: invalid or truncated request. |
Quote:
|
Ok here's what is and isn't working for me:
1) Searching all open forums for keywords - Works
2) Searching in a specific forum by keyword - Does not work
3) Searching all open forums by username - Works
4) Searching in a specific forum by username - Works |
Quote:
|
I'm using 5.1.5 at the moment. If I printout $forumchoice in sphinx.php, it is getting set, as well as making its way into the Sphinx request array.. I'm a bit puzzled.
|
Hmm nothing yet from me.. You making any progress?
|
nada
|
The one that puzzles me is..
Query failed: searchd error: invalid group5 count 271308 (should be in 0..4096 range). |
No errors on the command line..
|
Try to use the sphinxapi.php from my tar.gz.....?
Download it again I think I found the error...I fixed something in my sphinxapi.php and didn't copy it back to /api/ I'm so sorry :( |
sphinxapi from the gz worked perfect... Orban you rock! Can i send you a donation for this effort? :)
|
<- has had a few beers already.. Yeah i'll send them a donation for sure! Will you be updating this when 0.9.7 is released?
|
Sure thing.
I'll also try to make "show as posts" "show as threads" happen, but right now I don't see how it's possible. But you never know what I come up with ;) |
Right on.. :)
|
Yay! That fixed it for me too!
So, I'm thinking if you do plan on cleaning things up and releasing it at some point as a hack, that it would be best to gather up the "settings" into one file or at the top of the sphinx include. For example I have a multi server setup, so I specify the searchd server's ip rather than localhost, and I've renamed my indexes. To the average joe they might not notice or know how to make those changes to get things working. I'm going to do some more testing later on but things are looking very good :) |
Yeah...to be honest I intend to do that when 0.9.7 comes out where more than one group is supported natively and things should be a lot cleaner (and prolly faster too).
I also hope I can figure out the show as posts and show as thread until then (tho I believe best would be to use subscriptions for that [a member told me he was searching for his posts + show as threads to track threads he posted in: SUBSCRIPTIONS]). |
Quote:
Just collect all threadid's in an array, throw out the duplicates using array_unique, and voila, you have your results as threads. Quote:
|
Quote:
Quote:
I think it returns the first post of all threads found.... |
Orban just curious, how often do you re-index the big index?
|
Right on, I'll give it a go. I do have a weird one though. If I do a search for just a username and leave everything else default it will pull only older threads nothing new. hrm...
|
If you don't enter any search terms the default vB search should be used....
|
Quote:
So I guess you have more reasonable limit to the number of returned search results (around 1000?). At which point going through all of them suddenly doesn't look so bad. Quote:
Now, the original vB search implementation (the non-fulltext one) follows this logic. But the vB fulltext implementation throws this concept away and searches within the titles of the threads, displaying only the first posts in the threads found. I'll let you judge if this is correct or not.
Personally, I, too, think it's too confusing, but it's the legacy of the decision to allow each post to have its own title. Most members don't bother to type anything in the post title field when replying, and even if they do, it's completely inconspicuous in the default vB layout (and in most vB layouts I've seen, for that matter). But it's there, and it's there for good, so we should bear with it.
*edit*: cool, 100 posts! I'll let it sit there for some time ;-) |
Quote:
Let's assume you have
thread1 - 100 times "word"
thread2 - 50 times "word"
thread3 - 10 times "word"
thread4-50 - 5 times "word"
A search for "word" will return us 2500 posts. BUT there are only 50 different threads. If your limit is 1000 (like mine) this will only return like 30 threads. So you're missing out on 20... I'm actually seeing this on very common words (when searching posts and "show as threads").
----------
1. Search Titles Only - Show as Threads = full text index on thread titles
2. Search Titles Only - Show as Posts = full text index on post titles
3. Search Entire Posts - Show as Threads = full text index on posts but grab threadids and display them, basically grouped by thread
4. Search Entire Posts - Show as Posts = full text index on posts
1., 3. and 4. are working already. 2. is not (yet). I'll need to fix this then. (At the moment it's searching thread titles only and displaying the first post.) Also it's not weighting post titles/bodies yet (I think). |
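The effect of capping the post results before collapsing them into threads can be sketched in a few lines. This is Python for illustration only. The data is hypothetical (50 threads with 50 matching posts each, returned grouped by thread, which is the worst case) and deliberately not the exact counts from the post above. dict.fromkeys plays the role that array_unique would play in PHP.

```python
def threads_from_posts(post_thread_ids, limit):
    """Keep the first `limit` matched posts, then list each thread once, in order."""
    truncated = post_thread_ids[:limit]
    return list(dict.fromkeys(truncated))  # order-preserving de-duplication

# Hypothetical result set: one threadid per matching post.
matches = [thread_id for thread_id in range(1, 51) for _ in range(50)]

all_threads = threads_from_posts(matches, limit=len(matches))
capped = threads_from_posts(matches, limit=1000)

print(len(matches), len(all_threads), len(capped))  # 2500 50 20
```

De-duplicating is cheap. The threads are lost earlier, at the moment the post list is cut to the result limit, so raising the limit or grouping by thread inside the search engine are the only ways to get them back.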
Guys, when you compiled Sphinx, did you specify the mysql directory or did you simply use --with-mysql?
Thanks. |