Sunday, February 21, 2010

Speeding Up Search with a Search Engine:

Do you want to add a full-text search engine to your website? People come to the web looking for information, so why not speed up your search with Lucene, Sphinx, or another open source search engine? There are other search engines out there, but so far I have found these two the easiest to integrate. The question is which one to use for your website.

Maybe your website is already running on a server, and most of us use MySQL for web applications. In that situation I think Sphinx is the better choice. Lucene is much more advanced than Sphinx, but it requires some Java code to pull the data from MySQL and index it; in fact, there is no PHP API for it either (check Nutch, it may have something for you). Sphinx, on the contrary, requires no coding as long as you do not need a live index. If you do need a live index, you just have to write a small piece of shell code to run the indexing and merging operations.

I run Sphinx on both Windows and Linux. Here I am going to show you how to run Sphinx on Windows and do indexing and merging. I hope this will help you.


Install Sphinx:
Download Sphinx (I use sphinx-0.9.9-win32) from the Sphinx site. Since I am running it on Windows, download the zip version and unzip it.


Sphinx Configuration:
If you do not need a live index, put the following configuration into your sphinx.conf (you can keep the file anywhere):

source Files
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = root
sql_db = data_base
sql_port = 3306
sql_query_pre = SET NAMES utf8

sql_query = SELECT id, name, size, address FROM Files
sql_ranged_throttle = 0

}

index Files
{

source = Files
#docinfo = extern
path = /path/to/index
min_word_len = 3
}


indexer
{
mem_limit = 32M
}


searchd
{
port = 9313
log = /path/to/searchd.log
query_log = /path/to/query.log
pid_file = /path/to/searchd.pid
}


Now open a command prompt and run the following commands.

For indexing:
Path/to/indexer --config path/to/sphinx.conf Files [or put --all in place of the index name]

For searching:
Path/to/search --config path/to/sphinx.conf --index Files 765432

So how are you going to use it for web pages? Sphinx provides a nice API for PHP, Java, and Ruby; you get them in the zip you downloaded. Let's see how we can use it in PHP.

Run the following command in the prompt (do not close the command prompt afterwards):
Path/to/searchd --config Path/to/sphinx.conf

Run the following code on your web server (I use Apache):

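<?php
// sphinxapi.php ships in the api directory of the Sphinx zip (path/to/sphinx/api)
require('sphinxapi.php');
$sp = new SphinxClient();
$sp->SetServer('localhost', 9313);   // same port as in the searchd section
$sp->SetMatchMode(SPH_MATCH_ALL);    // match documents containing all query words
$sp->SetArrayResult(true);           // return matches as a plain array
$results = $sp->Query('kamal');
if ($results === false) {
    // Query() returns false on failure; GetLastError() explains why
    echo 'Query failed: ' . $sp->GetLastError();
} else {
    print_r($results);
}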

Isn’t it easy to use Sphinx?

Now what about the live index, and what is it? This is the most critical part: you want to keep your users up to date, so you need to keep the index up to date. One solution is to simply reindex everything, but reindexing is slow. Believe me, when you have gigabytes of data to index it takes a lot of time. The Sphinx documentation also notes that its index format is designed for maximum indexing and searching speed, and that this comes at the cost of slow updates.

That is why merging comes into the picture. Merging is relatively easier and cheaper than reindexing. Check the following configuration for live indexing:

source Files
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = root
sql_db = data_base
sql_port = 3306 # optional, default is 3306
sql_query_pre = SET NAMES utf8

sql_query = SELECT id, name, size, address FROM Files
sql_ranged_throttle = 0
}


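source delta : Files
{
sql_query_pre = SET NAMES utf8
# NOTE: this query must select only the rows added since the main index was
# built (for example with a WHERE clause on id); as written it selects every row.
sql_query = SELECT id, name, size, address FROM Files

}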


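index Files
{

source = Files
path = path/to/index
min_word_len = 3
}

# The delta index inherits the settings of Files; source and path must be
# overridden (path/to/delta_index is a placeholder).
index delta : Files
{
source = delta
path = path/to/delta_index
}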


indexer
{
mem_limit = 32M

}

searchd
{
port = 9313
pid_file = /path/to/log/searchd.pid

}


Now run the following commands for indexing and merging:

In the first phase:
indexer --config path/to/sphinx.conf Files
For the rest of the phases:
indexer --config path/to/sphinx.conf delta
indexer --config path/to/sphinx.conf --merge Files delta --rotate --merge-dst-range deleted 0 0

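The --rotate switch is required if the destination index (DSTINDEX, here Files) is already being served by searchd. The last option, --merge-dst-range deleted 0 0, keeps only the documents whose deleted attribute is 0 in the destination index, so documents marked as deleted are dropped during the merge (this assumes your index has a deleted attribute).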
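
The --rotate rule could be scripted roughly like this. It is only a sketch: the function names, the SPHINX_CONF and PID_FILE variables and their default values are my own placeholders, and it assumes that a pid file naming a live process means searchd is serving the index.

```shell
#!/bin/sh
# Sketch only: SPHINX_CONF, PID_FILE and both function names are placeholders.
SPHINX_CONF="${SPHINX_CONF:-path/to/sphinx.conf}"
PID_FILE="${PID_FILE:-/path/to/log/searchd.pid}"

# Succeeds when the pid file is non-empty and names a process we can signal.
searchd_running() {
    [ -s "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null
}

# Prints the merge command, adding --rotate only while searchd is running.
merge_command() {
    cmd="indexer --config $SPHINX_CONF --merge Files delta"
    if searchd_running; then
        cmd="$cmd --rotate"
    fi
    echo "$cmd --merge-dst-range deleted 0 0"
}
```

Calling merge_command only prints the command, so you can check it before you actually run it.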

After the first phase, never run a full reindex on Files, because this is our main index, the one users search.
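
One way to avoid reindexing Files by accident is to let a small script decide which index to build. This is a sketch under my own assumptions: INDEX_PATH and next_index_to_build are placeholder names, and the script treats the index header file (the index path plus .sph) as the sign that the first phase has already run.

```shell
#!/bin/sh
# Sketch only: INDEX_PATH and next_index_to_build are placeholder names.
# INDEX_PATH should match the "path" setting of the Files index.
INDEX_PATH="${INDEX_PATH:-path/to/index}"

# Prints which index the next indexer run should build.
# A built index has a header file named <path>.sph, so its presence is
# used here as the sign that the main index already exists.
next_index_to_build() {
    if [ -f "$INDEX_PATH.sph" ]; then
        echo delta
    else
        echo Files
    fi
}
```

You could then run: indexer --config path/to/sphinx.conf "$(next_index_to_build)"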

I hope this helps you implement Sphinx and speed up your search.


You can write a shell script or a Java thread (as I am a Java developer) to automate this process.
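
A minimal shell sketch of that automation, for Linux, might look like the following. The INDEXER, SPHINX_CONF and LOCK_DIR variables, their defaults and the function names are placeholders of mine, not something Sphinx provides. It also assumes searchd is already serving Files, because it always passes --rotate.

```shell
#!/bin/sh
# Sketch only: INDEXER, SPHINX_CONF, LOCK_DIR and the function names are placeholders.
INDEXER="${INDEXER:-indexer}"
SPHINX_CONF="${SPHINX_CONF:-path/to/sphinx.conf}"
LOCK_DIR="${LOCK_DIR:-/tmp/sphinx-refresh.lock}"

# Rebuild the small delta index, then merge it into the main Files index.
# --rotate assumes searchd is already serving Files.
refresh_index() {
    "$INDEXER" --config "$SPHINX_CONF" delta || return 1
    "$INDEXER" --config "$SPHINX_CONF" --merge Files delta --rotate \
        --merge-dst-range deleted 0 0
}

# Run one refresh, skipping it when a previous run still holds the lock.
# mkdir serves as the lock because creating a directory is atomic.
run_once() {
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "previous refresh still running, skipping" >&2
        return 0
    fi
    status=0
    refresh_index || status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}
```

You can call run_once from cron, or from a loop such as: while true; do run_once; sleep 300; done (the 300 second interval is just an example).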

