I’m pleased to announce the release of the Narada Search Engine Application, version 0.202, as well as a ready-to-use AMI, ami-140df07d.
The source release can be found on Launchpad at lp:narada, and a packaged version at http://patg.net/downloads/narada-0.202.tar.gz
What is Narada? Narada is a search engine application. It has a web application component, but it also takes a novel approach on the back end, implementing its functionality with Gearman, memcached, Sphinx and either MySQL or Drizzle as a data store. It is somewhat of a proof of concept, born from an idea I had late one night while working on my first book, Developing Web Applications with Apache, MySQL, memcached, and Perl. I wanted to show how useful these great new tools (Gearman, memcached and Sphinx) are for scaling out functionality and data, along with MySQL and, later, Drizzle.
While developing this application, I realized that it was not only an interesting concept to discuss and detail in a book, but also a practical application that others would find useful. Hence I decided to release it as an open source project that others could put to use and, hopefully, contribute to.
After the initial release of Narada, a PHP port was added by Eric Day, and a Java port was written by Trond Norbye and the Sun Cloud Computing team and presented at JavaOne two years ago. The PHP port is not as up to date as the Perl port, and the Java port was lost in the shuffle of mergers, but I intend to at least bring the PHP API up to date soon. One of the ideas I wish to stress is that with Narada, you are not forced to use any particular language for implementation. The workers and web UI can be implemented any way you like. It’s the architectural concept of Narada that is important: using Gearman workers, a distributed index like Sphinx, and a write-through cache with memcached, as well as either MySQL or Drizzle for the data store. All language implementations are included with Narada.
In terms of specific technical functionality, Narada consists of a very simple web application front end with a back end that uses Gearman (Gearman workers specifically), Sphinx, memcached and MySQL or Drizzle to implement the real functionality of the application. The basic user experience for Narada starts with what I call a URL queue web UI, where one enters URLs that can be thought of as “seed” URLs.
I thought it would be a great idea to use gearmand to dispatch what I call “url_fetcher” workers, which obtain the content of the remote document at the submitted URL, look for any links and, recursing to a given depth, fetch yet more remote documents. For each remote document that the url_fetcher worker retrieves, it caches the content of that document temporarily in memcached, updates a very simple ID table in the database that contains a “catalogue” of temporarily cached documents, and updates a simple counter item in memcached to indicate the number of documents cached. The idea behind this is first performance, so that each retrieval doesn’t require inserting a record containing a blob field into the database, and second to eventually “bulk” insert these cached documents at a later point in time, albeit after only a short period.
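To make the url_fetcher flow concrete, here is a minimal Python sketch of the idea; it is not Narada's actual code. A plain dict stands in for memcached, a list stands in for the ID "catalogue" table, and the names fetch_url, LinkParser and the "doc:" / "doc_count" keys are all hypothetical. The fetch argument is any callable that returns a page's HTML or None, so the sketch runs without a network or a Gearman server.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every anchor tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_url(url, depth, fetch, cache, catalogue, seen=None):
    """Fetch url, cache it, then follow its links up to `depth` more levels.

    cache     -- dict standing in for memcached (assumption)
    catalogue -- list standing in for the ID table of cached documents
    """
    seen = set() if seen is None else seen
    if url in seen:
        return  # never fetch the same document twice in one crawl
    seen.add(url)

    content = fetch(url)
    if content is None:
        return  # document could not be retrieved

    cache["doc:" + url] = content                       # temporary copy of the document
    catalogue.append(url)                               # record that it is cached
    cache["doc_count"] = cache.get("doc_count", 0) + 1  # counter of cached documents

    if depth <= 0:
        return
    parser = LinkParser()
    parser.feed(content)
    for href in parser.links:
        # Resolve relative links against the current document's URL.
        fetch_url(urljoin(url, href), depth - 1, fetch, cache, catalogue, seen)
```

In a real deployment the recursive call would more likely be submitted as a new Gearman job rather than made in-process, so that the fetches can be spread across many workers.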
For each document the url_fetcher worker retrieves and caches, it in turn requests another worker through gearmand, the “url_store” worker. Whenever it is called, the url_store worker checks the counter item that the url_fetcher worker incremented for each document retrieved and compares it to a given value. If they match, it determines that the cached documents should be stored permanently in the database. These cached documents are bulk-inserted into the database for the benefit of write performance, and a simple counter item set to the number of documents stored in the database is cached in memcached.
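The url_store step can be sketched in the same spirit. This is an illustration under assumptions, not Narada's implementation: sqlite3 stands in for MySQL or Drizzle, a dict stands in for memcached, and store_if_ready, BATCH_SIZE, the documents table and the "doc_count" / "stored_count" keys are hypothetical names. The sketch also treats the threshold as "at least this many" rather than an exact match, and keeps a running total in the stored counter.

```python
import sqlite3

BATCH_SIZE = 3  # hypothetical threshold; the real value is a configuration choice


def store_if_ready(cache, catalogue, db, batch_size=BATCH_SIZE):
    """Bulk-insert the cached documents once enough of them have accumulated.

    Returns the number of rows written, or 0 if the threshold is not reached.
    """
    if cache.get("doc_count", 0) < batch_size:
        return 0  # not enough cached documents yet, nothing to do

    # Move every catalogued document out of the cache and into one batch.
    rows = [(url, cache.pop("doc:" + url)) for url in catalogue]

    with db:  # a single transaction for the whole batch
        db.executemany(
            "INSERT INTO documents (url, content) VALUES (?, ?)", rows
        )

    catalogue.clear()
    cache["doc_count"] = 0  # start counting the next batch
    # Counter that a reindex worker could check later.
    cache["stored_count"] = cache.get("stored_count", 0) + len(rows)
    return len(rows)
```

Writing the batch with one executemany call inside one transaction is what gives the write-performance benefit: the database sees a single bulk write instead of one blob insert per fetched document.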
The url_store worker also requests, through gearmand, a Sphinx reindex worker. When called, this worker checks the counter set by the url_store worker for the number of documents stored, and if the value matches, the Sphinx indexer is called to rebuild the indexes. The other component, which makes use of this indexed content, is a web UI that allows a user to enter a search term and returns the URLs and an excerpt with the search term highlighted; the user can then navigate to the remote document or site that contains the search term. This functionality is implemented by using Sphinx to return the document ids for that search term, which are then used to obtain the document content from MySQL for display. Sphinx is also used to build the excerpts for the search results displayed.
The image above shows a diagram of the components described in this post. You’ll see that there are several components, all of which are quite scalable, making Narada an ideal search application with scaling built in from conception.
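The search path (index returns ids, database returns content, excerpts are built around the hit) can be illustrated with a small Python sketch. None of this is the Sphinx API: a dict of term to document ids stands in for the Sphinx index, build_excerpt is a simplified stand-in for Sphinx's excerpt builder, sqlite3 stands in for MySQL, and all function, table and column names are hypothetical.

```python
import re
import sqlite3


def search_ids(index, term):
    """Stand-in for the Sphinx query: ids of the documents that contain term."""
    return sorted(index.get(term.lower(), ()))


def build_excerpt(content, term, radius=30):
    """Stand-in for Sphinx excerpts: text around the first hit, term in <b> tags."""
    match = re.search(re.escape(term), content, re.IGNORECASE)
    if match is None:
        return content[: 2 * radius]
    start = max(match.start() - radius, 0)
    end = min(match.end() + radius, len(content))
    return re.sub(
        re.escape(term),
        lambda m: "<b>" + m.group(0) + "</b>",
        content[start:end],
        flags=re.IGNORECASE,
    )


def search(index, db, term):
    """Return (url, excerpt) pairs for every document matching term."""
    results = []
    for doc_id in search_ids(index, term):
        # The index only yields ids; the content itself lives in the database.
        row = db.execute(
            "SELECT url, content FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is not None:
            results.append((row[0], build_excerpt(row[1], term)))
    return results
```

Keeping the index lookup and the content lookup separate mirrors the division of labour described above: the index answers "which documents?", and the data store answers "what do they contain?".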
The next image shown is an example of the Search results page.
And finally, an image displaying the URL Queue page, where you enter “seed” URLs.
This release, 0.202, mainly consists of changes to the web UI mod_perl handlers, the workers and the core library. Also changed is the Sphinx configuration: I want a language-agnostic configuration file written in YAML, which I have implemented for the Perl port and will soon implement for the PHP port. I’ve also made the Perl port as easy to install as possible, with PHP’s deployment model in mind, and added a lot of documentation about how to install and run Narada.
The AMI I also released, ami-140df07d, is an out-of-the-box, ready-to-use AMI that has everything set up for you to immediately run Narada on your website. Upon logging in to your instance as the “narada” user, you will find a file in the home directory called “README.1st”. Read this, and you will be on your way to running Narada!
As a final note, next week at the MySQL Users Conference I will be giving a talk and demonstration of Narada. I hope to see you there!