Typically, passwords are saved in databases using a one-way hash function such as md5. In other words, if my password is “hello”, the database stores it as “5d41402abc4b2a76b9719d911017c592”. Each time a user attempts to log in, the md5 algorithm is applied to the provided password, and if the result matches the hash stored in the database, access is granted, as in the following:
if (md5($_POST['pwd']) === $saved_hash) {
    // user is logged in
} else {
    // user password was incorrect
}
Saving this hashed password is more secure than saving plain text passwords because if a database is temporarily compromised, at least the attacker will not have direct access to users’ passwords. However, despite not being able to reverse the hash (remember, this is a one-way function), the intruder might still be able to crack many of your users’ passwords through precomputation.
An attacker could go through the dictionary (or any set of possible passwords), precomputing the md5 hash of each entry. So, if the attacker were to see that my hash was “5d41402abc4b2a76b9719d911017c592”, he would just look it up in his reverse database and see that this hash maps to “hello”. There are many such reverse lookup databases on the web. This one successfully cracks the mentioned password.
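To make the precomputation attack concrete, here is a minimal sketch. The word list and the function name build_reverse_table are made up for illustration; a real attacker would work from a far larger list.

```php
<?php
// Hypothetical helper: precompute md5 hashes for a list of candidate passwords.
function build_reverse_table(array $candidates): array
{
    $table = [];
    foreach ($candidates as $word) {
        // key: md5 hash, value: the password that produced it
        $table[md5($word)] = $word;
    }
    return $table;
}

// A tiny made-up word list standing in for a full dictionary.
$table = build_reverse_table(['password', 'letmein', 'hello', 'qwerty']);

// Hash taken from the compromised database.
$stolen_hash = '5d41402abc4b2a76b9719d911017c592';

// A single array lookup recovers the password if it was in the list.
$cracked = $table[$stolen_hash] ?? null;
echo $cracked === null ? "not found\n" : "cracked: $cracked\n";
```

The expensive part (hashing every candidate) happens once, and every stolen hash after that costs only a lookup.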
Adding Salts
The use of salts greatly decreases the effectiveness of a precomputation attack. A salt is a random string appended to the password before hashing. Typically each user would receive a unique salt.
$saved_hash = md5($pwd . $salt);
Let’s examine the implications when the salt is public (stored in the compromised database) as opposed to when the salt is private:
- Public Salt - The attacker’s reverse lookup table (often loosely called a rainbow table) will no longer be useful. He will need to generate a new table for the specific user’s salt. While this is still very possible, the attacker will need to repeat the operation for each user, which makes cracking a large number of passwords very challenging.
- Private Salt - For the attacker to actually compromise a password, he would need to compute the md5 of each possible password appended to each possible salt. If your salts were 32 bits long, the attacker would need to compute 800 trillion hashes or so to cover the English dictionary. This would be practically impossible.
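The “800 trillion or so” figure can be reproduced with a quick calculation. The dictionary size of about 200,000 words is my assumption for illustration, not a number taken from a specific dictionary.

```php
<?php
// Assumption: a dictionary of roughly 200,000 words.
$dictionary_words = 200000;

// A 32-bit salt can take 2^32 distinct values.
$possible_salts = 2 ** 32;

// Every word has to be hashed with every possible salt.
$hashes_needed = $dictionary_words * $possible_salts;

echo number_format($hashes_needed), "\n"; // 858,993,459,200,000
```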
Therefore, public salts are better than no salts, but private salts are much better than public salts. So, how does one keep their salt private? You can’t store it in your database because all this assumes your database was compromised. My suggestion is to create a salt based on the md5 of immutable data related to that user (and be very careful to not delete/modify that piece of data). For example, the user’s registration timestamp could be used. As long as your attacker was unable to also steal your application code the salt would be safe. This works out as the following:
$salt = md5($registration_timestamp);
$saved_hash = md5($password . $salt);
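Put together, registration and login under this scheme might look like the sketch below. The function names and the sample timestamp are hypothetical, and md5 is kept only to match the examples above.

```php
<?php
// Hypothetical helper: derive the salt from immutable user data so it is never stored.
function derive_salt(int $registration_timestamp): string
{
    return md5((string) $registration_timestamp);
}

// Hash that gets stored in the database at registration time.
function hash_password(string $password, int $registration_timestamp): string
{
    return md5($password . derive_salt($registration_timestamp));
}

// Recompute the hash at login and compare it with the stored one.
function verify_password(string $password, int $registration_timestamp, string $saved_hash): bool
{
    // hash_equals compares the two strings in constant time
    return hash_equals($saved_hash, hash_password($password, $registration_timestamp));
}

$registered_at = 1230768000; // made-up registration timestamp
$saved_hash = hash_password('hello', $registered_at);

var_dump(verify_password('hello', $registered_at, $saved_hash)); // bool(true)
var_dump(verify_password('hullo', $registered_at, $saved_hash)); // bool(false)
```

Note that the stored hash no longer equals md5('hello'), so a lookup table built from unsalted hashes would not find it.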
Phing is a great little tool built on PHP for creating project builds. It is based on Apache Ant. For those who are developing in PHP, Phing is a natural choice as both project and build tool can share the same environment.
Reasons to use Phing
- To automate creation of daily builds of your project. The usefulness of daily builds is an essay in itself, but essentially it boils down to the ability to constantly see the effects that different contributors are having on a product before it becomes too cumbersome to turn back (easier integration). See here and here for more detail.
- Easier deployment. If your project requires several steps to create a build, the full process can be automated in the build.xml file. Prewritten commands exist for the most common tasks such as svn checkout/update/etc, file system changes (rm, cp, mv), tarring/untarring, and so on. And if a task does not exist, it’s simple to extend Phing with your own.
- Database Version Control - One of the largest challenges groups of programmers face is maintaining changes to database schema. Phing would allow you to create a task or set of tasks to download schema changes from either Subversion or a database and apply those changes automatically to the developer’s database. (You could of course customize this behavior to suit your needs. For example, some people would prefer for Phing to create a .sql file that is manually applied.)
- Simplicity - Phing really just boils down to two components. You have a set of variables (aka properties). And then you have a list of instructions (the build.xml file). The properties are used to help Phing complete the list of instructions.
Example
To get an idea of how simple the XML for Phing is, take a look at the following example:
<tar destfile="./build/build.tar.gz" compression="gzip">
  <fileset dir="./build">
    <include name="*" />
  </fileset>
</tar>
The above takes all the files within the build directory and compresses them into a build.tar.gz file. For more examples like the above check out the User’s Guide.
This past weekend I attended my first BarCamp Boston. I must say it was quite good. BarCamp is a series of “unConferences” which are organized on the fly by attendees, without any formal registration fee. So, of course, the quality of the talks is not quite up to the standard of formal conferences, but you don’t have to fly around the country to attend (usually to Silicon Valley) and you don’t have to pay $1000+, while you still learn a lot.
Some of my favorite sessions from the weekend included:
- iPhone - Development, Marketing, Best Practices, & App Store Ideas
- Twitter for Business
- Web App Design for Developers
BarCamp Boston is only once a year, but there are some other similar quality groups/events you can participate in throughout the year in Boston.
It’s a very common task for a web application to uniquely identify a visitor by a combination of username and password. However, not as trivial is identifying a third party attempting to use an API to access your web service on behalf of end users of their third party service. You often don’t want to force the end user to create a relationship with your service (such as would be required with OpenID) but instead allow the third party to use your API transparently (such as with Amazon). So, the task at hand is how to uniquely identify the third party making use of your API while preventing forgery and without requiring any sort of login system.
The solution starts with providing each third party service with a unique public key. The public key is used to determine which third party service the request claims to be from. As expected, each public key has an associated private key. (Despite the names, this is a shared secret rather than an asymmetric key pair: the public key acts as an identifier, and the private key is known to both sides.) The private key is used to compute a signature of the request message. The API user then sends that signature along with the request. If the signature sent by the third party service matches the expected signature, then it’s safe to allow the request.
This method works because only you (the owner of the API) and the third party service have access to the private key. The third party signs its message using the private key and then sends the signature WITH the plain message. The API owner then takes the plain message and signs it with the private key (which it looked up based on the public key provided in the request). If the signature generated by the API owner and the signature sent in the request match, it can be trusted that the request came from the owner of the public key.
Here is some PHP code for the third party side of things. Basically, the message is the url with an action of “friends.get”. The message is signed, and the signature is then appended to the url along with the public key. A request is then made to that url. The API owner will then process the request by verifying the identity of the requester (as mentioned above) and send back an appropriate response.
// your assigned public key which will be included in the api request
$public_key = "abcdefghijklmnopqrstuvxyz";
// your assigned private key which will always be hidden
$private_key = "zyxvutsrqponmlkjihgfedcba";
// url of the api request which is essentially the message
$url = "http://www.apisite.com/api.php?action=friends.get";
// create a signature based on the api request using the private key
$signature = hash_hmac("sha512", $url, $private_key);
// the final api url with the public key and signature appended
$api_url = $url . "&public_key=" . $public_key . "&signature=" . $signature;
// fetch the url
$api_request_data = file_get_contents($api_url);
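For the API owner’s side, the check described above could be sketched roughly as follows. The function name verify_api_url and the in-memory key table are hypothetical (a real service would look the private key up in its database), and the sketch assumes the two parameters were appended exactly as in the client code above.

```php
<?php
// Hypothetical lookup table: public key => private key.
$private_keys = ['abcdefghijklmnopqrstuvxyz' => 'zyxvutsrqponmlkjihgfedcba'];

// Returns true if the signature in the url matches the one we compute ourselves.
function verify_api_url(string $api_url, array $private_keys): bool
{
    // Pull public_key and signature out of the query string.
    parse_str((string) parse_url($api_url, PHP_URL_QUERY), $params);
    $public_key = $params['public_key'] ?? '';
    $signature  = $params['signature'] ?? '';
    if (!is_string($public_key) || !is_string($signature) || !isset($private_keys[$public_key])) {
        return false; // missing parameters or unknown third party
    }

    // The signed message is the url as it was before the two parameters were appended.
    $suffix = '&public_key=' . $public_key . '&signature=' . $signature;
    if (substr($api_url, -strlen($suffix)) !== $suffix) {
        return false;
    }
    $url = substr($api_url, 0, -strlen($suffix));

    // Recompute the signature with the looked-up private key and compare in constant time.
    $expected = hash_hmac('sha512', $url, $private_keys[$public_key]);
    return hash_equals($expected, $signature);
}

// Build a request the same way the third party would.
$url = 'http://www.apisite.com/api.php?action=friends.get';
$signature = hash_hmac('sha512', $url, 'zyxvutsrqponmlkjihgfedcba');
$api_url = $url . '&public_key=abcdefghijklmnopqrstuvxyz&signature=' . $signature;

var_dump(verify_api_url($api_url, $private_keys)); // bool(true)

// Changing the action without knowing the private key invalidates the signature.
$forged = str_replace('friends.get', 'friends.delete', $api_url);
var_dump(verify_api_url($forged, $private_keys)); // bool(false)
```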
Over the last few years, Ruby on Rails has been the “hip” thing in the web development world. For various reasons, I haven’t taken more than a cursory glance at the framework or language. Primarily, it’s because I’m very proficient in PHP and I’ve had the opportunity to use the language of my choice, which ended up being PHP. But it’s always good to keep up with trends and not limit oneself to a particular language. Increasingly, people want Ruby experience. I would recommend that a web developer have expert proficiency in one web scripting language (PHP, Ruby, Python, Perl, etc.) and intermediate proficiency in at least one other.
I was at the bookshop today looking at Ruby on Rails books and one particular book caught my eye: Rails for PHP Developers. The book shows, example by example, how to achieve particular goals with PHP code and then with Ruby code. For those who, like me, want to quickly understand the differences between the two languages, the book seems like it will be very useful.
Facebook Scribe is a “server for aggregating log data streamed in real time from a large number of servers. It is designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine.” At first glance it looks pretty cool, and it has the potential to fill many needs.
But yesterday I was trying to research whether Scribe is appropriate for a task that I had in mind. Unfortunately, it seems like documentation and tutorials are very limited when it comes to Scribe. And those that exist are hard to find. Of course I could download Scribe and work with it hands-on to determine its suitability. But usually it’s best to gather information about the product before undertaking the time-consuming task of installing, configuring, and testing it.
So for others out there who are also struggling to find information about Scribe, here are a few resources to turn to. Let me know if you find others and I will add them to this list.
- Scribe SourceForge Wiki - It’s all of 5 pages right now, but it’s the best documentation that exists
- Scribe SourceForge Mailing Lists - Activity is sparse but it does seem like the developers reply to the list
- Installing Scribe Tutorial - from Cloudera
- Configuring and Using Scribe for Hadoop Log Collection Tutorial - from Cloudera
- High Scalability Article on Scribe
- Facebook Engineering Blog Post - Explains the major design decisions made while building Scribe
Also keep in mind that the Scribe download package itself has a couple of example configurations that you can reference.
But, as you can see, there just is not much written about Scribe on the web (except news posts announcing its open source launch). I don’t think that reflects on the quality of the product, although it might reflect on the usefulness of the product for most websites (which don’t have millions of users). But I’m sure that if the Facebook developers would enhance the documentation and provide a few examples of end-to-end use cases, it would spur more developers to try the product. More developers trying (and writing about) the product would spur on even more developers to try it. And so on.
Usually Google Analytics or similar tools cover so many metrics that creating your own web analytics tool is redundant. However, for certain custom metrics you might want to collect your own statistics. For example, on RateDesi Hungama I want to know how many videos are played each day.
At first, I put my counter into my web app source code (in PHP) within the same function that fetched and displayed the video. However, the number of videos played seemed outrageously high compared to the number of pageviews as reported by Google Analytics. Looking into my server logs, I realized over 50% of the played videos at the time were due to bots or spiders and that Google Analytics must have been excluding those visitors. So what was Google Analytics doing that I wasn’t?
Most bots collect the HTML source of the page visited. A bot may or may not also follow hyperlinks on the page. However, most will not execute the JavaScript or load the iframes that are included in the page. So, Google Analytics was only counting a visitor when its JavaScript was executed, which more accurately reflected the number of visitors. My script, on the other hand, was incrementing the videos played counter each time the source code was downloaded.
So, to exclude such visits, you could use JavaScript (like Google Analytics does), an iframe, or even an image tag. Regardless of the method you choose, it will call back to a server-side script that increments your counter. Your statistics will now be much more accurate.
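As a rough sketch of the image tag variant, the server-side script could look something like the following. The helper names, the JSON file used as storage, and the count.php filename are all made up for illustration; my actual counter is not shown here.

```php
<?php
// Hypothetical helper: add one to the count for $date in a small JSON file
// and return the new count.
function increment_daily_counter(string $file, string $date): int
{
    $handle = fopen($file, 'c+');   // create if missing, do not truncate
    flock($handle, LOCK_EX);        // serialize concurrent requests
    $counts = json_decode((string) stream_get_contents($handle), true) ?: [];
    $counts[$date] = ($counts[$date] ?? 0) + 1;
    ftruncate($handle, 0);
    rewind($handle);
    fwrite($handle, json_encode($counts));
    fflush($handle);                // make sure the data is written before unlocking
    flock($handle, LOCK_UN);
    fclose($handle);
    return $counts[$date];
}

// Bytes of a 1x1 GIF to send back to the browser.
function tracking_pixel(): string
{
    return base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
}

// In the callback script (say count.php, requested via <img src="/count.php">),
// the body would be roughly:
//   increment_daily_counter('/path/to/video_plays.json', date('Y-m-d'));
//   header('Content-Type: image/gif');
//   echo tracking_pixel();
```

Only clients that actually fetch the image trigger the counter, which is the same effect Google Analytics gets from requiring its JavaScript to run.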
As I’ve posted about before, Elgg is a very nice open source social networking platform which unfortunately has scaling problems due to its one-size-fits-all architecture. I’d like to point out another area where Elgg is making a suboptimal performance decision.
As in most social networks, users (and groups) can upload a profile photo. In Elgg, this photo (as well as other user uploaded data) is stored within a data directory that is not web accessible. Instead, a call is made directly to Elgg’s page handlers, which load all of Elgg’s libraries, then find the image in the data directory, and finally output it. Clearly, there is major overhead when even images, which one would think are static content, are actually routed through the PHP code.
Besides the obvious downsides, there are some hidden implications of not having standalone web accessible images. You will not be able to use a lightweight web server, such as nginx, in front of Apache to speed up serving of static content and take load off Apache. Plus, the Elgg code assumes that the image is available on local disk, which will preclude you from storing your data directory on a separate server unless you use some sort of shared disk (like a SAN).
On the bright side, some of these problems can be corrected, and there’s a good chance someone will have written a plugin to do so by the time you are reading this. Currently, the profile photo is an instance of an ElggFile, which is stored on an ElggFilestore. As of now, the only file store available is the ElggDiskFilestore. However, implementing an ElggFTPFilestore would allow your web server and data server to be separate. You would still have two performance issues: a) There would still be only one FTP location where your images are stored, so you could not load balance your images over several servers. b) Requests for images would still have to go through the Elgg PHP code.
To solve the second problem, you would need to override the profile photo plugin (called icon) to instead link to the user’s image with a normal img tag. The user’s image would of course have to be made web accessible. Setting that up would involve more administrative overhead, but you would have the advantage of being able to use a lightweight web server to serve static content.
If Elgg itself were lightweight, the implications of turning static content into dynamic content would not be as severe. However, each page load of Elgg requires dozens (often hundreds) of database queries, so large installations of Elgg would be best served by making their static content truly static.
Many of Google’s products use the Google Data protocol to power their APIs. To make interaction with the API simpler, most programmers will download a client library for their language of choice. Google recommends the Zend Gdata client library for PHP, which overall is a great client library but does have one major downside.
For RateDesi Hungama, I query the YouTube API (which uses Google Data) to retrieve video feeds and video entries. Each time a video page is loaded, the site needs to retrieve the feed from YouTube to display the video information. To avoid an API call on every video page, I have two options: either store the video information in my database and update it periodically, or cache the video information for a limited time.
I chose caching. The problem with the Zend Gdata client library is that any Gdata feed retrieved is gigantic. Each video entry object is a good 300 KB because a ton of metadata is kept within the object. If you allocated 500 MB to your cache, you would not be able to store even 1700 videos. In this case, you would probably want to look towards using a file based cache.
However, I usually prefer using memcached, which is an in-memory distributed cache. Thankfully, the PHP memcache extension does offer the option to automatically compress data. In PHP, when using memcache_set, set the flag to MEMCACHE_COMPRESSED and it will automatically serialize and compress the Gdata object. So, instead of a 300 KB cache entry, you will be left with a 17 KB or smaller cache entry.
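To get a feel for how much repetitive metadata shrinks, here is a small sketch that serializes and compresses a made-up structure with zlib, which is what the memcache extension uses for MEMCACHE_COMPRESSED. The array is only a stand-in for a Gdata entry, so the sizes it prints will not match the 300 KB and 17 KB figures above.

```php
<?php
// Stand-in for a bulky Gdata entry: lots of repetitive metadata (made-up structure).
$entry = [];
for ($i = 0; $i < 2000; $i++) {
    $entry[] = ['namespace' => 'http://example.com/schema', 'name' => 'field' . $i, 'value' => null];
}

$serialized = serialize($entry);
$compressed = gzcompress($serialized);   // zlib compression

printf("serialized: %d bytes, compressed: %d bytes\n", strlen($serialized), strlen($compressed));

// The data survives the round trip unchanged.
$restored = unserialize(gzuncompress($compressed));
var_dump($restored === $entry); // bool(true)

// With the memcache extension, the flag alone does the same work, e.g.:
//   memcache_set($memcache, 'video_' . $video_id, $entry, MEMCACHE_COMPRESSED, 3600);
// ($memcache, $video_id and the one-hour expiry are placeholders.)
```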
Lessons Learnt: Either roll your own Gdata client library for PHP, or use a file based cache with the Zend Gdata client library, or make sure to compress your cache entries.
I finally got around to adding my portfolio to the website. Please check it out. I’m slowly getting around to making the site more complete. I will add a resume soon too among other things.

