Directory Layout For User Uploads
Posted on 21st August 2008 by Sameer
Sites that accept user uploads (photos, documents, music, etc.) need to determine an appropriate directory structure to house the large number of files they will collect. At first glance you may decide to just prefix all filenames with a user ID and stick them all into one directory, perhaps broken up by file type:
/uploads
    /photos
    /music
    /documents
If user 75474 uploads a photo, it will be named 75474_randomstring.jpg and put in directory “/uploads/photos”. However, over time the photos (and music and documents) directories will become huge. File systems of practically all kinds do poorly with large directories. Things run slower, become more error prone, and batch operations become difficult. You do not want huge directories.
The next step logically is to break up the user uploads into a separate directory for each user:
/uploads
    /user1
        /photos
        /music
        /documents
    /user2
        /photos
        /music
        /documents
This method, while more organized, will also not scale: “/uploads” itself now gets one subdirectory per user, so it becomes the huge directory. The solution is nested subdirectories.
For example, with 100 subdirectories per level, a good solution would be if user 75474’s files ended up in “/uploads/70000/75400/75474/” and user 82145’s files ended up in “/uploads/80000/82100/82145/”. The following function aims to generically solve that problem:
// $numsubdirsperdir and $upperboundusers must each be a power of 10.
// Do not change $numsubdirsperdir or $upperboundusers after initial use.
function generatePath($userid, $numsubdirsperdir = 100, $upperboundusers = 1000000) {
    // Size of the user id range covered by each top-level directory.
    $level = $upperboundusers / $numsubdirsperdir;
    $subdir = "";
    while ($level > 1) {
        // Round the user id down to a multiple of the current range size.
        $subdir .= intval($userid / $level) * $level . "/";
        $level = $level / $numsubdirsperdir;
    }
    // The innermost directory is the user's own.
    $subdir .= $userid . "/";
    return $subdir;
}

generatePath(1);      // returns "0/0/1/"
generatePath(101);    // returns "0/100/101/"
generatePath(1001);   // returns "0/1000/1001/"
generatePath(10001);  // returns "10000/10000/10001/"
generatePath(524973); // returns "520000/524900/524973/"
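The function only returns a relative path, so the directories still have to be created somewhere. Below is a minimal sketch of one way to do that, assuming a helper named ensureUserDir() (hypothetical, not part of the original function) and a temporary directory standing in for the real “/uploads” root. generatePath() is repeated so the snippet runs on its own.

```php
<?php
// Copy of generatePath() from above, repeated so this sketch is self-contained.
function generatePath($userid, $numsubdirsperdir = 100, $upperboundusers = 1000000) {
    $level = $upperboundusers / $numsubdirsperdir;
    $subdir = "";
    while ($level > 1) {
        $subdir .= intval($userid / $level) * $level . "/";
        $level = $level / $numsubdirsperdir;
    }
    return $subdir . $userid . "/";
}

// Hypothetical helper: returns the user's upload directory under $root,
// creating any missing parent directories along the way.
function ensureUserDir($root, $userid) {
    $dir = rtrim($root, "/") . "/" . generatePath($userid);
    // The second is_dir() check covers another request creating it first.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Could not create $dir");
    }
    return $dir;
}

// Assumption: a temporary root stands in for the real /uploads directory.
$root = sys_get_temp_dir() . "/uploads_demo";
echo ensureUserDir($root, 75474), "\n"; // prints a path ending in "/70000/75400/75474/"
```

Passing true as the third argument to mkdir() creates all the intermediate levels in one call, so the nesting depth does not matter to the caller.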
First, you set an upper bound on the number of users that you believe your site will never reach. This tells the function how many levels of nesting it requires (more possible users means more levels of nesting). Then choose how many subdirectories per directory (I recommend 100 or 1000). However, do not change these values once your site goes live, because the paths generated for existing users would change!
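To get a feel for how the two settings interact, here is a small sketch that counts the levels of nesting above each user's own directory. The helper name nestingLevels() is hypothetical; it simply runs the same loop as generatePath() without building a path.

```php
<?php
// Hypothetical helper: counts the directory levels generatePath() would emit
// above the per-user directory for the given settings.
function nestingLevels($numsubdirsperdir = 100, $upperboundusers = 1000000) {
    $level = $upperboundusers / $numsubdirsperdir;
    $levels = 0;
    while ($level > 1) {
        $levels++;
        $level = $level / $numsubdirsperdir;
    }
    return $levels;
}

echo nestingLevels(100, 1000000), "\n";   // 2
echo nestingLevels(1000, 1000000), "\n";  // 1
echo nestingLevels(100, 100000000), "\n"; // 3
```

With the defaults, a user's files sit three directories deep: two range directories plus the user's own.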
If your upper bound on the number of users ends up being wrong, don’t worry. For example, if you set the upper bound to 1 million and the number of subdirectories to 100 but your site reaches 2 million users, then your top-level directory would end up with 200 subdirectories instead of 100. You could live with that easily. If you do guess wrong by a long shot, you could write a script to move things around (and then adjust the variables of the function), or maybe it’s time for a second file server.
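The arithmetic behind that 200 can be checked with a short sketch. The helper topLevelDirCount() is hypothetical, and it assumes user ids are handed out sequentially so that the highest id in use determines how many top-level directories exist.

```php
<?php
// Hypothetical helper: number of top-level directories in use once the
// highest user id is $maxuserid (assumes sequentially assigned ids).
function topLevelDirCount($maxuserid, $numsubdirsperdir = 100, $upperboundusers = 1000000) {
    // Each top-level directory covers this many user ids.
    $level = $upperboundusers / $numsubdirsperdir;
    return intval($maxuserid / $level) + 1;
}

echo topLevelDirCount(999999), "\n";  // 100, the site stayed within the upper bound
echo topLevelDirCount(1999999), "\n"; // 200, the site doubled past the upper bound
```

Only the top level grows past the chosen limit; the lower levels stay capped at the configured number of subdirectories.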
Note: If you want to fully ignore directory structure then check out Amazon S3 or MogileFS which use keys for file lookup.

