khaines / enigo.com wrote:
> On Mon, 23 Oct 2006, Francis Cianfrocca wrote:
>> On 10/22/06, snacktime <snacktime / gmail.com> wrote:
>>>
>>> I've tested out a couple of ways of storing a queue structure and
>>> wondered if others had some better ideas.  I tried a BerkeleyDB-based
>>> queue using its native queue structure, which gives the best
>>> performance so far, but it's for Unix only.  I also have a file-based
>>> queue where every message is just stored in a sequentially numbered
>>> file.  Almost as fast as BerkeleyDB and works anywhere.  Just for
>>
>> Kirk Haines was involved in a thread here a couple of months back in
>> which he made the case that it's perfectly fast and perfectly scalable
>> to just use the filesystem for moderate requirements. His scheme used
>> filenames that were actually hashes of the contents, and as I recall
>> he had a mechanism for automatically generating a directory hierarchy
>> based on the content hashes to keep the inode sizes under control.
> 
> Francis remembers right.  I did this for, essentially, a replacement for 
> pstore that didn't need to read an index into memory for each 
> transaction, so it'd have flat memory usage and speed regardless of the 
> number of elements in the store.  It works pretty well.
> 
> In my case I create a hash of the storage key and use that as part of 
> the filename, and then I have the code make subdirectories based on the 
> starting characters of the hash in order to avoid having too many files 
> in any one directory.  My default was to take two two-character chunks 
> from the beginning of the hash as nested subdirectory names.
> 
> So a hash starting with 'abd45cc' would have its file in ab/d4/
> 
> The speed of the approach is tied very directly to the speed of the IO 
> system, but it's relatively fast and very portable so long as your file 
> naming scheme is itself portable.

Kirk,

Did you get a feeling for what the break-even point was? In other words,
how many files do you need before the depth 2 scheme beats a single flat
dir?
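For concreteness, here's the scheme as I read it, in Ruby. MD5 and the method name are my assumptions; you don't say which hash you used:

```ruby
require 'digest/md5'

# Hash the storage key, then use two 2-character chunks of the hex
# digest as nested subdirectory names, so files spread evenly across
# directories instead of piling up in one flat dir.
def store_path(key, root = 'store')
  digest = Digest::MD5.hexdigest(key)   # e.g. "abd45cc..."
  File.join(root, digest[0, 2], digest[2, 2], digest)
end
```

So a key whose digest starts with 'abd45cc' lands under store/ab/d4/.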

Playing around with this idea in fsdb, with 100_000 files, access times 
seem to get worse as depth varies from 0 to 2. Creation time was better 
at depth==1 than in either of the other cases.

I'm not at all confident in this benchmark, though, since it's likely 
to be so sensitive to many things. FWIW, this is on Linux 2.6.15, ext3 
with no special params, 1.7GHz Centrino, 40GB Fujitsu laptop drive, 
running at nice -n -20.
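In case anyone wants to reproduce, a micro-benchmark along these lines might look like the sketch below (N is small here; I used 100_000 above, and results will vary a lot with filesystem and disk):

```ruby
require 'benchmark'
require 'digest/md5'
require 'fileutils'
require 'tmpdir'

# Build a path with `depth` levels of 2-character subdirectories
# taken from the front of the digest (depth 0 = flat directory).
def path_for(root, digest, depth)
  dirs = (0...depth).map { |i| digest[2 * i, 2] }
  File.join(root, *dirs, digest)
end

N = 1_000
digests = (1..N).map { |i| Digest::MD5.hexdigest(i.to_s) }

[0, 1, 2].each do |depth|
  Dir.mktmpdir do |root|
    create = Benchmark.realtime do
      digests.each do |d|
        path = path_for(root, d, depth)
        FileUtils.mkdir_p(File.dirname(path))
        File.write(path, d)
      end
    end
    read = Benchmark.realtime do
      digests.each { |d| File.read(path_for(root, d, depth)) }
    end
    printf "depth %d: create %.3fs, read %.3fs\n", depth, create, read
  end
end
```

Note that timing in one process like this folds mkdir_p overhead into the create numbers, and the page cache will flatter the read numbers unless you drop caches between runs.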

-- 
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407