Search
Documents
README
SimpleCDB - Perl-only Constant Database (Displayed)
|
SimpleCDB - Perl-only Constant Database
SimpleCDB - Perl-only Constant Database
use SimpleCDB;
# writer
# - tie blocks until DB is available (exclusive), or timeout
tie %h, 'SimpleCDB', 'db', O_WRONLY
or die "tie failed: $SimpleCDB::ERROR\n";
$h{$k} = $v;
die "store: $SimpleCDB::ERROR" if $SimpleCDB::ERROR;
untie %h; # release DB (exclusive) lock
# reader
# - tie blocks until DB is available (shared), or timeout
tie %h, 'SimpleCDB', 'db', O_RDONLY
or die "tie failed: $SimpleCDB::ERROR\n";
$v = $h{$i};
die "fetch: $SimpleCDB::ERROR" if $SimpleCDB::ERROR;
untie %h; # release DB (shared) lock
This is a simple perl-only DB intended for constant DB applications. A
constant DB is one which, once created, is only ever read from (though
this implementation allows appending of new data). That is, this is an
``append-only DB'' - records may only be added and/or extracted.
Course-grained locking provided to allow multiple users, as per flock
semantics (i.e. write access requires an exclusive lock, read access needs
a shared lock (see notes below re. perl < 5.004)). As (exclusive) updates
may be take some time to complete, shared lock attempts will timeout after
a defined waiting period (returning $! == EWOULDBLOCK). Concurrent update
attempts will behave similarly, but with a longer timeout.
The DB files are simple flat files, with one record per line. Records
(both keys and values) may be arbitrary (binary) data. Records are
extracted from these files via a plain linear search. Unsurprisingly,
this search is a relatively inefficient operation. To improve extraction
speed, records are randomly distributed across N files, with the average
search space is reduced by 1/N compared to a single file. (See below for
some example performance times.) One advantage of this flat file based
solution is that the DB is human readable (assuming the data is), and
with some care can be edited with a plain ol' text editor.
Finally, note that this DB does not support duplicate entries. In practice,
the first record found matching a given key is returned, any duplicates
will be ignored.
I needed to extract single records from a 20k-40k record data set, within
at most 5 seconds on an old Sun 4/40, to feed to an interactive voice
response system. Fine, I thought, an easy job for any old DBM.
Unfortunately, all of the standard ``system'' DBMs (NBDM, SDBM, ODBM) are
broken when it comes to ``large'' data sets (though I don't generally call
20,000 records ``large'') - at least on Solaris 2.5.1 and Solaris
2.6 machines. I found after inserting some 15k records: NDBM dies; SDBM
and ODBM silently ``lose'' data (you can't extract records which you know
you inserted). On an HPUX machine, it took nearer to 100,000 records to
break [NSO]DBM. All worked flawlessly on a Linux 2.2.16 machine. The
program examples/testdbm.pl can be used to exercise the various DBMs.
BerkeleyDB (DB_File) and GDBM work well, but they don't come standard with
many boxes. Further, this package was originally written for an old Solaris
2.5 box which lacked development tools (and the space and management will
to install such tools) to build a ``real'' DB.
And besides, I hadn't played with perl's tie mechanism before...
This modules uses the tie interface, as for DB_File.
The default Fcntl exports are re-exported by default, primarily for the
LOCK_ constants.
There are two public class variables:
$SimpleCDB::DEBUG turn on some debugging messages
$SimpleCDB::ERROR error message for last operation, empty if no error
has ocurred
n/a
It seems not all environments have POSIX::EWOULDBLOCK defined, in which
case this module defines it as a subroutine constant.
This DB may use a significant number of file descriptors, you may want
to increase the user/system resource limits for better performance
(e.g. ulimit -S -n 300 on a Solaris box). My test programs on Solaris don't
seem to want to open more than 256 files at a time, that is even with the
ulimit set to 300, I got EMFILE results as soon as I reached 256 open
file descriptors. This means there is still some file closing/opening
going on... Interestingly, on the non-exhaustive, not-terribly-thorough
testing I did, I noticed that using a smaller number of files gave slightly
better performance wrt. creating the DB. e.g. with ulimit set as above,
over two runs of each on an old Sparc:
nfiles = 256 : real,user,sys time = 3:20, 2:43, 0:01
nfiles = 16 : real,user,sys time = 3:00, 2:40, 0:00
Perhaps this is due to file caching?
Speaking of performance, I used Devel::DProf to find that using crypt to
generate the digest is a real bottleneck (75% CPU time was in generating
the digest :-) Using MD5 reduces this to around 6% (only half of which
is in MD5)! My homebrew digest is not nearly as good as MD5 (both in CPU
and quality), but it more or less does the job when MD5 isn't available.
Here's how it runs:
Records are between 0 and 100 bytes each. Times are user/sys/real, as
either minutes:seconds or just seconds. wr = create the db with the
stated number of records. rd = read one record (next to last inserted, i.e.
should be about the worst case).
Sun 4/40 (sun4c), Solaris 2.5
40,000 records, nfiles = 1: wr =14:48/29/14:03 rd = 26/2.2/28
40,000 records, nfiles = 16: wr =14:04/30/15:15 rd = 4/1.0/ 6
40,000 records, nfiles = 256: wr =14:33/34/15:32 rd = 3/0.8/ 4
i.e. sloooowwwww to build, good enough on extraction.
Sun Ultra/1 (sun4u), Solaris 2.6 (SCSI disks)
40,000 records, nfiles = 1: wr = 53/2.8/57 rd = 3.0/0.1/3.4
40,000 records, nfiles = 16: wr = 53/2.4/57 rd = 0.5/0.0/0.6
40,000 records, nfiles = 256: wr =196/19/240 rd = 0.3/0.0/0.4
x86 C433/64MB, IDE ATA/66 disk, Linux 2.2.16
40,000 records, nfiles = 1: wr = 18/0.7/19 rd = 1.3/0.0/1.4
40,000 records, nfiles = 16: wr = 18/0.8/19 rd = 0.3/0.0/0.3
40,000 records, nfiles = 256: wr = 18/0.8/20 rd = 0.2/0.0/0.2
100,000 records, nfiles = 1: wr = 47/2.0/49 rd = 3.1/0.1/3.2
100,000 records, nfiles = 16: wr = 47/2.0/49 rd = 0.4/0.0/0.4
100,000 records, nfiles = 256: wr = 47/2.2/49 rd = 0.2/0.0/0.2
i.e. I think the o/s is caching the whole bloody lot :-)
Clearly, other overheads limit the benefit of the distributed file hashing,
however the result is useful for my purposes...
The important thing is, it works (as opposed to NDBM and friends).
while (each %h) always shows as much data as you put in :-)
Possibly, though it works for me :-)
I've noted that on a HP-UX B.10.20 box that ALRMs don't seem to trigger
when I expect they should. For example, in _lock, I set alarm for say
5s and then call flock, which I expect to block until either the lock is
granted *or* the alarm goes off (as happens for Solaris and Linux). However,
it is as if the HP-UX box's ALRM signal is delayed until the flock
returns. HP-UX doesn't have a flock call, but does support lockf (which
can lock regions of a file), so perhaps this behaviour is an artefact
of perl's flock emulation...
Copyright (c) 2000 Benjamin Low <b.d.low@ieee.org>.
All rights reserved.
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
Artistic License for more details.
Written as a last resort, and as an excuse to write my first tied module,
by Benjamin Low <b.d.low@ieee.org>, July 2000.
Dan Berstein has a nice constant DB implementation, written in C, at
http://cr.yp.to/cdb.html
If you've a C compiler handy I recommend this library over SimpleCDB.
If you want a read+write DB, go for GDBM - it doesn't support fine-grained
locking but does actually work.
Information
|
This site is currently in testing, it is not yet operating using the full database. Until it is officially launched you may wish to visit Help-Site Computer Manuals. After launch, this site (HelpSpy) will replace Help-Site. Information about the spider which is currently trawling the Internet looking for links to add to this directory can be found here. |
|