[Cialug] Efficiently removing the beginning of a file
Jeffrey C. Ollie
jeff at ocjtech.us
Mon May 21 12:10:58 CDT 2007
On Mon, 2007-05-21 at 11:04 -0500, Daniel A. Ramaley wrote:
> I have a ~70 MB file. The first 3635 bytes need to be removed. What is
> the most efficient way to do that? I did this, knowing it would work
> but would be slow:
> $ dd if=inputfile of=outputfile ibs=1 obs=1M skip=3635
> It did indeed work. But it took 274 seconds (and pegged the CPU the
> entire time), whereas simply copying the file with cp only takes 2
> seconds. Since what i want to do is not *that* different an action from
> just copying the file (at least in terms of the minimum disk operations
> that would be required), it seems to me that there should be a way to
> do it that only takes ~2 seconds. What are some other command line ways
> to do this that would be more efficient?
>
> Actually, before hitting send i tried another test, just flipping the
> "ibs" and "skip" values:
> $ dd if=inputfile of=outputfile2 ibs=3635 obs=1M skip=1
> That only took 2.5 seconds, which is much closer to the theoretical 2
> second time that should be possible. But i guess what i'm curious about
> is the general problem; if there is a large file and you need to remove
> some small number of bytes from the beginning of it, how is that best
> accomplished? If i had needed to remove only 1 byte for example, i
> would have had to have used "ibs=1 skip=1" which would have taken
> around 274 seconds again.
This has everything to do with your input block size. With an input
block size of 1, dd has to issue a separate read for every single byte,
which is a lot of work (my CPU was pegged during this test):
$ time dd if=test-in of=test-out ibs=1 obs=10M skip=3635
73396685+0 records in
6+1 records out
73396685 bytes (73 MB) copied, 75.1107 seconds, 977 kB/s
real 1m15.158s
user 0m16.941s
sys 0m55.442s
Switching dd to use an input block size of 3635 makes things go a LOT
faster:
$ time dd if=test-in of=test-out ibs=3635 obs=10M skip=1
20191+1 records in
6+1 records out
73396685 bytes (73 MB) copied, 0.316958 seconds, 232 MB/s
real 0m0.365s
user 0m0.050s
sys 0m0.296s
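To put rough numbers on the difference, here is a small sketch that
counts how many input blocks dd has to process for a given ibs (roughly
one read per block). The total size is the byte count from the dd output
above plus the 3635 skipped bytes; the helper name input_blocks is made
up for illustration.

```python
def input_blocks(total_bytes, ibs):
    """Return (full, partial) input block counts for a file of total_bytes."""
    full, rest = divmod(total_bytes, ibs)
    return full, (1 if rest else 0)


# Bytes copied (taken from the dd output above) plus the 3635 bytes skipped.
total = 73396685 + 3635

for ibs in (1, 3635):
    full, partial = input_blocks(total, ibs)
    print("ibs=%d: %d+%d blocks" % (ibs, full, partial))
```

For ibs=3635 this comes to 20192 full blocks plus one partial block,
which lines up with the "20191+1 records in" above once the one skipped
block is subtracted. For ibs=1 it is over 73 million blocks.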
With a little scripting in Python this can be done a bit more generally:
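import time

s = time.time()
# Open in binary mode so the bytes are copied unchanged.
with open('test-in', 'rb') as i, open('test-out', 'wb') as o:
    i.seek(3635)  # skip the bytes to be removed
    while True:
        data = i.read(10 * 1024 * 1024)  # read in 10 MB chunks
        if not data:  # an empty read means end of file
            break
        o.write(data)
e = time.time()
print(e - s)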
$ python test.py
0.329372882843
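If you need this more than once, the same idea can be wrapped in a
function. This is only a sketch: the name strip_prefix and its
parameters are invented here, and it leans on shutil.copyfileobj from
the standard library to do the chunked copy instead of a hand-written
loop.

```python
import shutil


def strip_prefix(src_path, dst_path, n_bytes, chunk_size=10 * 1024 * 1024):
    """Copy src_path to dst_path, leaving out the first n_bytes."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        src.seek(n_bytes)  # jump past the bytes to drop
        # Copy the remainder in chunk_size pieces.
        shutil.copyfileobj(src, dst, chunk_size)
```

For the case in this thread the call would be
strip_prefix('test-in', 'test-out', 3635). Since the seek does not
depend on the block size, removing a single byte should cost about the
same as removing 3635.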
Another factor to consider is how much RAM you have. The timings above
were taken after a number of runs, so much of the data would already
have been cached in RAM.
Jeff