Steve's blog

The Words of the Sledge

    Friday, 09 July 2004

    MD5 considered harmful

    The CD/DVD creation process in debian-cd is very very slow for two reasons:

    1. the need to read and write CD- and DVD-sized lumps of data
    2. checksumming that data over and over and over...

    The first part is kind of unavoidable - to be able to make an ISO image, you have to actually read in all the data that will go into that image, and then write it out. To make this go faster, you simply need to supply good disk hardware - there's not really much that can be done algorithmically.

    The second part is the bit we can do something about. At the moment, the CD creation process includes:

    1. mirror check - check the MD5 sums of the files we're going to use against the Packages and Sources files
    2. calculate what will fit on each disk, and lay the files out
    3. apt-ftparchive - create the Packages/Sources files to go on each disk
    4. Bootable - set up the necessary magic to make a CD/DVD bootable, if applicable/possible/necessary
    5. md5sum.txt - list the checksum of every file on each disk in a file in the root directory of that disk
    6. make the image file
    7. generate jigdo files - compress the images by working out which files make up the image and replace those portions with file references instead

    In reverse order:

    Steps 6 & 7: I've already written JTE to make step 7 much faster: generate the jigdo files directly from mkisofs while we still have all the data we need (paths to each file), instead of having to work back from the image by brute force. This makes step 6 slightly slower, but the cost of md5summing data we're already reading and writing is not too bad.
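    To illustrate the idea behind steps 6 and 7 (this is not JTE's actual code, and not the real ISO or jigdo template format), here is a minimal Python sketch. It writes files into an image stream and, because it still knows each path at that moment, records where each file landed and its MD5 as it goes. The function name and the manifest layout are assumptions made up for this example.

```python
import hashlib


def write_image_with_manifest(paths, image, chunk_size=1024 * 1024):
    """Append each file to `image` and record (offset, size, md5, path).

    Hypothetical helper: a real ISO image also carries filesystem metadata
    and sector padding, which this sketch ignores.
    """
    manifest = []
    offset = 0
    for path in paths:
        digest = hashlib.md5()
        size = 0
        with open(path, "rb") as src:
            while chunk := src.read(chunk_size):
                # Checksum the data we are already reading and writing.
                digest.update(chunk)
                image.write(chunk)
                size += len(chunk)
        manifest.append((offset, size, digest.hexdigest(), path))
        offset += size
    return manifest
```

    The point is only that the file references come for free while the image is written, rather than being rediscovered afterwards by scanning the finished image.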

    Step 5: Phil Hands has modified debian-cd to use fast_sums to generate the md5sum.txt file. It uses the precalculated md5 sums from the mirror, rather than reading all the packages again.
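    As a rough sketch of that approach (not the real fast_sums script), the Python below pulls the `Filename` and `MD5sum` fields out of a Packages file and emits md5sum-style lines without opening a single package. The function name, the `./` prefix on paths and the sample stanzas (including their checksums) are made up for illustration.

```python
def md5sum_lines_from_packages(packages_text):
    """Yield 'md5  ./path' lines using checksums already in a Packages file."""
    for stanza in packages_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Skip continuation lines (leading whitespace) and anything
            # that is not a 'Key: value' field.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Filename" in fields and "MD5sum" in fields:
            yield f"{fields['MD5sum']}  ./{fields['Filename']}"


# Made-up sample data; the checksums are placeholders, not real sums.
SAMPLE = """\
Package: hello
Filename: pool/main/h/hello/hello_2.1.1-4_i386.deb
MD5sum: 0123456789abcdef0123456789abcdef

Package: sed
Filename: pool/main/s/sed/sed_4.1.2-1_i386.deb
MD5sum: fedcba9876543210fedcba9876543210
"""

if __name__ == "__main__":
    for line in md5sum_lines_from_packages(SAMPLE):
        print(line)
```

    Source packages would need separate handling, since a Sources file lists several files per stanza under a multi-line `Files` field.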

    Step 4: Making disks bootable is normally trivial and takes almost no time, so it can be ignored.

    Step 3: apt-ftparchive currently generates all the md5sums from all the files it will place into a Packages file.

    Step 2: working out what files will fit where and creating the CD trees is also reasonably quick these days. Even "copying" the data into place is fast, as we can simply create trees of hard links rather than actually copy the data.
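    The hard-link trick from step 2 looks roughly like this in Python. It is a sketch only: debian-cd does not use this code, the function and argument names are invented here, and hard links only work when the mirror and the CD tree live on the same filesystem.

```python
import os


def link_tree(mirror_root, disc_root, relative_paths):
    """Populate disc_root with hard links to files under mirror_root."""
    for rel in relative_paths:
        src = os.path.join(mirror_root, rel)
        dst = os.path.join(disc_root, rel)
        # Create the directory structure, then link instead of copying:
        # no file data is read or written, only a directory entry is added.
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        os.link(src, dst)
```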

    Step 1: the mirror check is the next thing I'm looking at for a performance gain. It's necessary for release builds, to make sure that the packages and sources that go on the CDs and DVDs exactly match the checksums listed in the Packages and Sources files and haven't been corrupted in transit. However, this step takes so long that many people disable it when running debian-cd.

    What I've done is to move the md5 check to later in the process. My JTE patch already pushes steps 6 and 7 together into one stage and also calculates md5 sums as it goes. The obvious change to make is to check the files at that point. Instead of checking the mirror up-front, simply build a list of files and md5sums and feed that to mkisofs so it can do the work, almost for free. If any files fail to match when we're building the image, fail at that point. I've written support for this, and it will be in JTE 1.6, coming Real Soon Now (TM).
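    The "check it while we're writing it anyway" idea can be sketched in a few lines of Python. This is not the JTE or mkisofs implementation (that is C inside mkisofs); the function name, the exception and the chunk size are assumptions for illustration.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when a file's data does not match its expected MD5."""


def copy_and_verify(src, dst, expected_md5, chunk_size=1024 * 1024):
    """Copy src to dst, failing if the data's MD5 is not expected_md5."""
    digest = hashlib.md5()
    while chunk := src.read(chunk_size):
        # The data passes through here once; hashing it adds no extra I/O.
        digest.update(chunk)
        dst.write(chunk)
    actual = digest.hexdigest()
    if actual != expected_md5:
        raise ChecksumMismatch(f"expected {expected_md5}, got {actual}")
    return actual
```

    Note the trade-off: a bad file is only detected after its data has been written, so the image build has to be abandoned at that point rather than being refused up front.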

    I'm not sure how to progress JTE further - it clearly needs packaging, but that will probably involve forking mkisofs. Joerg is infamously difficult to please in terms of accepting patches for cdrtools, and the current mkisofs maintainers haven't responded to my mails about JTE AT ALL.

    In other news, I'm about to commit a debian-cd change to fix the problem I've been seeing of HFS hybrid discs (powerpc and m68k) being too big.

    15:32 :: # :: /debian/JTE :: 1 comment