Steve's blog
   


The Words of the Sledge
steve@einval.com

    Sunday, 08 August 2004

    Bandwidth problems with large Packages and Sources files

    I've heard several complaints that users on modems and other slow connections struggle to keep up; even tracking security updates can take a while. Whenever there is a security update, the client machine has to download the entire Packages file again. This problem will get worse as time goes on and the number of packages grows. And for people tracking testing or unstable over a modem (such people do exist!), it already takes ages just to sync the Packages files, let alone to download and install the new packages they want.

    How do other people do this?

    Microsoft have a central pool of servers for Windows updates that keeps a database of updates. Each client machine connects to an update server and checks for any updates that have not yet been installed on the client. This works for Microsoft, but they have to maintain a huge central server pool that is hammered constantly by millions of client machines scattered across the world.

    We could do something similar too. We'd need to write a new application/server/cgi/something to run on security.debian.org and modify apt and friends to talk to that program. This could be done, but it's reinventing the wheel. We ask people not to mirror the security site, but it happens anyway; people would not be able to use the mirrors for this service unless those mirrors have the same program installed. That's a problem.

    If we want the mirrors to be able to work, we need to push out the information in a standard form (files/directories) that will propagate easily to mirrors via existing channels: HTTP/FTP/rsync/whatever. We need to keep some state over time so that client machines can compare timestamps on the information they already have and then only retrieve the changes to get them up to current state. If the client does not have any state, or if its idea of state is too old, then this should be quickly recognised and the client should download all current state; we don't want to slow these users down any more.

    Various people have discussed ways to do this in the past. Suggestions have included providing periodic (e.g. daily) diffs of the Packages files that clients can download. These have never really taken off.

    There is a much simpler solution to the problem, found after some discussion at the UKUUG Conference 2004. Apt and dpkg already cope with Packages/Sources stanzas containing extra fields that they do not understand; they simply ignore the extra fields. Equally, they do not care about the sort order of the files.

    My proposed way to solve the problem is:

    • add a Timestamp: field to each entry (a simple time_t would be easy) for the date/time that that entry was first added to the file
    • sort the entries in the Packages and Sources files on that timestamp, with the most recent first
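
    As a rough illustration of the two steps above, here is a minimal Python sketch of the archive side. It is not how Katie works; the function names, the (fields, first-added time) input shape and the plain integer time_t are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Stanza = Dict[str, str]


def format_stanza(fields: Stanza) -> str:
    """Render one stanza as 'Key: value' lines (hypothetical helper)."""
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


def render_index(entries: List[Tuple[Stanza, int]]) -> str:
    """Build a Packages-style file with a Timestamp: field on every entry,
    sorted with the most recently added entry first.

    `entries` pairs each stanza with the time_t at which it was first added;
    where that time comes from (e.g. the archive database) is assumed.
    """
    stamped = []
    for fields, added in entries:
        stanza = dict(fields)            # copy, so the caller's data is untouched
        stanza["Timestamp"] = str(added)
        stamped.append((added, stanza))
    stamped.sort(key=lambda pair: pair[0], reverse=True)  # newest first
    # Stanzas are separated by a single blank line, as in a normal Packages file.
    return "\n".join(format_stanza(stanza) for _, stanza in stamped)


if __name__ == "__main__":
    print(render_index([
        ({"Package": "foo", "Version": "1.0"}, 100),
        ({"Package": "bar", "Version": "2.0"}, 200),
    ]))
```

    Existing tools should simply skip the unknown Timestamp: field, so the output stays readable by clients that know nothing about this scheme.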

    This way, clients can simply start downloading the new version of each file and stop as soon as they reach an entry whose timestamp is older than the most recent timestamp in the version they last downloaded. If they have no older version, or their old version is ancient, they will just end up downloading the entire file this time. The client program doing the download can then merge the old file version with the new information. It would even be possible to create a normal standard-format Packages/Sources file from the merged data if that is still wanted.
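
    The client side could look something like the sketch below. It is only an illustration: the function names are made up, entries are keyed on the Package field alone, and the "download" is any line-by-line text stream. Because the parser is lazy, breaking out of the loop means the rest of the stream is never read.

```python
import io
from typing import Dict, Iterable, Iterator, List

Stanza = Dict[str, str]


def iter_stanzas(lines: Iterable[str]) -> Iterator[Stanza]:
    """Lazily parse blank-line-separated 'Key: value' stanzas."""
    fields: Stanza = {}
    last_key = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():                      # blank line ends a stanza
            if fields:
                yield fields
            fields, last_key = {}, None
        elif line[0] in " \t" and last_key:       # continuation of previous field
            fields[last_key] += "\n" + line
        else:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
            last_key = key
    if fields:
        yield fields


def read_new_entries(stream: Iterable[str], last_seen: int) -> List[Stanza]:
    """Collect entries until one is older than the newest timestamp we hold.

    Entries equal to `last_seen` are read again; merging them twice is harmless.
    """
    new = []
    for stanza in iter_stanzas(stream):
        if int(stanza["Timestamp"]) < last_seen:
            break                                 # everything below is already known
        new.append(stanza)
    return new


def merge(old: Dict[str, Stanza], new: List[Stanza]) -> Dict[str, Stanza]:
    """Overlay new entries on the old state, keyed by package name (assumption)."""
    merged = dict(old)
    for stanza in reversed(new):                  # oldest first, so the newest wins
        merged[stanza["Package"]] = stanza
    return merged


if __name__ == "__main__":
    remote = io.StringIO(
        "Package: new\nVersion: 2\nTimestamp: 300\n\n"
        "Package: foo\nVersion: 1.1\nTimestamp: 200\n\n"
        "Package: old\nVersion: 1\nTimestamp: 100\n"
    )
    have = {"foo": {"Package": "foo", "Version": "1.0", "Timestamp": "150"}}
    print(merge(have, read_new_entries(remote, last_seen=150)))
```

    A client with no previous state would pass a last_seen of 0 and end up reading the whole file, which matches the fallback described above.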

    One issue that this does not cope with is removed packages and sources. There is an easy way to do that too: add a new small stanza for a binary/source package with a new Removed-time: field. When the client sees this stanza, it will know to remove older information about that package/source from its merged output.
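
    Handling removals during the merge might be sketched as follows. The post does not say exactly what a removal stanza contains or where it sorts, so this assumes it carries just Package: and Removed-time:, and that Removed-time: takes the place of Timestamp: when ordering; the function names are hypothetical.

```python
from typing import Dict, List

Stanza = Dict[str, str]


def entry_time(stanza: Stanza) -> int:
    """Time used for ordering: Removed-time for removal stanzas (assumption),
    Timestamp for ordinary entries."""
    if "Removed-time" in stanza:
        return int(stanza["Removed-time"])
    return int(stanza["Timestamp"])


def apply_updates(old: Dict[str, Stanza], updates: List[Stanza]) -> Dict[str, Stanza]:
    """Merge newly downloaded stanzas into the old state, honouring removals."""
    merged = dict(old)
    # Replay oldest first so a package removed and later re-added ends up present.
    for stanza in sorted(updates, key=entry_time):
        name = stanza["Package"]
        if "Removed-time" in stanza:
            merged.pop(name, None)       # drop older information about the package
        else:
            merged[name] = stanza
    return merged


if __name__ == "__main__":
    state = {
        "foo": {"Package": "foo", "Version": "1.0", "Timestamp": "100"},
        "bar": {"Package": "bar", "Version": "2.0", "Timestamp": "120"},
    }
    updates = [
        {"Package": "foo", "Removed-time": "300"},
        {"Package": "baz", "Version": "0.1", "Timestamp": "250"},
    ]
    print(sorted(apply_updates(state, updates)))
```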

    Creating the new Packages/Sources files should (I hope) be easy; in the main archive, Katie already uses a database backend when processing packages so dumping timestamps should take little effort.

    Comments? I'm sure I must have missed something here, but I can't see any holes...

    Thanks to Phil, Wookey and Martin for ideas and discussion.

    11:17 :: # :: /debian/issues :: 3 comments