Debian Planet










Welcome to Debian Planet


Apt-get into it.

  • Debian

    These are important Debian sites one should not be without!

  • Official Debian site

  • Package search

  • Mailing list archives

  • Bug reports

  • Debian on CD

  • Unofficial woody CD ISOs

  • Unofficial APT sources

  • Developers' Corner

    Other great Debian news sources:

  • Debian Weekly News

  • Kernel Cousin Debian

    (Debian mailing lists digested)
  • Community Groups

    Need help? You're not alone on this planet.

  • debianHELP

    (User support site)

  • Debian International

  • DebianForum.de

    (Deutsch)

  • EsDebian

    (español)

  • DebianWorld

    (français)

  • MaximumDebian

    (Italiano)

  • DebianUsers

    (Korean)

  • Debian-BR

    (Português)

  • IRC

    The place to get help on a Debian problem (after reading docs) or to just chat and chill is #debian on irc.debian.org.

    Many of the Debian Planet staff live there so pop by and say hello.

    Wanna write?

    Got the latest or greatest scoop? Perhaps you have some important news for the Debian community? Submit a news item!

    Or perhaps you've written a rather groundbreaking insight into some aspect of Debian and feel compelled to share it with others? Knock up a longer editorial article and send it to the editors.

    Sponsorship

    DP is sponsored by Xinit Systems and kieser.net.

    Domains paid for and hosted by uklinux.net.

    Buy your Debian merchandise at DebianShop.com.

    apt-get update - why not rsync?
    Contributed by Anonymous on Friday, November 30 @ 05:38:12 GMT

    Ask Debianplanet
    Whilst sitting there waiting for 'apt-get update' to finish the other night, something occurred to me: why doesn't apt-get use rsync to fetch its package list?
    Most of the packages don't change between updates, so doing this could save quite a bit of time and bandwidth.

    There are already a lot of rsync mirrors for the CD images, so why not for the package list?

    DanielS: I don't see why not, now that gzip has been hacked to be rsyncable.
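
    For the curious, the 'rsyncable' hack works, roughly, by restarting the compressor at content-defined boundaries, so an edit only disturbs the compressed output near that edit. A toy sketch of the boundary idea in Python (the window size and trigger are illustrative, not the values the real patch uses):

        import random

        WINDOW = 4096   # illustrative; the real patch has its own window and trigger

        def reset_points(data):
            """Offsets where a compressor could flush and restart its state."""
            points, rolling = [], 0
            for i, byte in enumerate(data):
                rolling += byte
                if i >= WINDOW:
                    rolling -= data[i - WINDOW]
                # The trigger depends only on the last WINDOW bytes, so an edit
                # near the start of the file stops shifting later boundaries
                # once the window has moved past it.
                if i >= WINDOW and rolling % WINDOW == 0:
                    points.append(i + 1)
            return points

        random.seed(0)
        a = bytes(random.randrange(256) for _ in range(200_000))
        prefix = b"a small insertion at the front"
        b = prefix + a

        # Boundaries well past the insertion are simply shifted by len(prefix),
        # which is what lets rsync match most of the compressed blocks again.
        print(reset_points(a)[-3:])
        print([p - len(prefix) for p in reset_points(b)[-3:]])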

     
    Related Links

  • More about Ask Debianplanet
  • News by DanielS

    Most read story about Ask Debianplanet:
    XFree86 4.2.0

    "apt-get update - why not rsync?" | 22 comments

    Packages.gz is an imperfect approach (Score: 2, Informative)
    by Anonymous on Friday, November 30 @ 09:15:24 GMT

    I think the real problem is that there is a lot of unnecessary information in Packages.gz if you just want to check for updates.

    The uncompressed Packages file is 5.9 MB (1.6 MB compressed); about 3 MB of that is just the descriptions, and it's only going to grow.

    If the user is running stable or otherwise hardly ever updates, this isn't an issue, but for testing or unstable most of the downloaded data is the same. Less than 1% of packages change between most people's updates, yet the duplicate 99% is downloaded every time.

    I think there should be a package index file (similar in format to the override file) which just has the package name, version and revision of every package in the dist; it would end up being only a couple of hundred kB.
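
    As a rough illustration of that index idea (the stanza parsing is simplified and the file names are made up; this is not an existing apt feature), something like this would boil Packages.gz down to a name/version list:

        import gzip

        def build_index(packages_gz_path, index_path):
            """Write one "name version" line per stanza of a Packages file."""
            name = version = None
            with gzip.open(packages_gz_path, "rt", errors="replace") as src, \
                 open(index_path, "w") as dst:
                for line in src:
                    if line.startswith("Package:"):
                        name = line.split(":", 1)[1].strip()
                    elif line.startswith("Version:"):
                        version = line.split(":", 1)[1].strip()
                    elif not line.strip():
                        if name and version:
                            dst.write(f"{name} {version}\n")   # end of stanza
                        name = version = None
                if name and version:                            # file may not end blank
                    dst.write(f"{name} {version}\n")

        # e.g. build_index("Packages.gz", "Packages.index")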

    Complementary to this, the metadata for each BINARY package could be merged into the .dsc file, which would then only have to be downloaded when it changed (not every time). The revision numbering scheme would probably have to be reworked to accommodate a changing .dsc file, though.

    A package daemon would be cool, I think; that way it could be queried and just spit out new/updated descriptions instead of the user downloading so much duplicate information every time.

    Anyway, this is all too radical to be practical at the moment, but that's how I see it.

    Rsync is an issue which has been discussed at length on the debian-devel mailing list. It's too CPU-intensive; using xdelta to do a binary diff against "milestone" Packages.gz files would be a better way to do it in a traditional way.



    Re: Packages.gz is an imperfect approach (Score: 0)
    by Anonymous on Friday, November 30 @ 17:54:43 GMT

    I've got a stupid idea that may (or may not) work:

    Using patches for Packages.gz and the rest.

    First we need to add a version number (maybe a date in ISO format) to the file.

    We find out the differences and create a patch, which is then compressed to get the size down.

    apt-get has to be modified so that it either sends version information about its Packages.gz or retrieves such information from the server. The latter is probably easier (go to the default directory and get all the patches named *date-I-want*, or get the latest version plus patches).

    Every so often there would be a full release (kind of like the kernel's), and new patches would be based on that one, with the old ones left there for people who do not want to upgrade to the latest and greatest.
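
    A tiny sketch of what the client's side of that scheme might look like (the file-naming convention and the ISO-date versioning are assumptions made up for this example):

        def plan_download(local_version, newest_full, patch_dates):
            """Which files to fetch, oldest first.  Versions are ISO dates,
            so plain string comparison orders them correctly."""
            if local_version is None or local_version < newest_full:
                # Our copy predates the newest full release: take the full
                # file plus every patch built on top of it.
                return ([f"Packages.{newest_full}"]
                        + [f"Packages.{d}.diff" for d in patch_dates])
            # Otherwise fetch only the patches newer than what we already have.
            return [f"Packages.{d}.diff" for d in patch_dates if d > local_version]

        # We hold the 2001-12-03 state; the last full release was 2001-12-01.
        print(plan_download("2001-12-03", "2001-12-01",
                            ["2001-12-02", "2001-12-03", "2001-12-04"]))
        # -> ['Packages.2001-12-04.diff']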

    Used with the package pools it may also provide a way to upgrade the system to a known state (let's say last week program "Y" worked and today it doesn't; I really do need "Y", so I set my system back to last week's state).

    I hope this sounds clear (it probably doesn't), but it is an idea.

    If I get the idea completely clear in my mind I will try to patch apt (and die trying :P).

    See ya

    Gabriel



    Re: Packages.gz is an imperfect approach (Score: 1)
    by Shanep (shanep@_DEAD_ANIMAL_ign.com.au) on Tuesday, December 04 @ 06:01:19 GMT
    (User Info)

    Rsync is an issue which has been discussed at length on the debian-devel mailing list. It's too CPU-intensive

    How CPU-intensive? Would it not be worthwhile upgrading server CPUs and RAM to get long-term savings on bandwidth usage?

    I recently purchased a 3-CD set of 2.2r4 in Sydney and got home to find, to my great disappointment, that CD 1 was corrupt. This has happened to me a few times before with this vendor, but anyway, I made an image of the corrupt CD 1 with dd, ignoring errors, and then corrected the image with rsync against a local server. Across my 56k modem I was amazed how little time it took to correct, especially considering how bad the disc corruption seemed to be (from about 3/4 of the way in, seemingly most of the rest of the disc was bung).

    How many clients would be doable on a high-end server? Can rsync be set up to re-use past hash data, or whatever it generates, so it gets its performance from look-up tables instead of working it all out every time?

    Surely those tables would also be much smaller than the data they represent? Is rsync even an option alongside http and ftp within apt? If it were, and I had the money, I'd be willing to set up an rsync-only mirror for Sydney.



    Re: Packages.gz is an imperfect approach (Score: 1)
    by caf on Tuesday, December 04 @ 06:57:47 GMT
    (User Info)

    Re-using past checksum data on the sending side is not feasible with the current rsync protocol. The sending side does a rolling checksum through the file - the total amount of checksum data generated is 4 times the size of the source file.
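
    For reference, a minimal Python sketch of the weak rolling checksum that makes scanning every offset cheap (constants simplified; the stronger per-block MD4 sum that rsync pairs with it is omitted):

        M = 1 << 16      # modulus; simplified from what rsync actually uses

        def weak_checksum(block):
            a = sum(block) % M
            b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
            return a, b

        def roll(a, b, out_byte, in_byte, block_len):
            """Slide the window one byte: drop out_byte, bring in in_byte."""
            a = (a - out_byte + in_byte) % M
            b = (b - block_len * out_byte + a) % M
            return a, b

        data = b"the quick brown fox jumps over the lazy dog" * 10
        n = 16
        a, b = weak_checksum(data[:n])
        for k in range(len(data) - n):
            # Sliding the window costs O(1), so the sender can test every
            # byte offset against the receiver's block checksums cheaply.
            a, b = roll(a, b, data[k], data[k + n], n)
            assert (a, b) == weak_checksum(data[k + 1:k + 1 + n])
        print("rolled checksum agrees with recomputation at every offset")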

    See the rsync technical report by the authors.

    - caf.



    Re: Packages.gz is an imperfect approach (Score: 1)
    by Shanep (shanep@_DEAD_ANIMAL_ign.com.au) on Friday, December 07 @ 13:01:20 GMT
    (User Info)

    Thanks caf,

    I'm reading through that link, interesting stuff.

    I have been thinking of an rsync type of idea since back in the days when I would use ZMODEM to continue and repair downloads. I've been wondering whether it would be comparable in speed and quality to rsync...

    Basically it consists of arranging a data file in a square 2D array, then checksumming each row and column (since each byte contributes to two checksums, I guess a 16-bit checksum would be fine) and comparing with the other end. Any difference should show up in at least one row and one column checksum, which could allow at least that one byte to be re-sent.

    A 650 MB CD-ROM image could be represented with just a little over 100 kB. In addition, a before-and-after MD5 check of the whole file could determine whether this repair pass is needed and whether it succeeded.

    This checksum data could be saved alongside the file as file.checksum or whatever, and sent as required.
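
    A small sketch of that row-and-column scheme (the row width, the padding and the use of a truncated CRC as the 16-bit sum are illustrative choices, not a worked-out protocol):

        import zlib

        def grid_checksums(data, width=25_000):
            """Checksum every row and every column of the data laid out as a
            width-byte-wide matrix.  Cost is roughly (rows + columns) sums."""
            rows = [data[i:i + width] for i in range(0, len(data), width)]
            if rows and len(rows[-1]) < width:
                rows[-1] += b"\0" * (width - len(rows[-1]))     # pad the last row
            row_sums = [zlib.crc32(r) & 0xFFFF for r in rows]
            col_sums = [zlib.crc32(bytes(r[c] for r in rows)) & 0xFFFF
                        for c in range(width)]
            return row_sums, col_sums

        def suspects(local, remote):
            """Cells to re-send: wherever a disagreeing row meets a disagreeing column."""
            bad_rows = [i for i, (x, y) in enumerate(zip(local[0], remote[0])) if x != y]
            bad_cols = [j for j, (x, y) in enumerate(zip(local[1], remote[1])) if x != y]
            return [(r, c) for r in bad_rows for c in bad_cols]

    For a roughly square layout of a ~650 MB image (about 25,500 bytes per side) that is about 2 x 25,500 sixteen-bit sums, i.e. on the order of 100 kB, in line with the estimate above.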

    Does this sound too simple?



    Re: Packages.gz is an imperfect approach (Score: 1)
    by caf on Friday, December 07 @ 13:42:01 GMT
    (User Info)

    Well... as with most programming problems, it depends on your problem set. A solution similar to yours would do well in the case where you are expecting only a small number of errors in a transfer, relative to the transfer size - allowing it to resend only the errored block if only one block is in error.

    However, it gets worse as the number of blocks in error increases - with 2 blocks in error, it would require at worst 4 blocks to be resent.

    Rsync will also do a lot better than your algorithm in the case where data needs to be inserted at the beginning of the file - in your case, this would cause all the checksums 'below' this point in the matrix to be wrong, so all these blocks would have to be resent.

    - caf.



    Re: Packages.gz is an imperfect approach (Score: 1)
    by Shanep (shanep@_DEAD_ANIMAL_ign.com.au) on Friday, December 07 @ 16:40:24 GMT
    (User Info)

    Hi caf,

    A solution similar to yours would do well in the case where you are expecting only a small number of errors in a transfer, relative to the transfer size - allowing it to resend only the errored block if only one block is in error.

    OK, why is that?

    However, it gets worse as the number of blocks in error increases - with 2 blocks in error, it would require at worst 4 blocks to be resent.

    How is that? If I have a matrix of 10x10 bytes (or blocks) represented as A0 - J9:

    With B3 and C4 in error, checksums B, C, 3 and 4 would be different and thus identify just those in error.

    With B3 and B4 in error, checksums B, 3 and 4 would be different and also only identify those in error.

    I can, however, see the problem of false positives coming up: for example, if B3, C3 and C4 were really in error, the checksum data would also be interpreted as B4 being in error, when it might not be (or any combination of this example).

    Rsync will also do a lot better than your algorithm in the case where data needs to be inserted at the beginning of the file - in your case, this would cause all the checksums 'below' this point in the matrix to be wrong, so all these blocks would have to be resent.

    Why is that? If A0 and B0 were changed on the server, then checksums A, B and 0 would show the difference and pinpoint those cells for download, resulting in correct checksums. The other row and column checksums are independent. Let's say the entire first row were different: this would result in all the column checksums being different but only the first row checksum being different, so only that row would be pinpointed as differing.

    I can see real problems coming up if the whole first row and column were changed, making all checksum data different!

    Another method I thought of was (for example) to checksum the whole file and compare with the remote end; if different, checksum the first half and compare; if that is the same, checksum the third quarter and compare, and so on, until small enough blocks of differing data can be found and re-sent. Almost like a quick search through sorted data.
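
    A sketch of that halving idea, with MD5 standing in for whatever checksum would actually be used; in a real protocol each comparison against the remote file would cost a round trip to the server, and, like the matrix scheme, it assumes the two files are the same length (an insertion or deletion defeats it):

        import hashlib

        def csum(data):
            return hashlib.md5(data).digest()

        def find_diffs(local, remote, lo=0, hi=None, min_block=4096):
            """Return (lo, hi) ranges where two equal-length files differ.
            Each csum(remote[...]) stands in for asking the server for a
            checksum of that range."""
            if hi is None:
                hi = len(local)
            if csum(local[lo:hi]) == csum(remote[lo:hi]):
                return []
            if hi - lo <= min_block:
                return [(lo, hi)]            # small enough: just re-send this block
            mid = (lo + hi) // 2
            return (find_diffs(local, remote, lo, mid, min_block)
                    + find_diffs(local, remote, mid, hi, min_block))

        local = bytearray(b"x" * 100_000)
        remote = bytearray(local)
        remote[70_000:70_010] = b"0123456789"     # a small change most of the way in
        print(find_diffs(local, remote))          # -> [(68750, 71875)], one ~3 kB range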



    Re: Packages.gz is an imperfect approach (Score: 1)
    by caf on Saturday, December 08 @ 06:36:16 GMT
    (User Info)

    OK, why is that?

    Simply because of the property that the amount of redundantly resent data increases with the number of errors.

    How is that? If I have a matrix of 10x10 bytes (or blocks) represented as A0 - J9:

    With B3 and C4 in error, checksums B, C, 3 and 4 would be different and thus identify just those in error.

    Not quite true - if we see that checksums B, C, 3 and 4 are wrong, then we don't know for sure exactly which blocks are in error (or even how many blocks are in error) - it could be B3 and C4, or B4 and C3, or B3, B4 and C3, or B3, B4 and C4... etc. In this situation, at best we will have to resend only the blocks in error - but at worst we will have to resend all four intersecting blocks (even if only two blocks were actually erroneous), depending on which blocks we choose to resend first. The situation gets exponentially worse with more checksums in error.
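
    A trivial way to see the ambiguity (assuming each bad cell really does flip its row and column sums): different error patterns flag exactly the same checksums, so the checksums alone cannot tell them apart:

        def flagged(error_cells):
            """Which row and column checksums would disagree for these bad cells."""
            rows = sorted({r for r, c in error_cells})
            cols = sorted({c for r, c in error_cells})
            return rows, cols

        print(flagged({("B", 3), ("C", 4)}))              # (['B', 'C'], [3, 4])
        print(flagged({("B", 4), ("C", 3)}))              # (['B', 'C'], [3, 4]) - identical
        print(flagged({("B", 3), ("B", 4), ("C", 3)}))    # (['B', 'C'], [3, 4]) - also identical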

    Why is that? If A0 and B0 were changed on the server, then checksums A, B and 0 would show the difference and pinpoint those cells for download, resulting in correct checksums. The other row and column checksums are independent. Let's say the entire first row were different: this would result in all the column checksums being different but only the first row checksum being different, so only that row would be pinpointed as differing.

    I meant that if the difference between the two files included missing data, not just wrong data (or indeed, extra data). You can see that this would cause all the subsequent blocks to be shifted forward or backwards - so all the column checksums, and all the row checksums below this point, would be wrong.

    I can see real problems coming up if the whole first row and column were changed, making all checksum data different!

    Yes, this is (one of) the limiting cases for the problem of increased numbers of errors I mentioned earlier. At this point we have to resend all N*M blocks, even though only N+M-1 blocks are actually in error.

    Another method I thought of was (for example) to checksum the whole file and compare with the remote end; if different, checksum the first half and compare; if that is the same, checksum the third quarter and compare, and so on, until small enough blocks of differing data can be found and re-sent. Almost like a quick search through sorted data.

    Well, that's more like a binary search. But it's still a neat idea - this algorithm would be particularly suited for low latency connections (because you would need constant communications to determine which block to checksum next).

    - caf.



    Re: Packages.gz is an imperfect approach (Score: 1)
    by Shanep (shanep@_DEAD_ANIMAL_ign.com.au) on Tuesday, December 11 @ 10:15:34 GMT
    (User Info)

    You can see that this would cause all the subsequent blocks to be shifted forward or backwards - so all the column checksums, and all the row checksums below this point, would be wrong.

    Ahh! Bugger. ; )

    you would need constant communications to determine which block to checksum next

    What a hairy problem. I see what you mean now. It's a shame rsync uses so much memory for the sums.

    Thanks for the explanation.

    Maybe a slightly smarter apt-get that could request diffs against previous full files is the answer for speeding up system updates?



    Re: Packages.gz is an imperfect approach (Score: 1)
    by caf on Tuesday, December 11 @ 13:53:55 GMT
    (User Info)


    I reckon the answer is reversing the direction of the protocol. Stay tuned to this channel, as they say...



    Re: apt-get update - why not rsync? (Score: 3, Interesting)
    by caf on Friday, November 30 @ 11:58:59 GMT
    (User Info)

    It'd be useful for more than just the Packages file - you could use any older .debs you have cached as a source of matching data for newer versions of the same package. The assumption is that between close versions of a package there are often large binary similarities. (This, of course, is predicated on the packages being built with the 'rsyncable' gzip.)

    It's been suggested several times before, and has so far been shot down every time. See the debian-devel list archives. The main reasons seem to be:

    1) The patch that gives gzip an 'rsyncable' option isn't standard - and there doesn't appear to be a current maintainer of gzip to accept it. (Personally I think this isn't a problem, because it would just have to be patched into the Debian-packaged gzip.)

    2) The current rsync algorithm puts most of the CPU load onto the server side. This obviously isn't popular with server administrators.

    3) There are rumoured to be patent issues with the rsync algorithm. (I've never seen any evidence of this - just rumour, and I'm inclined to discount it until I see something to convince me otherwise).

    I've had some ideas in the direction of number 2 - I think it's possible to move the CPU load onto the client - and in fact, not even require a new daemon on the server at all. I perceive a lot of built-up hostility to the idea of an rsync-like algorithm in apt-get, so I'm reluctant to argue the point until I at least have some proof-of-concept code to back myself up.

    - caf.



    Re: apt-get update - why not rsync? (Score: 1)
    by LFTL on Friday, November 30 @ 17:08:11 GMT
    (User Info) http://www.auburn.edu/~wehresj

    Link to one of the debian-devel list archive discussions for those too lazy to search themselves. 🙂

    LFTL



    Re: apt-get update - why not rsync? (Score: 1)
    by bartman77 on Saturday, December 01 @ 18:17:23 GMT
    (User Info)

    Concerning the high load on the server: this is produced by the rsync daemon analysing the Packages file all over again for every client.

    Why not produce an indexed version on every update, reducing that repeated analysis to analysing once and then looking into a larger file with a special sort of index at the front?



    Re: apt-get update - why not rsync? (Score: 0)
    by Anonymous on Monday, December 03 @ 01:21:24 GMT

    Or, as a short-term solution, offer both gzip and bzip2 versions of the Packages file. The bzip2 one will be smaller and give some breathing room until a better solution is found.

    Gordon.
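
    For anyone curious what bzip2 would buy on a given snapshot, a quick check (the numbers will of course vary with the Packages file used):

        import bz2, gzip

        data = open("Packages", "rb").read()       # an uncompressed Packages file
        print("gzip :", len(gzip.compress(data)), "bytes")
        print("bzip2:", len(bz2.compress(data)), "bytes")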



    Re: apt-get update - why not rsync? (Score: 1)
    by abo on Tuesday, December 04 @ 22:27:02 GMT
    (User Info) http://sourceforge.net/users/abo/

    You save about 5-10%. Whoopy-doo! Not even worth considering given the headache, particularly when rsyncable gzip is just around the corner and would require a migration back to gz. If there were the promise of an rsyncable bzip2... then it would be worth considering.

    rsync saves about 90% on the uncompressed Packages file, which equates to about a 60% saving against the compressed version.



    Re: apt-get update - why not rsync? (Score: 1)
    by Integral on Friday, November 30 @ 19:19:20 GMT
    (User Info)

    Doesn't this belong on an FAQ list somewhere?

    Daniel



    Re: apt-get update - why not rsync? (Score: 1)
    by abo on Tuesday, December 04 @ 23:07:23 GMT
    (User Info) http://sourceforge.net/users/abo/

    I'm sure this has all been said before, but I'm gonna say it again anyway.

    rsync rocks, but it has a few problems:

  • high server load (needs to calculate rolling checksums and md4sums for every download)
  • hacked together implementation (a proof of concept implementation became the actual thing)
  • non-standard transfer protocol (rsync is not as widely used as ftp/http, and I believe the protocol is still evolving between versions)
  • sub-optimal deltas (but probably as good as you can get for arbitrary updates)

    The first is an artifact of the algorithm. The client calculates and sends a signature; the server calculates and sends deltas. This can be reversed: the server sends a signature, and the client calculates and requests deltas. Reversing the algorithm decreases the upload and slightly increases the download sizes. It also lends itself nicely to an HTTP implementation using ranged GETs to fetch the deltas. Unfortunately there might be some patent issues with this.
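
    A back-of-the-envelope sketch of that reversed, HTTP-friendly scheme: the mirror publishes a per-block checksum list next to each file, and the client rebuilds the new file from its old copy plus ranged GETs for the blocks it can't reproduce. The signature format, block size and fixed-offset matching are all simplifications invented here; a real client would also do rsync's rolling-match step so that insertions don't defeat it:

        import hashlib
        import urllib.request

        BLOCK = 8192   # illustrative block size

        def block_sums(path):
            with open(path, "rb") as f:
                return [hashlib.md5(chunk).hexdigest()
                        for chunk in iter(lambda: f.read(BLOCK), b"")]

        def rebuild(url, old_path, remote_sums):
            """Rebuild the new file from an old local copy plus ranged GETs."""
            have = block_sums(old_path)
            out = bytearray()
            with open(old_path, "rb") as old:
                for i, want in enumerate(remote_sums):
                    if i < len(have) and have[i] == want:
                        old.seek(i * BLOCK)
                        out += old.read(BLOCK)      # unchanged block: reuse local copy
                    else:
                        rng = f"bytes={i * BLOCK}-{(i + 1) * BLOCK - 1}"
                        req = urllib.request.Request(url, headers={"Range": rng})
                        with urllib.request.urlopen(req) as resp:
                            out += resp.read()      # changed block: fetch just this range
            return bytes(out)

    The point is that the checksumming and matching work has moved entirely to the client, and the server side is just static files behind any web server that understands Range requests.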

    The second problem is being addressed by the rproxy project which is working together with the rsync guys on creating rsync2 using a generic rsynclib with a nice zlib-like interface. This should spawn a heap of nice apps, including potentially squid proxy support for rsync:// urls.

    The third is a fact of life with any new protocol. rsync has, through sheer brilliant performance, nearly established itself as a de facto standard, despite the drawback that it has no standard. Things like rproxy are attempting to bring delta transfers to http as a backwards-compatible extension.

    The last is a limitation of arbitrary updates. The only way around this is for the machine doing the delta calculation to have complete access to both versions. This changes the algorithm: the client sends a version id, and the server calculates and sends the delta between the client's version and the current version. This means the server must keep at least deltas between all known versions and the current version. It also means the client must have a complete and uncorrupted copy of a particular version. This is what xdelta uses, and it is very nice for some applications, but I feel the limitations imposed kill it as a general-purpose network delta update.

    Those thinking of inventing stuff... don't! Instead help on rsynclib, rsync2, and rproxy. A heap of scripting wrappers for rsynclib would be a nice start.

    ABO



    Re: apt-get update - why not rsync? (Score: 1)
    by abo on Wednesday, December 05 @ 00:03:23 GMT
    (User Info) http://sourceforge.net/users/abo/

    Whups, make that rsync 3.0, not rsync2



    Re: apt-get update - why not rsync? (Score: 1)
    by caf on Thursday, December 06 @ 09:14:27 GMT
    (User Info)

    The first is an artifact of the algorithm. The client calculates and sends a signature; the server calculates and sends deltas. This can be reversed: the server sends a signature, and the client calculates and requests deltas. Reversing the algorithm decreases the upload and slightly increases the download sizes. It also lends itself nicely to an HTTP implementation using ranged GETs to fetch the deltas. Unfortunately there might be some patent issues with this.

    As I mentioned, I've heard these rumours about patent issues before - do you have any more information on this?



    Re: apt-get update - why not rsync? (Score: 1)
    by abo on Thursday, December 06 @ 23:37:11 GMT
    (User Info) http://sourceforge.net/users/abo/

    No further information... I've not looked into it. It might even just be FUD. The annoying thing about this rumor is it is killing off potential development... people like me are putting off any sort of effort because it might get hit by lawyers.

    Patents suck... perhaps someone should just anonymously implement stuff and freenet it.

    ABO



    Re: apt-get update - why not rsync? (Score: 1)
    by caf on Friday, December 07 @ 13:26:45 GMT
    (User Info)

    No further information... I've not looked into it. It might even just be FUD.

    That's what I was thinking.

    The annoying thing about this rumor is it is killing off potential development... people like me are putting off any sort of effort because it might get hit by lawyers.

    Hmm... I wasn't aware that patents could be used to stop development - only commercial use. I'm no lawyer though.

    Patents suck... perhaps someone should just anonymously implement stuff and freenet it.

    Yeah... submarine patents even more so.

    - caf.



    new type: non-US, non-free, civil-disobedience (aka non-legal) (Score: 0)
    by Anonymous on Saturday, January 05 @ 18:02:52 GMT

    Perhaps you could write the program, then distribute it and call it 'civil disobedience', as a protest against these stupid patent laws.

    You would have to write down all your research thoughts, and the whole process you go through in developing your algorithm, though, because you will have to prove that you thought it up yourself and did not use anyone else's ideas. Of course, rsync is pretty obvious to anyone with a brain, i.e. "why the hell do I have to download a bunch of crap I already downloaded?" - you do not need rocket science to come up with this idea.

    But to hold up in court, it would really, really help if you can show the judge, and more importantly the public and the world, that you thought the idea up on your own and did not 'steal' it from anyone.

    This is the hardest part, because most open source people are so arrogant and stupid that they think their ideas are 'too important to write down or explain', i.e. the 'I'm not a secretary, I'm a big bad programmer and you should all kiss my boots' philosophy.

    However, once done, everyone who wanted to participate in the civil disobedience could add the 'civil-disobedience' sources to their /etc/apt/sources.list and take their place in history.


