Steve's blog

I've posted this analysis to Debian mailing lists already, but I'm thinking it's useful as a blog post too. I've also fixed a few typos and added a few more details that people have suggested.

This has taken a while in coming, for which I apologise. There's a lot of work involved in rebuilding the whole Debian archive, and many days spent analysing the results. You learn quite a lot, too! :-)

I promised way back before DebConf 18 last August that I'd publish the results of the rebuilds that I'd just started. Here they are, after a few false starts. I've been rebuilding the archive specifically to check if we would have any problems building our 32-bit Arm ports (armel and armhf) using 64-bit arm64 hardware. I might have found other issues too, but that was my goal.

The logs for all my builds are online at

https://www.einval.com/debian/arm/rebuild-logs/

for reference. See in particular

for automated analysis of the build logs that I've used as the basis for the stats below.

Executive summary

As far as I can see we're basically fine to use arm64 hosts for building armel and armhf, so long as those hosts include hardware support for the 32-bit A32 instruction set. As I've mentioned before, that's not a given on all arm64 machines, but there are sufficient machine types available that I think we should be fine. There are a couple of things we need to do in terms of setup - see Machine configuration below.

Methodology

I (naively) just attempted to rebuild all the source packages in unstable main, at first using pbuilder to control the build process and then later using sbuild instead. I didn't think to check on the stated architectures listed for the source packages, which was a mistake - I would do it differently if redoing this test. That will have contributed quite a large number of failures in the stats below, but I believe I have accounted for them in my analysis.

I built lots of packages, using a range of machines in a small build farm at home:

Macchiatobin
Seattle
Synquacer
Multiple Mustangs

using my local mirror for improved performance when fetching build-deps etc. I started off with a fixed list of packages that were in unstable when I started each rebuild, for the sake of simplicity. That's one reason why I have two different numbers of source packages attempted for each arch below. If packages failed due to no longer being available, I simply re-queued using the latest version in unstable at that point.

I then developed a script to scan the logs of failed builds to pick up on patterns that matched with obvious causes. Once that was done, I worked through all the failures to (a) verify those patterns, and (b) identify any other failures. I've classified many of the failures to make sense of the results. I've also scanned the Debian BTS for existing bugs matching my failed builds (and linked to them), or filed new bugs where I could not find matches.
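The pattern-scanning step could look something like the sketch below. This is not the script used for the analysis (the real scripts are in the git repository linked below). The regular expressions and the `classify_log` and `summarise` helpers are hypothetical, chosen only to illustrate matching log text against known failure signatures and counting the hits.

```python
import re
from collections import Counter

# Hypothetical signatures: illustrative guesses, not the patterns
# actually used to produce the stats in this post.
PATTERNS = [
    ("Alignment problem", re.compile(r"Bus error|SIGBUS")),
    ("Segmentation fault", re.compile(r"Segmentation fault|SIGSEGV")),
    ("Illegal instruction", re.compile(r"Illegal instruction|SIGILL")),
]


def classify_log(text):
    """Return the label of every signature found in one build log."""
    return [label for label, rx in PATTERNS if rx.search(text)]


def summarise(logs):
    """Count matching logs per label; `logs` maps a log name to its text."""
    counts = Counter()
    for text in logs.values():
        # Logs matching nothing still need a manual look.
        counts.update(classify_log(text) or ["Unclassified"])
    return counts


if __name__ == "__main__":
    sample = {
        "pkg-a.log": "make: *** [test] Bus error",
        "pkg-b.log": "Segmentation fault (core dumped)",
        "pkg-c.log": "dh_auto_test: error: exit code 2",
    }
    for label, count in summarise(sample).items():
        print(f"{count} log(s) showing {label}")
```

Anything left as "Unclassified" is what would then need checking by hand, as described above.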

I did not investigate fully every build failure. For example, where a package has never been built before on armel or armhf and failed here I simply noted that fact. Many of those are probably real bugs, but beyond the scope of my testing.

For reference, all my scripts and config are in git at

https://git.einval.com/cgi-bin/gitweb.cgi?p=buildd-scripts.git

armel results

Total source packages attempted	28457
Successfully built	25827
Failed	2630

Almost half of the failed builds were simply due to the lack of a single desired build dependency (nodejs:armel, 1289). There was also a smattering of other notable causes:

100 log(s) showing build failures (java/javadoc)
Java build failures seem particularly opaque (to me!), and in many cases I couldn't ascertain if it was a real build problem or just maven being flaky. :-(
15 log(s) showing Go 32-bit integer overflow
Quite a number of Go packages are blindly assuming sizes for 64-bit hosts. That's probably fair, but seems unfortunate.
8 log(s) showing Sbuild build timeout
I was using quite a generous timeout (12h) with sbuild, but still a very small number of packages failed. I'd earlier abandoned pbuilder for sbuild as I could not get it to behave sensibly with timeouts.

The stats that matter are the arch-specific failures for armel:

13 log(s) showing Alignment problem
5 log(s) showing Segmentation fault
1 log showing Illegal instruction

and the new bugs I filed:

3 bugs for arch misdetection
8 bugs for alignment problems
4 bugs for arch-specific test failures
3 bugs for arch-specific misc failures

Considering the number of package builds here, I think these numbers are basically "lost in the noise". I have found so few issues that we should just go ahead. The vast majority of the failures I found were either already known in the BTS (260), unrelated to what I was looking for, or both.

See below for more details about build host configuration for armel builds.

armhf results

Total source packages attempted	28056
Successfully built	26772
Failed	1284

FTAOD: I attempted fewer package builds for armhf as we simply had a smaller number of packages when I started that rebuild. A few weeks later, it seems we had a few hundred more source packages for the armel rebuild.

The armhf rebuild showed broadly the same percentage of failures, if you take into account the nodejs difference - it exists in the armhf archive, so many hundreds more packages could build using it.
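As a rough check of that comparison, the arithmetic can be sketched using only the totals quoted in this post. The `failure_rate` helper is hypothetical, and simply subtracting the 1289 nodejs-related armel failures while keeping the same denominator is a simplifying assumption.

```python
def failure_rate(failed, attempted, excluded=0):
    """Percentage of attempted builds that failed, ignoring `excluded` failures."""
    return 100.0 * (failed - excluded) / attempted


# Totals taken from the armel and armhf results in this post.
armel_raw = failure_rate(2630, 28457)
armel_without_nodejs = failure_rate(2630, 28457, excluded=1289)
armhf_raw = failure_rate(1284, 28056)

if __name__ == "__main__":
    print(f"armel, all failures:      {armel_raw:.1f}%")
    print(f"armel, excluding nodejs:  {armel_without_nodejs:.1f}%")
    print(f"armhf, all failures:      {armhf_raw:.1f}%")
```

That gives roughly 9.2% for armel overall, 4.7% once the nodejs failures are set aside, and 4.6% for armhf.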

In a similar vein for notable failures:

89 log(s) showing build failures (java/javadoc)
Similar problems, I guess...
15 log(s) showing Go 32-bit integer overflow
That's the same count as for armel; I'm assuming (without checking!) that they're the same packages.
4 log(s) showing Sbuild build timeout
Only 4 timeouts compared to the 8 for armel. Maybe a sign that armhf will be slightly quicker in build time, so less likely to hit a timeout? Total guesswork on small-number stats! :-)

Arch-specific failures found for armhf:

11 log(s) showing Alignment problem
4 log(s) showing Segmentation fault
1 log(s) showing Illegal instruction

and the new bugs I filed:

1 bug for arch misdetection
8 bugs for alignment problems
10 bugs for arch-specific test failures
3 bugs for arch-specific misc failures

Again, these small numbers tell me that we're fine. I linked to 139 existing bugs in the BTS here.

Machine configuration

To be able to support 32-bit builds on arm64 hardware, there are a few specific hardware support issues to consider.

Alignment

Our 32-bit Arm kernels are configured to fix up userspace alignment faults, which hides lazy programming at the cost of a (sometimes massive) slowdown in performance when this fixup is triggered. The arm64 kernel cannot be configured to do this - if a userspace program triggers an alignment exception, it will simply be handed a SIGBUS by the kernel. This was one of the main things I was looking for in my rebuild, common to both armel and armhf. In the end, I only found a very small number of problems.

Given that, I think we should immediately turn off the alignment fixups on our existing 32-bit Arm buildd machines. Let's flush out any more problems early, and I don't expect to see many.
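For anyone wanting to check what a 32-bit Arm machine is currently doing, here is a minimal sketch. It assumes the `/proc/cpu/alignment` interface and mode numbers as described in the kernel's Arm alignment documentation (bit 0 = warn, bit 1 = fixup, bit 2 = signal); the helper names are hypothetical, and the exact layout of the file should be verified on a real machine.

```python
import re

# Mode numbers as described in the kernel documentation for
# /proc/cpu/alignment on 32-bit Arm (assumed here, not taken from this post).
ALIGNMENT_MODES = {
    0: "ignored",
    1: "warn",
    2: "fixup",
    3: "fixup+warn",
    4: "signal",
    5: "signal+warn",
}
FIXUP_BIT = 2


def parse_user_faults_mode(proc_text):
    """Extract the numeric mode from the 'User faults:' line, or None if absent."""
    match = re.search(r"^User faults:\s*(\d+)", proc_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def fixups_enabled(mode):
    """True when the kernel would silently fix up userspace alignment faults."""
    return bool(mode & FIXUP_BIT)


if __name__ == "__main__":
    try:
        with open("/proc/cpu/alignment") as handle:
            mode = parse_user_faults_mode(handle.read())
    except OSError:
        mode = None  # not a 32-bit Arm kernel, or the interface is not built in
    if mode is None:
        print("no alignment control found")
    else:
        print(f"mode {mode} ({ALIGNMENT_MODES.get(mode, 'unknown')}), "
              f"fixups {'on' if fixups_enabled(mode) else 'off'}")
```

Under these assumptions, turning the fixups off would mean selecting a mode without the fixup bit, such as 4 (signal), which matches the SIGBUS behaviour of arm64 kernels described above.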

To give credit here: Ubuntu have been using arm64 machines for building 32-bit Arm packages for a while now, and have already been filing bugs with patches which will have helped reduce this problem. Thanks!

Deprecated / retired instructions

In theory(!), alignment is all we should need to worry about for armhf builds, but our armel software baseline needs two additional pieces of configuration to make things work, enabling emulation for

SWP (low-level locking primitive, deprecated since ARMv6 AFAIK)
CP15 barriers (low-level barrier primitives, deprecated since ARMv7)

Again, there is quite a performance cost to enabling emulation support for these instructions but it is at least possible!
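A quick way to check a build host might look like the sketch below. It assumes the arm64 kernel exposes these controls as `/proc/sys/abi/swp` and `/proc/sys/abi/cp15_barrier`, with 0 meaning undefined, 1 emulate and 2 hardware execution, as in the kernel's legacy instructions documentation; the helper names are hypothetical.

```python
from pathlib import Path

# Values as described in the arm64 kernel's legacy instruction documentation
# (an assumption here; verify against the kernel actually running).
MODES = {0: "undef", 1: "emulate", 2: "hw"}
INSTRUCTIONS = ("swp", "cp15_barrier")


def read_emulation_modes(base="/proc/sys/abi"):
    """Return {instruction: mode name}, with None where the control is unreadable."""
    result = {}
    for name in INSTRUCTIONS:
        try:
            value = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
            continue
        result[name] = MODES.get(value, f"unknown({value})")
    return result


def ready_for_armel(modes):
    """True when both instructions are either emulated or run in hardware."""
    return all(modes.get(name) in ("emulate", "hw") for name in INSTRUCTIONS)


if __name__ == "__main__":
    modes = read_emulation_modes()
    for name, mode in modes.items():
        print(f"{name}: {mode}")
    print("ok for armel builds" if ready_for_armel(modes) else "emulation not enabled")
```

A missing control file is reported as `None` rather than raising, since not every kernel is built with these options.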

In my initial testing for rebuilding armhf only, I did not enable either of these emulations. I was then finding lots of "Illegal Instruction" crashes due to CP15 barrier usage in armhf Haskell and Mono programs. This suggests that maybe(?) the baseline architecture in these toolchains is incorrectly set to target ARMv6 rather than ARMv7. That should be fixed and all those packages rebuilt at some point.

UPDATES

Peter Green pointed out that ghc in Debian armhf is definitely configured for ARMv7, so maybe there is a deeper problem.
Edmund Grimley Evans suggests that the Haskell problem is coming from how it drives LLVM, linking to #864847 that he filed in 2017.

Bug highlights

There are a few things I found that I'd like to highlight:

In the glibc build, we found an arm64 kernel bug (#904385) which has since been fixed upstream thanks to Will Deacon at Arm. I've backported the fix for the 4.9-stable kernel branch, so the fix will be in our Stretch kernels soon.
There's something really weird happening with Vim (#917859). It FTBFS for me with an odd test failure for both armel-on-arm64 and armhf-on-arm64 using sbuild, but in a porter box chroot or directly on my hardware using debuild it works just fine. Confusing!
I've filed quite a number of bugs over the last few weeks. Many are generic new FTBFS reports for old packages that haven't been rebuilt in a while, and some of them look un-maintained. However, quite a few of my bugs are arch-specific ones in better-maintained packages and several have already had responses from maintainers or have already been fixed. Yay!
Yesterday, I filed a slew of identical-looking reports for packages using MPI and all failing tests. It seems that we have a real problem hitting openmpi-based packages across the archive at the moment (#918157 in libpmix2). I'm going to verify that on my systems shortly.

Other things to think about

Building in VMs

So far in Debian, we've tended to run our build machines using chroots on raw hardware. We have a few builders (x86, arm64) configured as VMs on larger hosts, but as far as I can see that's the exception so far. I know that OpenSUSE and Fedora are both building using VMs, and now that we have more powerful arm64 hosts available, it's probably the way we should go for our Arm ports too.

In testing using "linux32" chroots on native hardware, I was explicitly looking to find problems in native architecture support. In the case of alignment problems, they could be readily "fixed up / hidden" (delete as appropriate!) by building using 32-bit guest kernels with fixups enabled. If I'd found lots of those, that would be a safer way to proceed than instantly filing lots of release-critical FTBFS bugs. However, given the small number of problems found I'm not convinced it's worth worrying about.

Utilisation of hardware

Another related issue is in how we choose to slice up build machines. Many packages will build very well in parallel, and that's great if you have something like the Synquacer with many small/slow cores. However, not all our packages work so well and I found that many are still resolutely chugging through long build/test processes in single threads. I experimented a little with my config during the rebuilds and what seemed to work best for throughput was kicking off one build per 4 cores on the machines I was using. That seems to match up with what the Fedora folks are doing (thanks to hrw for the link!).
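The "one build per 4 cores" rule can be sketched as below. The `plan_builds` and `build_environment` helpers are hypothetical, and passing the per-build job count via `DEB_BUILD_OPTIONS=parallel=N` is an assumption about how the limit would be applied, not a description of the configuration used for these rebuilds.

```python
def plan_builds(total_cores, cores_per_build=4):
    """Return (concurrent_builds, parallel_jobs_per_build) for one machine."""
    if total_cores < 1 or cores_per_build < 1:
        raise ValueError("core counts must be positive")
    builds = max(1, total_cores // cores_per_build)
    jobs = min(cores_per_build, total_cores)
    return builds, jobs


def build_environment(jobs):
    """Environment fragment for one build, using the standard parallel=N option."""
    return {"DEB_BUILD_OPTIONS": f"parallel={jobs}"}


if __name__ == "__main__":
    # Example core counts only; substitute the real machine's count.
    for cores in (8, 24):
        builds, jobs = plan_builds(cores)
        print(f"{cores} cores: {builds} concurrent builds, {build_environment(jobs)}")
```

With this split, a package that parallelises well still gets four jobs, while a single-threaded one only leaves three cores idle rather than the whole machine.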

Migrating build hardware

As I mentioned earlier, to build armel and armhf sanely on arm64 hardware, we need to be using arm64 machines that include native support for the 32-bit A32 instruction set. While we have lots of those around at the moment, some newer/bigger arm64 server platforms that I've seen announced do not include it. (See an older mail from me for more details.) We'll need to be careful about this going forwards and keep using (at least) some machines with A32. Maybe we'll migrate arm64-only builds onto newer/bigger A64-only machines and keep the older machines for armel/armhf if that becomes a problem?

At least for the foreseeable future, I'm not worried about losing A32 support. Arm keeps on designing and licensing ARMv8 cores that include it...

Thanks

I've spent a lot of time looking at existing FTBFS bugs over the last weeks, to compare results against what I've been seeing in my build logs. Much kudos to people who have been finding and filing those bugs ahead of me, in particular Adrian Bunk and Matthias Klose who have filed many such bugs. Also thanks to Helmut Grohne for his script to pull down a summary of FTBFS bugs from UDD - that saved many hours of effort!

Finally...

Please let me know if you think you've found a problem in what I've done, or how I've analysed the results here. I still have my machines set up for easy rebuilds, so reproducing things and testing fixes is quite easy - just ask!


Comments

Re: Rebuilding the entire Debian archive twice on arm64 hardware for fun and profit
Roger Leigh wrote on Thu, 10 Jan 2019 14:43

Regarding use of virtual machines, it should be quite possible to add such a backend to sbuild.

sbuild currently supports two methods of building: sudo and schroot. It should be quite straightforward to add additional methods, e.g. docker or any other container or virtualisation technology. Nowadays, tools like docker have long surpassed most of the capabilities and features of schroot, and they would most likely be better choices for the longer term.

I would have liked to have evolved schroot into such a tool, but it's clear that the mindshare and market share has long been taken over by other technologies, and I simply didn't have the time or funding to take it to that level, though it would have been technically within our reach to do so.

Regards, Roger
