This has taken a while in coming, for which I apologise. There's a
lot of work involved in rebuilding the whole Debian archive,
and many days spent analysing the results. You learn
quite a lot, too! :-)
I promised way back before DebConf 18 last August that I'd publish
the results of the rebuilds that I'd just started. Here they are,
after a few false starts. I've been rebuilding the archive
specifically to check if we would have any problems building
our 32-bit Arm ports (armel and armhf) using 64-bit arm64 hardware. I
might have found other issues too, but that was my goal.
for reference. See in particular
for automated analysis of the build logs that I've used as the basis
for the stats below.
As far as I can see we're basically fine to use arm64 hosts for
building armel and armhf, so long as those hosts
include hardware support for the 32-bit A32 instruction set. As
I've mentioned
before, that's not a given on all arm64 machines,
but there are sufficient machine types available that I think we
should be fine. There are a couple of things we need to do in terms of
setup - see Machine configuration
below.
I (naively) just attempted to rebuild all the source packages in
unstable main, at first using pbuilder to control the build process
and then later using sbuild instead. I didn't think to check on the
stated architectures listed for the source packages, which was a
mistake - I would do it differently if redoing this test. That will
have contributed quite a large number of failures in the stats below,
but I believe I have accounted for them in my analysis.
I built lots of packages, using a range of machines in a small build
farm at home:
using my local mirror for improved performance when fetching
build-deps etc. I started off with a fixed list of packages that were
in unstable when I started each rebuild, for the sake of
simplicity. That's one reason why I have two different numbers of
source packages attempted for each arch below. If packages failed due
to no longer being available, I simply re-queued using the latest
version in unstable at that point.
I then developed a script to scan the logs of failed builds to pick
up on patterns that matched with obvious causes. Once that was done, I
worked through all the failures to (a) verify those patterns, and (b)
identify any other failures. I've classified many of the failures to
make sense of the results. I've also scanned the Debian BTS for
existing bugs matching my failed builds (and linked to them), or filed
new bugs where I could not find matches.
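
As an illustration only (this is not the script used for the analysis here), pattern-based classification of failed build logs could be sketched as below. The labels and regular expressions are hypothetical examples, and real logs would need a much longer, hand-verified list.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; not the ones used in the real analysis.
# Order matters: the first matching pattern wins.
PATTERNS = [
    ("missing-build-dep", re.compile(r"unmet dependencies|but it is not installable", re.I)),
    ("alignment-sigbus", re.compile(r"\bBus error\b|\bSIGBUS\b")),
    ("illegal-instruction", re.compile(r"Illegal instruction|\bSIGILL\b", re.I)),
]


def classify_log(text):
    """Return the label of the first pattern found in the log text."""
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "unclassified"


def summarise(logs):
    """Count failure classes across a mapping of package name -> log text."""
    return Counter(classify_log(text) for text in logs.values())
```

Anything that ends up as "unclassified" still needs reading by hand, which is step (b) above.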
Almost half of the failed builds were simply due to the lack of a
single desired build dependency (nodejs:armel, 1289). There was a
smattering of other notable causes:
Considering the number of package builds here, I think these
numbers are basically "lost in the noise". I have found so few issues
that we should just go ahead. The vast majority of the failures I
found were either already known in the BTS (260), unrelated to what I
was looking for, or both.
The armhf rebuild showed broadly the same percentage of failures,
if you take into account the nodejs difference - it exists in the
armhf archive, so many hundreds more packages could build using
it.
Again, these small numbers tell me that we're fine. I linked to 139
existing bugs in the BTS here.
Machine configuration
To be able to support 32-bit builds on arm64 hardware, there are a
few specific hardware support issues to consider.
Alignment
Our 32-bit Arm kernels are configured to fix up userspace alignment
faults, which hides lazy programming at the cost of a (sometimes
massive) slowdown in performance when this fixup is triggered. The
arm64 kernel cannot be configured to do this - if a
userspace program triggers an alignment exception, it will simply be
handed a SIGBUS by the kernel. This was one of the main things I was
looking for in my rebuild, common to both armel and armhf. In the end,
I only found a very small number of problems.
Given that, I think we should immediately turn off
the alignment fixups on our existing 32-bit Arm buildd machines. Let's
flush out any more problems early, and I don't expect to see many.
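
For reference, 32-bit Arm kernels expose the userspace alignment behaviour at runtime through /proc/cpu/alignment, as a small bitmask (1 = warn, 2 = fixup, 4 = send SIGBUS) according to the kernel's mem_alignment documentation. The helper below is only a sketch for decoding such a value; the function name is my own, and how a given buildd should actually be configured is for its admins to decide.

```python
# Bit meanings as described in the kernel's mem_alignment documentation.
WARN, FIXUP, SIGNAL = 1, 2, 4


def describe_alignment_mode(mode):
    """Decode a /proc/cpu/alignment mode value into a list of actions."""
    actions = []
    if mode & WARN:
        actions.append("warn")
    if mode & FIXUP:
        actions.append("fixup")
    if mode & SIGNAL:
        actions.append("signal")
    return actions


# Writing 4 (as root) asks the kernel to deliver SIGBUS instead of fixing up,
# which is what an arm64 kernel does for 32-bit userspace anyway.
print(describe_alignment_mode(4))
```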
To give credit here: Ubuntu
have been using arm64 machines for building 32-bit Arm packages for a
while now, and have already been filing bugs with patches which will
have helped reduce this problem. Thanks!
Deprecated / retired instructions
In theory(!), alignment is all we should need to worry about for
armhf builds, but our armel software baseline needs two additional
pieces of configuration to make things work, enabling emulation for:
- SWP (low-level locking primitive, deprecated since ARMv6 AFAIK)
- CP15 barriers (low-level barrier primitives, deprecated since ARMv7)
Again, there is quite a performance cost to enabling
emulation support for these instructions but it is at least
possible!
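
On arm64 kernels built with support for deprecated instructions, these settings appear as sysctls under /proc/sys/abi (abi.swp and abi.cp15_barrier), where 0 means undefined instruction, 1 means emulate and 2 means hardware execution where the CPU allows it. A minimal sketch for checking what a host is currently doing, assuming those files exist on the machine in question:

```python
from pathlib import Path

# Mode values as documented for the arm64 legacy instruction sysctls.
MODES = {0: "undef", 1: "emulate", 2: "hardware"}


def legacy_instruction_modes(abi_dir="/proc/sys/abi", names=("swp", "cp15_barrier")):
    """Report the current handling mode for each legacy instruction sysctl."""
    result = {}
    for name in names:
        path = Path(abi_dir) / name
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing (kernel built without the option) or unreadable.
            result[name] = "unavailable"
            continue
        result[name] = MODES.get(value, f"unknown ({value})")
    return result


if __name__ == "__main__":
    for name, mode in legacy_instruction_modes().items():
        print(f"abi.{name}: {mode}")
```

For armel builds, both would need to report "emulate" (or "hardware" where supported); setting them is a job for sysctl configuration on the host.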
In my initial testing for rebuilding armhf only, I did not enable
either of these emulations. I was then finding lots
of "Illegal Instruction" crashes due to CP15 barrier usage in armhf
Haskell and Mono programs. This suggests that maybe(?) the baseline
architecture in these toolchains is incorrectly set to target ARMv6
rather than ARMv7. That should be fixed and all those packages rebuilt
at some point.
UPDATES
- Peter Green pointed out that ghc in Debian armhf is definitely
configured for ARMv7, so maybe there is a deeper problem.
- Edmund Grimley Evans suggests that the Haskell problem is coming from
how it drives LLVM, linking to #864847 that he filed in 2017.
Bug highlights
There are a few things I found that I'd like to highlight:
- In the glibc build, we found an arm64 kernel bug
(#904385) which has
since been fixed upstream thanks to Will Deacon at Arm. I've
backported the fix for the 4.9-stable kernel branch, so the fix will
be in our Stretch kernels soon.
- There's something really weird happening with Vim
(#917859). It FTBFS for
me with an odd test failure for both armel-on-arm64 and
armhf-on-arm64 using sbuild, but in a porter box
chroot or directly on my hardware using debuild it works just
fine. Confusing!
- I've filed quite a number of bugs over the last few weeks. Many
are generic new FTBFS reports for old packages that haven't been
rebuilt in a while, and some of them look un-maintained. However,
quite a few of my bugs are arch-specific ones in better-maintained
packages and several have already had responses from maintainers or
have already been fixed. Yay!
- Yesterday, I filed a slew of identical-looking reports for
packages using MPI and all failing tests. It seems that we have a
real problem hitting openmpi-based packages across the archive at
the moment (#918157 in
libpmix2). I'm going to verify that on my systems shortly.
Other things to think about
Building in VMs
So far in Debian, we've tended to run our build machines using
chroots on raw hardware. We have a few builders (x86, arm64)
configured as VMs on larger hosts, but as far as I can see that's the
exception so far. I know that openSUSE and Fedora are both building
using VMs, and now that we have more powerful arm64 hosts available
for our Arm ports, it's probably the way we should go here.
In testing using "linux32" chroots on native hardware, I was
explicitly looking to find problems in native architecture support. In
the case of alignment problems, they could be readily "fixed up /
hidden" (delete as appropriate!) by building using 32-bit guest
kernels with fixups enabled. If I'd found lots of
those, that would be a safer way to proceed than instantly filing lots
of release-critical FTBFS bugs. However, given the small number of
problems found I'm not convinced it's worth worrying about.
Utilisation of hardware
Another related issue is in how we choose to slice up build
machines. Many packages will build very well in parallel, and that's
great if you have something like the Synquacer with many small/slow
cores. However, not all our packages work so well and I found that
many are still resolutely chugging through long build/test processes
in single threads. I experimented a little with my config during the
rebuilds and what seemed to work best for throughput was kicking off
one build per 4 cores on the machines I was using. That seems to match
up with what
the Fedora
folks are doing (thanks to hrw for the link!).
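
As a rough sketch of that rule of thumb (the function name and the default of 4 are just my shorthand for the ratio described above, not a tuned value):

```python
import os


def parallel_builds(cores=None, cores_per_build=4):
    """Number of concurrent package builds to start on a machine."""
    if cores is None:
        # os.cpu_count() may return None on unusual platforms.
        cores = os.cpu_count() or 1
    # Always run at least one build, even on machines with fewer than 4 cores.
    return max(1, cores // cores_per_build)


if __name__ == "__main__":
    print(parallel_builds())
```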
Migrating build hardware
As I mentioned earlier, to build armel and armhf sanely on arm64
hardware, we need to be using arm64 machines that include native
support for the 32-bit A32 instruction set. While we have lots of
those around at the moment, some newer/bigger arm64 server platforms
that I've seen announced do not include it. (See an older mail from
me for more details.) We'll need to be careful
about this going forwards and keep using (at least) some machines with
A32. Maybe we'll migrate arm64-only builds onto newer/bigger A64-only
machines and keep the older machines for armel/armhf if that becomes a
problem?
At least for the foreseeable future, I'm not worried about losing
A32 support. Arm keeps on designing and licensing ARMv8 cores that
include it...
Thanks
I've spent a lot of time looking at existing FTBFS bugs over the
last weeks, to compare results against what I've been seeing in my
build logs. Much kudos to people who have been finding and filing
those bugs ahead of me, in particular Adrian Bunk and Matthias Klose
who have filed many such bugs. Also thanks to Helmut
Grohne for his script to pull down a summary of FTBFS bugs from UDD -
that saved many hours of effort!
Finally...
Please let me know if you think you've found a problem in what I've
done, or how I've analysed the results here. I still have my machines
set up for easy rebuilds, so reproducing things and testing fixes is
quite easy - just ask!