Volk: Am I getting all that's available for ARM?

Volk only gives me ‘generic’.

Is there some component available that I can add to my system to get
more out of it?

Neon is detected, so I think that is why I see the ‘generic_orc’ machine
listed.

The gory details:

gnuradio v3.7.0

Ubuntu 13.04 LTS (ARM distribution, ubuntu-core-13.04-core-armhf.tar.gz)

Part of my cmake config :
cmake -DCMAKE_C_FLAGS:STRING="-I/usr/include/arm-linux-gnueabihf
-mcpu=cortex-a15
-mfpu=neon -mvectorize-with-neon-quad -ffast-math
-funsafe-loop-optimizations" \

Parts of the cmake output:

– The CXX compiler identification is GNU 4.7.3
– The C compiler identification is GNU 4.7.3

– Configuring volk support…
– Enabling volk support.
– Override with -DENABLE_VOLK=ON/OFF
– Found PythonInterp: /usr/bin/python2 (found suitable version “2.7.4”)

– Python checking for python >= 2.5
– Python checking for python >= 2.5 - found

– Python checking for Cheetah >= 2.0.0
– Python checking for Cheetah >= 2.0.0 - found
– Boost version: 1.53.0
– Found the following Boost libraries:
– filesystem
– system
– unit_test_framework
– checking for module ‘orc-0.4 > 0.4.11’
– found orc-0.4 > 0.4.11, version 0.4.17
– Found ORC: /usr/lib/arm-linux-gnueabihf/liborc-0.4.so
– Looking for cpuid.h
– Looking for cpuid.h - not found
– Looking for intrin.h
– Looking for intrin.h - not found
– Looking for fenv.h
– Looking for fenv.h - found
– Looking for dlfcn.h
– Looking for dlfcn.h - found
– Compiler name: GNU
– Performing Test HAVE_WERROR_UNUSED_CMD_LINE_ARG
– Performing Test HAVE_WERROR_UNUSED_CMD_LINE_ARG - Failed
– Performing Test have_maltivec
– Performing Test have_maltivec - Failed
– Performing Test have_mfpu_neon
– Performing Test have_mfpu_neon - Success
– Performing Test have_mfloat_abi_softfp
– Performing Test have_mfloat_abi_softfp - Failed
– Performing Test have_funsafe_math_optimizations
– Performing Test have_funsafe_math_optimizations - Success
– Performing Test have_m32
– Performing Test have_m32 - Failed
– Performing Test have_m64
– Performing Test have_m64 - Failed
– Performing Test have_m3dnow
– Performing Test have_m3dnow - Failed
– Performing Test have_msse4_2
– Performing Test have_msse4_2 - Failed
– Performing Test have_mpopcnt
– Performing Test have_mpopcnt - Failed
– Performing Test have_mmmx
– Performing Test have_mmmx - Failed
– Performing Test have_msse
– Performing Test have_msse - Failed
– Performing Test have_msse2
– Performing Test have_msse2 - Failed
– Performing Test have_msse3
– Performing Test have_msse3 - Failed
– Performing Test have_mssse3
– Performing Test have_mssse3 - Failed
– Performing Test have_msse4a
– Performing Test have_msse4a - Failed
– Performing Test have_msse4_1
– Performing Test have_msse4_1 - Failed
– Performing Test have_mavx
– Performing Test have_mavx - Failed
– Available arch: generic;orc;norc
– Available machines: generic_orc

My /proc/cpuinfo :
Processor : ARMv7 Processor rev 2 (v7l)
processor : 0
BogoMIPS : 13.53
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x1
CPU part : 0x04d
CPU revision : 2

Thanks in advance,
Tim
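
As a side note on the /proc/cpuinfo listing in the message above: the Features line can also be checked programmatically. The following is only a minimal sketch, not anything Volk itself does; the helper name cpu_features is hypothetical, and the sample text is copied from the listing rather than read from a live system.

```python
# Sample text copied from the /proc/cpuinfo listing in the message above.
CPUINFO = """Processor : ARMv7 Processor rev 2 (v7l)
processor : 0
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
CPU architecture: 7
"""


def cpu_features(text):
    """Return the set of flags on the 'Features' line (hypothetical helper)."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            return set(value.split())
    return set()


features = cpu_features(CPUINFO)
print("neon" in features, "vfpv4" in features)
```

On a real target, the same function could be fed the contents of /proc/cpuinfo instead of the sample string.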

Hi Tim,
as far as I can tell, there’s not much you can do to improve your
arithmetic performance
“out of the box”.
You already seem to apply some compiler flags; however, it seems that
detection of Neon fails. That points to has_neon() in
volk/tmpl/volk_cpu.tmpl.c not being able to detect Neon on your machine.
This may have many causes, from wrong file permissions on /proc/self/auxv
to the kernel reporting Neon support without exposing it via the
aforementioned file.
Ideas on that:

  • has your kernel’s build config maybe not enabled neon?
  • does executing
    hexdump /proc/self/auxv|sed ‘0000 0010’
    show anything? [1] If not, could you please report the complete output
    of hexdump /proc/self/auxv
  • if you don’t specify any compiler flags when calling cmake in an
    (unused) build folder,
    which compiler flags does cmake set? ($>grep C_FLAGS CMakeCache.txt)

Happy hacking
Marcus M.

[1] note: in kernel terms of auxvec.h this stands for AT_HWCAP, see
http://lxr.free-electrons.com/source/include/uapi/linux/auxvec.h?a=arm#L24
How Volk handles it looks like AT_HWCAP can only start on a
4-byte-aligned address within auxv;
I’m absolutely not sure whether this is correct; on my x86_64 machine,
there is no AT_HWCAP in that case.

On 08/07/2013 02:44 AM, Marcus M. wrote:

[1] note: in kernel terms of auxvec.h this stands for AT_HWCAP, see
http://lxr.free-electrons.com/source/include/uapi/linux/auxvec.h?a=arm#L24
How Volk handles it looks like AT_HWCAP can only start on a 4-byte-aligned
address within auxv;
I’m absolutely not sure whether this is correct; on my x86_64 machine, there is
no AT_HWCAP in that case.

I decided to have a look into this instead of sleeping; man proc(5) and
Volk are right: AT_HWCAP can only occur at 64-bit (8-byte) boundaries in
/proc/PID/auxv. Sorry for the confusion.

On Tue, Aug 6, 2013 at 7:40 PM, Monahan-Mitchell, Tim wrote:

Volk only gives me ‘generic’.

Is there some component available that I can add to my system to get more out
of it?

Neon is detected, so I think that is why I see the ‘generic_orc’ machine listed.

Tim,

Right now, the only VOLK proto-kernels optimized for ARM are using
Orc. So yes, right now, that’s the only possible optimization you’re
going to get. The entire point of VOLK is, of course, to allow us to
extend the kernel support for different devices. So we need people to
develop proto-kernels for NEON to extend this support.

See more about the VOLK concept here:


Tom
Visit us at GRCon13 Oct. 1 - 4
http://www.trondeau.com/grcon13

Hi, Marcus.

Thanks for your help. Tom seems to have given a pretty definitive reply,
but this is just to completely answer your reply.

On 08/07/2013 02:44 AM, Marcus M. wrote:

[1] note: in kernel terms of auxvec.h this stands for AT_HWCAP, see
http://lxr.free-electrons.com/source/include/uapi/linux/auxvec.h?a=arm#L24
How Volk handles it looks like AT_HWCAP can only start on a 4-byte-aligned
address within auxv;
I’m absolutely not sure whether this is correct; on my x86_64 machine, there is
no AT_HWCAP in that case.
decided to have a look into this instead of sleeping; man proc(5) and volk are
right: AT_HWCAP can only occur every 64 bit in /proc/PID/auxv. Sorry for the
confusion.

Your earlier reply said:

“… however, as it seems, detection of Neon fails.”

But the cmake output I included said:

– Performing Test have_mfpu_neon - Success
– Performing Test have_funsafe_math_optimizations - Success

So I think the neon detection works?

You also asked: “does executing
hexdump /proc/self/auxv|sed ‘0000 0010’
show anything?”

Yes, an error :) … Maybe you meant grep? The first line of auxv
matches your string:

$ hexdump /proc/self/auxv
0000000 0010 0000 b0d7 0001 0006 0000 1000 0000
0000010 0011 0000 0064 0000 0003 0000 8034 0000
0000020 0004 0000 0020 0000 0005 0000 0009 0000
0000030 0007 0000 c000 4002 0008 0000 0000 0000
0000040 0009 0000 8f35 0000 000b 0000 0000 0000
0000050 000c 0000 0000 0000 000d 0000 0000 0000
0000060 000e 0000 0000 0000 0017 0000 0000 0000
0000070 0019 0000 dc07 bed4 001f 0000 dfeb bed4
0000080 000f 0000 dc17 bed4 0000 0000 0000 0000
0000090
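
For reference, the dump above can be decoded by hand. The sketch below is only an illustration, not Volk's has_neon() code. It assumes a 32-bit little-endian ARM layout (each auxv entry is a pair of 32-bit words), AT_HWCAP = 16 as in the kernel's auxvec.h, and HWCAP_NEON = bit 12 as in the ARM hwcap.h. The helper names are hypothetical.

```python
import struct

AT_HWCAP = 16            # entry type, from the kernel's auxvec.h
HWCAP_NEON = 1 << 12     # assumed ARM hwcap.h bit for NEON

# First two lines of the hexdump above, without the leading offsets.
DUMP_WORDS = (
    "0010 0000 b0d7 0001 0006 0000 1000 0000 "
    "0011 0000 0064 0000 0003 0000 8034 0000"
)


def words_to_bytes(text):
    """hexdump's default output is 16-bit words in host (little-endian) order."""
    return b"".join(struct.pack("<H", int(word, 16)) for word in text.split())


def find_hwcap(auxv, entry_fmt="<II"):
    """Scan (type, value) pairs and return the AT_HWCAP value, or None."""
    size = struct.calcsize(entry_fmt)
    for offset in range(0, len(auxv) - size + 1, size):
        a_type, a_val = struct.unpack_from(entry_fmt, auxv, offset)
        if a_type == AT_HWCAP:
            return a_val
    return None


hwcap = find_hwcap(words_to_bytes(DUMP_WORDS))
print(hex(hwcap), bool(hwcap & HWCAP_NEON))  # prints: 0x1b0d7 True
```

Under those assumptions, the first entry is AT_HWCAP with value 0x1b0d7, and the NEON bit is set, which agrees with the Features line in /proc/cpuinfo.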

And if I run the basic “cmake …/” command, I still get the same output
for Volk.

You asked for the CFLAGS in CMakeCache.txt in that case; here you go:

// CMAKE_C_FLAGS used) Debug Release RelWithDebInfo MinSizeRel.
CMAKE_C_FLAGS:STRING=
CMAKE_C_FLAGS_DEBUG:STRING=-g
CMAKE_C_FLAGS_MINSIZEREL:STRING=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
//ADVANCED property for variable: CMAKE_C_FLAGS
CMAKE_C_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_DEBUG
CMAKE_C_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_MINSIZEREL
CMAKE_C_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_RELEASE
CMAKE_C_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_RELWITHDEBINFO
CMAKE_C_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1

Just to compare, running my original cmake command yields the CFLAGS I
manually specified:

grep C_FLAGS CMakeCache.txt
// CMAKE_C_FLAGS used) Debug Release RelWithDebInfo MinSizeRel.
CMAKE_C_FLAGS:STRING=-I/usr/include/arm-linux-gnueabihf -mcpu=cortex-a15
-mfpu=neon -mvectorize-with-neon-quad -ffast-math
-funsafe-loop-optimizations
CMAKE_C_FLAGS_DEBUG:STRING=-g
CMAKE_C_FLAGS_MINSIZEREL:STRING=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
//ADVANCED property for variable: CMAKE_C_FLAGS
CMAKE_C_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_DEBUG
CMAKE_C_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_MINSIZEREL
CMAKE_C_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_RELEASE
CMAKE_C_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_C_FLAGS_RELWITHDEBINFO
CMAKE_C_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1

Thanks again and all the best,
Tim

I was just a little confused, because I read the volk/gen/archs.xml
template, and figured if
has_neon() returned true, -mfloat-abi=softfp and -mfpu=neon would have
been added to the C_FLAGS…
and cmake’s output contained “Performing Test have_mfloat_abi_softfp -
Failed”, so I guessed that (since I was quite sure that your machine
supports softfp) it was disabled by lack of compiler flags…

Thanks for your comprehensive answer and for finding my mistakes (the
sed slipped in there because I was stripping the leading offsets and
whitespaces…)!

Greetings,
Marcus

Monahan-Mitchell, Tim:

On 08/07/2013 09:48 PM, Monahan-Mitchell, Tim wrote:

Because of the ‘abi_softfp’ test failing on my x86, I decided I did not need to
re-build the ARM tool chain to support soft ABI to try and help Volk. Is that
still correct? (I have been able to build and run gnuradio without the soft flag
just fine).
I don’t think -mfloat-abi=softfp (or even =soft) applies to x86; I think
I remember the equivalent flag being something along the lines of
-msoft-float… Or am I misunderstanding you, and you are talking about
cross-compiling on x86 for ARM?

However, whenever you use a float library function (that is, with
abi_softfp whenever you do something with a float), the library needs to
“understand” your kind of float, so, yes, if you want to use softfp in
GNU Radio, I think you’d need to rebuild your ARM toolchain to support
that. I don’t know if that improves performance at all…

But since I’m reading my GCC manual right now [1], I stumbled across one
passage:

[1] http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html:
"""
-mfpu=name
This specifies what floating-point hardware (or hardware emulation) is
available on the target. Permissible names are: vfp, vfpv3, vfpv3-fp16,
vfpv3-d16, vfpv3-d16-fp16, vfpv3xd, vfpv3xd-fp16, neon, neon-fp16,
vfpv4, vfpv4-d16, fpv4-sp-d16, neon-vfpv4, fp-armv8, neon-fp-armv8, and
crypto-neon-fp-armv8.
If -msoft-float is specified this specifies the format of floating-point
values.

If the selected floating-point hardware includes the NEON extension
(e.g. -mfpu=neon), note that floating-point operations are not generated
by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is
also specified. This is because NEON hardware does not fully implement
the IEEE 754 standard for floating-point arithmetic (in particular
denormal values are treated as zero), so the use of NEON instructions
may lead to a loss of precision.
"""

So it might make sense to include -funsafe-math-optimizations, if
vectorize-with-neon-quad does not do that implicitly.
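
To illustrate the "denormal values are treated as zero" remark in the GCC text quoted above, here is a small sketch that emulates flush-to-zero on single-precision values in software. It is an assumption-laden illustration, not a measurement of NEON hardware; the helper names to_f32 and flush_to_zero are hypothetical.

```python
import struct

# Smallest normal float32 (bit pattern 0x00800000, about 1.18e-38).
FLT_MIN = struct.unpack("<f", struct.pack("<I", 0x00800000))[0]


def to_f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x):
    """Emulate flush-to-zero: float32 subnormals become 0.0."""
    x = to_f32(x)
    if x != 0.0 and abs(x) < FLT_MIN:
        return 0.0
    return x


tiny = 1e-40                      # representable in float32 only as a subnormal
print(to_f32(tiny) != 0.0)        # IEEE 754 behaviour keeps a nonzero value
print(flush_to_zero(tiny) == 0.0) # flush-to-zero behaviour loses it
```

Values in the normal float32 range pass through flush_to_zero unchanged, so the difference only shows up for very small magnitudes.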

Hi, Marcus

On 08/07/2013 09:48 PM, Monahan-Mitchell, Tim wrote:

Because of the ‘abi_softfp’ test failing on my x86, I decided I did not need to
re-build the ARM tool chain to support soft ABI to try and help Volk. Is that
still correct? (I have been able to build and run gnuradio without the soft flag
just fine).
I don’t think -mfloat-abi=softfp (or even =soft) applies to x86; I think I
remember the equivalent flag being something along the lines of -msoft-float… Or
am I misunderstanding you, and you are talking about cross-compiling on x86 for ARM?

In my volk travels to answer this email thread’s question, I was
comparing what cmake did on my x86 (where there is more volk support) to
what cmake does on my ARM target. I noted that both had “-- Performing
Test have_mfloat_abi_softfp - Failed”, hence my conclusion that softfp
was not a requirement for volk.

If the selected floating-point hardware includes the NEON extension (e.g.
-mfpu=neon), note that floating-point operations are not generated by GCC’s
auto-vectorization pass unless -funsafe-math-optimizations is also specified. This
is because NEON hardware does not fully implement the IEEE 754 standard for
floating-point arithmetic (in particular denormal values are treated as zero), so
the use of NEON instructions may lead to a loss of precision.

So it might make sense to include -funsafe-math-optimizations, if
vectorize-with-neon-quad does not do that implicitly.

Yes, I have -funsafe-math-optimizations and the cmake test for it
passes: " Performing Test have_funsafe_math_optimizations - Success"

Thanks,
Tim

Some quick notes:

  1. The -mfloat-abi option controls which ARM ABI is used for function
    calls. hard lets the compiler pass and return floats in VFP/NEON
    registers. soft and softfp pass them in integer registers instead
    (softfp may still use FPU instructions inside a function). The setting
    must match the one used to build the root file system you are running.

  2. Any feature detection that probes settings on the build machine is
    broken. It will lead to failures for people cross compiling. I basically
    try to force all instruction set settings for gnuradio for just this
    reason.

Philip

Hi, Marcus,

I was just a little confused, because I read the volk/gen/archs.xml template,
and figured if
has_neon() returned true, -mfloat-abi=softfp and -mfpu=neon would have been added
to the C_FLAGS…
and cmake’s output contained “Performing Test have_mfloat_abi_softfp -
Failed”, so I guessed that (since I was quite sure that your machine
supports softfp) it was disabled by lack of compiler flags…

Actually, I compared my VMware x86 gnuradio cmake output to my ARM
target, and it also failed the same abi_softfp test (but of course volk
is much more exciting on the x86 target).

But I also fiddled with that flag for a while: if I add -mfloat-abi=soft
to my CFLAGS in the cmake command, it fails early as it can’t find
crti.o and crt1.o. I was smart enough to find those on my target
and set LIBRARY_PATH to help. But then I get an error telling me there
is a mismatch between the ABI of my toolchain and the ABI of the
compiler output. So the soft flag came back out.

Because of the ‘abi_softfp’ test failing on my x86, I decided I did not
need to re-build the ARM tool chain to support soft ABI to try and help
Volk. Is that still correct? (I have been able to build and run gnuradio
without the soft flag just fine).

Thanks,
Tim

Just for the record.

If the selected floating-point hardware includes the NEON extension (e.g.
-mfpu=neon), note that floating-point operations are not generated by GCC’s
auto-vectorization pass unless -funsafe-math-optimizations is also specified. This
is because NEON hardware does not fully implement the IEEE 754 standard for
floating-point arithmetic (in particular denormal values are treated as zero), so
the use of NEON instructions may lead to a loss of precision.

Regarding loss of precision: my target is able to use ‘-mfpu=neon-vfpv4’
which selects floating point fused operations instead of chained. I
tried it, but a new test error surfaces due to accuracy (v3.7.0):

/src/gnuradio/build # ctest -V -R qa_ofdm_frame_equalizer_vcvc
UpdateCTestConfiguration from
:/src/gnuradio/build/DartConfiguration.tcl
UpdateCTestConfiguration from
:/src/gnuradio/build/DartConfiguration.tcl
Test project /src/gnuradio/build
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph…
Checking test dependency graph end
test 142
Start 142: qa_ofdm_frame_equalizer_vcvc

142: Test command: /bin/sh
“/src/gnuradio/build/gr-digital/python/digital/qa_ofdm_frame_equalizer_vcvc_test.sh”
142: Test timeout computed to be: 9.99988e+06
142: …F.
142:

142: FAIL: test_002_static (main.qa_ofdm_frame_equalizer_vcvc)
142:

142: Traceback (most recent call last):
142: File
“/src/gnuradio/gr-digital/python/digital/qa_ofdm_frame_equalizer_vcvc.py”,
line 244, in test_002_static
142: self.assertEqual(tag_dict, expected_dict)
142: AssertionError: {‘frame_len’: 4L, ‘ofdm_sync_chan_taps’: [0j, 0j,
(-2.2037331959268158e-08+1j), [truncated]… != {‘frame_len’: 4,
‘ofdm_sync_chan_taps’: [0, 0, 1j, 1j, 0, 1j, 1j, 0]}
142: + {‘frame_len’: 4, ‘ofdm_sync_chan_taps’: [0, 0, 1j, 1j, 0, 1j, 1j,
0]}
142: - {‘frame_len’: 4L,
142: - ‘ofdm_sync_chan_taps’: [0j,
142: - 0j,
142: - (-2.2037331959268158e-08+1j),
142: - (-2.2037331959268158e-08+1j),
142: - 0j,
142: - (2.2037331959268158e-08+1j),
142: - (-2.2037331959268158e-08+1j),
142: - 0j]}

So I went back to just ‘-mfpu=neon’.

I didn’t log this as a GR bug, since it feels like the “Doctor, it hurts
when I do this” variety of problems.
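
For anyone hitting the same failure: the test compares the tap values exactly, so an error of about 2e-08 is enough to fail it. The sketch below reuses the values from the output above to contrast an exact comparison with a tolerance-based one. It is only an illustration; the helper taps_close and the tolerance 1e-6 are assumptions, not what the GNU Radio QA code does.

```python
# Values copied from the failing test output above.
expected = [0, 0, 1j, 1j, 0, 1j, 1j, 0]
observed = [
    0j,
    0j,
    (-2.2037331959268158e-08 + 1j),
    (-2.2037331959268158e-08 + 1j),
    0j,
    (2.2037331959268158e-08 + 1j),
    (-2.2037331959268158e-08 + 1j),
    0j,
]


def taps_close(a, b, tol=1e-6):
    """Element-wise comparison of complex taps within an absolute tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


print(observed == expected)            # exact comparison: False
print(taps_close(observed, expected))  # tolerance-based comparison: True
```

Whether a tolerance of that size is acceptable for this test is a judgment call for the maintainers.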