[Unbound-users] segfault when using dnstap at high load

Discussion:

Rogerio Bastos

2015-04-06 21:36:22 UTC

Hi,

I'm trying to test Unbound with dnstap. It works fine under low load, but
exits with a segfault under high load. The segfault only happens when dnstap
is enabled in the configuration.

I am using the Debian package (version 1.5.3) available in [1],
recompiled with dnstap enabled.
I'm following the instructions described in [2] and using fstrm version
0.2.0.

To test the server, I'm using dnsblast [3] with the following command:

./dnsblast <server address> 50000 500

[1] https://packages.debian.org/experimental/unbound
[2] http://dnstap.info/Source/
[3] https://github.com/jedisct1/dnsblast

--
My email was sent by May First/People Link
https://mayfirst.org

Robert Edmonds

2015-04-07 01:44:40 UTC

Post by Rogerio Bastos
I'm trying to test Unbound with dnstap. It works fine under low load, but
exits with a segfault under high load. The segfault only happens when dnstap is
enabled in the configuration.
I am using the Debian package (version 1.5.3) available in [1], recompiled
with dnstap enabled.
I'm following the instructions described in [2] and using fstrm version 0.2.0.
./dnsblast <server address> 50000 500

Hi, Rogerio:

Sorry to hear that. I would be happy to help debug dnstap (I wrote the
dnstap patchset for Unbound). Can I get some information about your
environment?

Can you show the "dnstap:" block of settings from your config, and the
"num-threads" server setting?

Does fstrm's "make check" test suite succeed?

What version of protobuf-c are you using? (Did you compile from source,
or did you use a packaged version?)

What OS version are you using? (Based on your mention of the Debian
package from experimental, I would guess Debian or Ubuntu.)

Are you using a uniprocessor or SMP machine? Also, since there are some
architecture-specific parts in fstrm, what architecture are you using?

Thanks!

--
Robert Edmonds
***@debian.org

Rogerio Bastos

2015-04-07 11:46:51 UTC

Post by Robert Edmonds

Sorry to hear that. I would be happy to help debug dnstap (I wrote the
dnstap patchset for Unbound). Can I get some information about your
environment?
Can you show the "dnstap:" block of settings from your config, and the
"num-threads" server setting?

I'm using optimisation settings based on [1] (the Debian version is
compiled with libevent):

server:
    num-threads: 2

    msg-cache-slabs: 2
    rrset-cache-slabs: 2
    infra-cache-slabs: 2
    key-cache-slabs: 2

    rrset-cache-size: 100m
    msg-cache-size: 50m

    outgoing-range: 8192
    num-queries-per-thread: 4096

    so-rcvbuf: 4m
    so-sndbuf: 4m

I'm using the example from dnstap's site [2]:

dnstap:
    dnstap-enable: yes
    dnstap-socket-path: "/var/run/unbound/dnstap.sock"
    dnstap-send-identity: yes
    dnstap-send-version: yes
    dnstap-log-resolver-response-messages: yes
    dnstap-log-client-query-messages: yes

Post by Robert Edmonds
Does fstrm's "make check" test suite succeed?

Yes, all tests pass.

Post by Robert Edmonds
What version of protobuf-c are you using? (Did you compile from source,
or did you use a packaged version?)

The packaged version from Debian Jessie (version 1.0.2).

Post by Robert Edmonds
What OS version are you using? (Based on your mention of the Debian
package from experimental, I would guess Debian or Ubuntu.)

Debian Jessie, the next-stable version.

Post by Robert Edmonds
Are you using a uniprocessor or SMP machine? Also, since there are some
architecture-specific parts in fstrm, what architecture are you using?

I'm using an amd64 virtual machine with a two-core CPU.

[1] https://www.unbound.net/documentation/howto_optimise.html
[2] http://dnstap.info/Examples/

Robert Edmonds

2015-04-08 03:43:32 UTC

Hi, Rogerio:

Thanks for these details. I can easily spin up a dual-core amd64 VM
running Debian jessie soon and try to replicate the problem.

Do you get a segfault immediately, or does it only occur after running
for some time under load?

Can you try testing with "num-threads: 1"? (This will still result in
multiple threads running in the Unbound process, but the dnstap I/O
thread will only be consuming data from a single worker thread.)

Also, can you compile your unbound package with debugging symbols and
obtain a backtrace from a crash? You should be able to build a
debugging enabled package with:

DEB_BUILD_OPTIONS='nostrip debug' dpkg-buildpackage -b -uc -us

Then, run "gdb --args unbound -d" until it crashes, and at the gdb
prompt run:

thread apply all bt full

Thanks!

--
Robert Edmonds
***@debian.org

Rogerio Bastos

2015-04-08 12:23:42 UTC

Hi,

I appreciate your help.

Post by Robert Edmonds
Thanks for these details, I can easily spin up a dual core amd64 VM
running Debian jessie soon and try to replicate the problem.
Do you get a segfault immediately, or does it only occur after running
for some time under load?

The segfault occurs after some time under load, not immediately.

Post by Robert Edmonds
Can you try testing with "num-threads: 1"? (This will still result in
multiple threads running in the Unbound process, but the dnstap I/O
thread will only be consuming data from a single worker thread.)

I get the same error with "num-threads: 1".

Post by Robert Edmonds
Also, can you compile your unbound package with debugging symbols and
obtain a backtrace from a crash? You should be able to build a
debugging enabled package with:
DEB_BUILD_OPTIONS='nostrip debug' dpkg-buildpackage -b -uc -us
Then, run "gdb --args unbound -d" until it crashes, and at the gdb
prompt run:
thread apply all bt full

The output is attached.

Robert Edmonds

2015-04-08 18:26:06 UTC

Hi, Rogerio:

Based on the stack trace, it looks like the crash is occurring when a
TCP query times out, and we try to log it as if it were a normal TCP
response. Could you try the attached patch and see if it avoids the
crash?

Thanks!

#0 serviced_tcp_callback (c=0x0, arg=0x555556b853b0, error=-2, rep=0x0) at services/outside_network.c:1596
r2 = {c = 0x3000000010, addr = {ss_family = 59744, __ss_align = 140737488349344,
__ss_padding = "\000T\217\\\337`*\333\004\000\000\000}\000\000\000\000\000\000\000\377\177\000\000}", '\000' <repeats 15 times>, "\270\333vVUU\000\000\200pv\367\377\177\000\000\240\333vVUU", '\000' <repeats 14 times>, "\001\000\000\000\252\\T\367\377\177\000\000\000\000\000\000\004\000\000\002\000T\217\\\337`*\333\000\000\000\000\001\000\000"},
addrlen = 4149479993, srctype = 32767, pktinfo = {v6info = {ipi6_addr = {__in6_u = {
__u6_addr8 = "\000\000\000\000\000\000\000\000\060\300]VUU\000", __u6_addr16 = {0, 0, 0, 0, 49200, 22109,
21845, 0}, __u6_addr32 = {0, 0, 1448984624, 21845}}}, ipi6_ifindex = 1432264672}, v4info = {
ipi_ifindex = 0, ipi_spec_dst = {s_addr = 0}, ipi_addr = {s_addr = 1448984624}}}}
#1 0x00005555555e815a in outnet_tcptimer (arg=0x5555566d9630) at services/outside_network.c:1120
w = 0x5555566d9630
outnet = 0x555555aa20e0
cb = 0x5555555e9fe0 <serviced_tcp_callback>
cb_arg = 0x555556b853b0
__func__ = "outnet_tcptimer"
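
The trace above shows the callback being entered from the timer path with
c=0x0 and error=-2, that is, with no connection to read a reply from. The
patch itself is not included in the thread, so the following is only a minimal
sketch of the general kind of guard that avoids this class of crash: log a
response only when the callback reports success and a connection is actually
present. Every name here (EV_NOERROR, EV_TIMEOUT, struct conn, struct log_env,
log_response, tcp_callback) is hypothetical and chosen for illustration; none
of it is Unbound's real API or the actual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes. The value -2 only mirrors error=-2 in the
 * trace; these are not Unbound's definitions. */
enum { EV_NOERROR = 0, EV_CLOSED = -1, EV_TIMEOUT = -2 };

/* Hypothetical stand-in for a TCP connection that holds a reply. */
struct conn {
    const unsigned char *reply;
    size_t reply_len;
};

/* Hypothetical logging environment; it just counts what was logged. */
struct log_env {
    int enabled;
    size_t messages_logged;
    size_t bytes_logged;
};

/* Logs one response. It dereferences c, so it needs a real connection. */
static void log_response(struct log_env *env, const struct conn *c)
{
    assert(c != NULL);
    env->messages_logged++;
    env->bytes_logged += c->reply_len;
}

/* Callback invoked both when a reply arrives and when the query times out.
 * On the timeout path the caller passes c == NULL and a non-zero error, so
 * logging is skipped unless the event is a successful reply.
 * Returns 1 if the response was logged, 0 otherwise. */
static int tcp_callback(struct log_env *env, const struct conn *c, int error)
{
    if (env != NULL && env->enabled && error == EV_NOERROR && c != NULL) {
        log_response(env, c);
        return 1;
    }
    return 0;
}
```

With this guard, a call such as tcp_callback(&env, NULL, EV_TIMEOUT) returns 0
without touching the connection, while a successful reply is still logged.
Other error cases (a closed connection, for example) take the same early exit.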

--
Robert Edmonds
***@debian.org

Rogerio Bastos

2015-04-08 19:32:07 UTC

Post by Robert Edmonds
Based on the stack trace, it looks like the crash is occurring when a
TCP query times out, and we try to log it as if it were a normal TCP
response. Could you try the attached patch and see if it avoids the
crash?

Yes, this patch fixes the crash.

Thanks!

Robert Edmonds

2015-04-10 20:37:28 UTC

Post by Rogerio Bastos

Yes, this patch fixes the crash.

OK, great. I'll look and see if there are any more error cases that
need to be excluded and submit a proper patch to the Unbound bug
tracker.

Thanks for testing!

--
Robert Edmonds
***@debian.org