How to debug system failure

Discussions about Z-Way software and Z-Wave technology in general
User avatar
PoltoS
Posts: 7594
Joined: 26 Jan 2011 19:36

Re: How to debug system failure

Post by PoltoS »

Hm.. looks weired! We need a lot to help here. "strace -fp <pid>" and 'lsof -n | grep z-way-server".

Is it reproducible? Try to disable apps and see if it stops
mrnmukkas
Posts: 10
Joined: 15 Jan 2016 18:16

Re: How to debug system failure

Post by mrnmukkas »

I guess I should do this after it breaks huh? Right now I'm running a scheduled restart of z-way-server every six hours to prevent this from happening. If I don't it usually breaks withing 12 hours. The funny thing is that it runs normally for quite a while, when I ran lsof after restarting the server and it showed only 77 files being open and kept steady at that number for at least an hour. Then I left for work and got home in the evening, and bam, >64.000 files and the server is down. So it's reproducible alright, I just need to leave it running for some time.

Anyway, I'm exporting all my sensor data to a database at 30 minute intervals, as a side effect this allows me to see when the server stops responding since I get no data during that time. Here's a summary of the crashes so far:

Code: Select all

Down 2017-01-01 19:00, Up 2017-01-02 03:00
Down 2017-01-02 14:30, Up 2017-01-02 16:00
Down 2017-01-03 12:00, Up 2017-01-03 13:30
Down 2017-01-04 09:30, Up 2017-01-04 13:30
Down 2017-01-05 11:00, Up 2017-01-05 13:30
Down 2017-01-05 21:30, Up 2017-01-06 06:30
Up is when I discover that the server is down and manually restart it. Worth noting is that I have no downtime recorded before new years, so this has apparently only been a problem during 2017.

The only thing that I've done differently lately is that I'm running the MQTT app and Node Red, but I started doing that back in december.

I'll disable the restart and get back to you with the results.
mrnmukkas
Posts: 10
Joined: 15 Jan 2016 18:16

Re: How to debug system failure

Post by mrnmukkas »

Ok, so here are roughly 2000 lines from strace and the results from lsof:
http://mukkas.se/blandat/strace-zway.txt
http://mukkas.se/blandat/lsof-zway.txt
User avatar
PoltoS
Posts: 7594
Joined: 26 Jan 2011 19:36

Re: How to debug system failure

Post by PoltoS »

I see a lot of:
[pid 19199] <... dup resumed> ) = -1 EMFILE (Too many open files)
...
[pid 19199] accept(7, 0x742febb8, [16]) = -1 EMFILE (Too many open files)

19199 (based on your lsof file) is zway/sockets thread. Means you have something working with sockets having a lot open. So it is not about files, but about sockets accepted and not closed.

Don't you have some MQTT or something else that forgot to close a socket after communication?

Can you also do "sudo netstat -nap"
mrnmukkas
Posts: 10
Joined: 15 Jan 2016 18:16

Re: How to debug system failure

Post by mrnmukkas »

Ah I see, it's quite possible it's because of the MQTT plugin, we're currently troubleshooting an issue with it over on github:
https://github.com/Edubits/Zway-MQTT/issues/16

When I started looking into the zway crashes I noticed that the MQTT plugin had a lot of errors in the zway log (thousands in a day). I just updated to a new version yesterday and after that the errors disappeared almost entirely. I'll keep the server running without restarting it for a few days and see if zway has stopped crashing as well.

Here's the results from netstat:
http://mukkas.se/blandat/netstat-zway.txt
mosquitto is the MQTT broker btw, also this is from when the server is running normally. I can get back with an update if it breaks again.
User avatar
PoltoS
Posts: 7594
Joined: 26 Jan 2011 19:36

Re: How to debug system failure

Post by PoltoS »

This netstat looks ok - not too many open ports. Looks you have fixed that with the MQTT changes.
BobElHat
Posts: 23
Joined: 27 Dec 2016 21:32

Re: How to debug system failure

Post by BobElHat »

I've been having occasional crashes too. Previously, they seemed to be happening around the time Minux's Hue module ran its refresh, but I rewrote that to use a single http.request and that seemed to have stopped it - but last night it happened again.

Last logs were:

Code: Select all

[2017-01-14 02:25:41.772] [D] [zway] SETDATA controller.data.incomingPacket.nodeId = 12 (0x0000000c)
[2017-01-14 02:25:41.772] [D] [zway] SETDATA controller.data.incomingPacket.frameType = "singlecast"
[2017-01-14 02:25:41.772] [D] [zway] SETDATA controller.data.incomingPacket = **********
[2017-01-14 02:25:41.772] [D] [zway] SETDATA devices.12.data.lastReceived = 0 (0x00000000)
[2017-01-14 02:25:41.772] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.deviceScale = 1 (0x00000001)
[2017-01-14 02:25:41.773] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.scale = 1 (0x00000001)
[2017-01-14 02:25:41.775] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.scaleString = "Lux"
[2017-01-14 02:25:41.775] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.val = 0.000000
[2017-01-14 02:25:41.776] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3 = Empty
[2017-01-14 02:25:41.873] [D] [zway] RECEIVED: ( 01 0B 00 04 00 0C 05 31 05 1B 01 00 D7 )
[2017-01-14 02:25:41.873] [D] [zway] SENT ACK
[2017-01-14 02:25:41.873] [D] [zway] SETDATA controller.data.incomingPacket.nodeId = 12 (0x0000000c)
[2017-01-14 02:25:41.873] [D] [zway] SETDATA controller.data.incomingPacket.frameType = "singlecast"
[2017-01-14 02:25:41.874] [D] [zway] SETDATA controller.data.incomingPacket = **********
[2017-01-14 02:25:41.874] [D] [zway] SETDATA devices.12.data.lastReceived = 0 (0x00000000)
[2017-01-14 02:25:41.874] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.deviceScale = 0 (0x00000000)
[2017-01-14 02:25:41.874] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.scale = 0 (0x00000000)
[2017-01-14 02:25:41.875] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.scaleString = "UV index"
[2017-01-14 02:25:41.876] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.val = 0.000000
[2017-01-14 02:25:41.876] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27 = Empty
That looks like it getting an update from my Aeon Labs Multisensor 6. The last Hue refresh was 13 seconds before the final log entry - but as it runs every 30s I'm not sure how much we can read into that.

Nothing of note in /var/log/messages or syslog around the apparent crash time and I'd restarted before doing the lost and netstat so nothing there either.

The frequency of crashes seemed to reduce in proportion to the lower frequency of http.request calls from the Hue module. I also get the following in my logs every time the Hue refresh runs:

Code: Select all

[2017-01-14 02:17:27.831] [I] [core] Phillips Hue Refresh All Hue Lights
[2017-01-14 02:17:27.844] [I] [core] [object Object]
The [core] [object Object] line is curious - there is no console.log in the module that's doing that, but it appears consistently. No idea where it's coming from.

I don't use the MQTT module, but I guess that uses http.request a lot too?
User avatar
PoltoS
Posts: 7594
Joined: 26 Jan 2011 19:36

Re: How to debug system failure

Post by PoltoS »

can you run strace -fp <pid> and wait for the crash to send us last 1000 lines. Also worth to attach with gdb and once it crash send us "info thr" and "bt" output. We have seen three different customers reporting for crashes, we will try to collect all info and find the problem. Please send logs and output directly to ps@
BobElHat
Posts: 23
Joined: 27 Dec 2016 21:32

Re: How to debug system failure

Post by BobElHat »

OK, strace is running in a screen session and I'll report back if I see another crash.

Turns out the weird [core] [object Object] message actually was from the module code, so that's probably unrelated.
Boxson
Posts: 29
Joined: 21 Sep 2016 20:00

Re: How to debug system failure

Post by Boxson »

Hi,

after disabling the app "periodical Switch polling" i didn't face any more crashes.

regards

Boxson
Post Reply