How to debug system failure

Post by **PoltoS** » 06 Jan 2017 20:12

Hm.. looks weired! We need a lot to help here. "strace -fp <pid>" and 'lsof -n | grep z-way-server".

Is it reproducible? Try to disable apps and see if it stops

mrnmukkas · Post by **mrnmukkas** » 07 Jan 2017 01:58

I guess I should do this after it breaks huh? Right now I'm running a scheduled restart of z-way-server every six hours to prevent this from happening. If I don't it usually breaks withing 12 hours. The funny thing is that it runs normally for quite a while, when I ran lsof after restarting the server and it showed only 77 files being open and kept steady at that number for at least an hour. Then I left for work and got home in the evening, and bam, >64.000 files and the server is down. So it's reproducible alright, I just need to leave it running for some time.

Anyway, I'm exporting all my sensor data to a database at 30 minute intervals, as a side effect this allows me to see when the server stops responding since I get no data during that time. Here's a summary of the crashes so far:

Code: Select all

Down 2017-01-01 19:00, Up 2017-01-02 03:00
Down 2017-01-02 14:30, Up 2017-01-02 16:00
Down 2017-01-03 12:00, Up 2017-01-03 13:30
Down 2017-01-04 09:30, Up 2017-01-04 13:30
Down 2017-01-05 11:00, Up 2017-01-05 13:30
Down 2017-01-05 21:30, Up 2017-01-06 06:30

Up is when I discover that the server is down and manually restart it. Worth noting is that I have no downtime recorded before new years, so this has apparently only been a problem during 2017.

The only thing that I've done differently lately is that I'm running the MQTT app and Node Red, but I started doing that back in december.

I'll disable the restart and get back to you with the results.

mrnmukkas · Post by **mrnmukkas** » 10 Jan 2017 01:56

Ok, so here are roughly 2000 lines from strace and the results from lsof:
http://mukkas.se/blandat/strace-zway.txt
http://mukkas.se/blandat/lsof-zway.txt

Post by **PoltoS** » 10 Jan 2017 22:57

I see a lot of:
[pid 19199] <... dup resumed> ) = -1 EMFILE (Too many open files)
...
[pid 19199] accept(7, 0x742febb8, [16]) = -1 EMFILE (Too many open files)

19199 (based on your lsof file) is zway/sockets thread. Means you have something working with sockets having a lot open. So it is not about files, but about sockets accepted and not closed.

Don't you have some MQTT or something else that forgot to close a socket after communication?

Can you also do "sudo netstat -nap"

mrnmukkas · Post by **mrnmukkas** » 11 Jan 2017 04:31

Ah I see, it's quite possible it's because of the MQTT plugin, we're currently troubleshooting an issue with it over on github:
https://github.com/Edubits/Zway-MQTT/issues/16

When I started looking into the zway crashes I noticed that the MQTT plugin had a lot of errors in the zway log (thousands in a day). I just updated to a new version yesterday and after that the errors disappeared almost entirely. I'll keep the server running without restarting it for a few days and see if zway has stopped crashing as well.

Here's the results from netstat:
http://mukkas.se/blandat/netstat-zway.txt
mosquitto is the MQTT broker btw, also this is from when the server is running normally. I can get back with an update if it breaks again.

Post by **PoltoS** » 11 Jan 2017 09:32

This netstat looks ok - not too many open ports. Looks you have fixed that with the MQTT changes.

BobElHat · Post by **BobElHat** » 14 Jan 2017 16:09

I've been having occasional crashes too. Previously, they seemed to be happening around the time Minux's Hue module ran its refresh, but I rewrote that to use a single http.request and that seemed to have stopped it - but last night it happened again.

Last logs were:

Code: Select all

[2017-01-14 02:25:41.772] [D] [zway] SETDATA controller.data.incomingPacket.nodeId = 12 (0x0000000c)
[2017-01-14 02:25:41.772] [D] [zway] SETDATA controller.data.incomingPacket.frameType = "singlecast"
[2017-01-14 02:25:41.772] [D] [zway] SETDATA controller.data.incomingPacket = **********
[2017-01-14 02:25:41.772] [D] [zway] SETDATA devices.12.data.lastReceived = 0 (0x00000000)
[2017-01-14 02:25:41.772] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.deviceScale = 1 (0x00000001)
[2017-01-14 02:25:41.773] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.scale = 1 (0x00000001)
[2017-01-14 02:25:41.775] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.scaleString = "Lux"
[2017-01-14 02:25:41.775] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3.val = 0.000000
[2017-01-14 02:25:41.776] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.3 = Empty
[2017-01-14 02:25:41.873] [D] [zway] RECEIVED: ( 01 0B 00 04 00 0C 05 31 05 1B 01 00 D7 )
[2017-01-14 02:25:41.873] [D] [zway] SENT ACK
[2017-01-14 02:25:41.873] [D] [zway] SETDATA controller.data.incomingPacket.nodeId = 12 (0x0000000c)
[2017-01-14 02:25:41.873] [D] [zway] SETDATA controller.data.incomingPacket.frameType = "singlecast"
[2017-01-14 02:25:41.874] [D] [zway] SETDATA controller.data.incomingPacket = **********
[2017-01-14 02:25:41.874] [D] [zway] SETDATA devices.12.data.lastReceived = 0 (0x00000000)
[2017-01-14 02:25:41.874] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.deviceScale = 0 (0x00000000)
[2017-01-14 02:25:41.874] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.scale = 0 (0x00000000)
[2017-01-14 02:25:41.875] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.scaleString = "UV index"
[2017-01-14 02:25:41.876] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27.val = 0.000000
[2017-01-14 02:25:41.876] [D] [zway] SETDATA devices.12.instances.0.commandClasses.49.data.27 = Empty

That looks like it getting an update from my Aeon Labs Multisensor 6. The last Hue refresh was 13 seconds before the final log entry - but as it runs every 30s I'm not sure how much we can read into that.

Nothing of note in /var/log/messages or syslog around the apparent crash time and I'd restarted before doing the lost and netstat so nothing there either.

The frequency of crashes seemed to reduce in proportion to the lower frequency of http.request calls from the Hue module. I also get the following in my logs every time the Hue refresh runs:

Code: Select all

[2017-01-14 02:17:27.831] [I] [core] Phillips Hue Refresh All Hue Lights
[2017-01-14 02:17:27.844] [I] [core] [object Object]

The [core] [object Object] line is curious - there is no console.log in the module that's doing that, but it appears consistently. No idea where it's coming from.

I don't use the MQTT module, but I guess that uses http.request a lot too?

Post by **PoltoS** » 14 Jan 2017 16:35

can you run strace -fp <pid> and wait for the crash to send us last 1000 lines. Also worth to attach with gdb and once it crash send us "info thr" and "bt" output. We have seen three different customers reporting for crashes, we will try to collect all info and find the problem. Please send logs and output directly to ps@

BobElHat · Post by **BobElHat** » 14 Jan 2017 19:17

OK, strace is running in a screen session and I'll report back if I see another crash.

Turns out the weird [core] [object Object] message actually was from the module code, so that's probably unrelated.

Boxson · Post by **Boxson** » 15 Jan 2017 02:49

Hi,

after disabling the app "periodical Switch polling" i didn't face any more crashes.

regards

Boxson

forum

How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure

Re: How to debug system failure