ZWAY server fails at 03:00 regularly

Discussions about RaZberry - Z-Wave board for Raspberry computer
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

ZWAY server fails at 03:00 regularly

Post by ScotsDon »

I am a private individual who has developed a Z-Wave system using Python and not any "commercial" app. This system has been running faultlessly for the last 7 years on a Raspberry Pi 1 (since Raspbiam Wheezy, then Jessie) using the Razberry PCB. I changed the Razberry PCB to the newer Gen-5 one over 3 years years ago. The system is small, comprising 2 "Secure" 2-pole boiler relays, 2 13A plug-in plugs, 1 Gen-5 range extender and I Fibaro multi-sensor.

I recently had a spare RPi-3B+, so I decided to swap this for the Rpi-1. I did a clean install of Bullseye, set up all the other stuff (VNC, Samba, a connection to a NAS (useful maybe), zway v3.2.2, ssmtp email, NUT for battery backup, Fibaro PIR bindings, CRON), restored the Zway system from a backup, and transfered all my Python software across, and the system seemed to run fine. I did discover and fix one timing bug which was due to the significantly increased speed of the pi-3.

CRON runs the main application every 15 minutes 24/7. This application uses various data to set the positions of the relays and writes out those settings every time (this is so that, in the event of a power cut, the settings are restored within a maximum of 15 minutes - being in rural Wales, we get power cuts not infrequently).

So every 24 hours, this app run 24x4 = 96 times and does not fail, except at 03:00 overnight. One of the relays does change state at this time, but the sequence of ZWay commands does not change. The failure occurs on the first connection to ZWay at 03:00, which is a Get() - I tried a different command and it still failed. After the failure, the Zway server does not respond at all, nor will it do a Soft Reset. A Soft Reboot of the pi itself does restore everything. I get error mesages by email and the main error message in the Python Traceback reports is:

ConnectionRefusedError: [Errno 111] Connection refused

I can supply masses of data, logs etc, but because of the size of it, I think it better to supply what is required on request. I have looked at the system log and the z-way-server log and I can't see why this is happening. The only entry in the z-way-server log at 03:00 is:

[2022-01-10 03:00:00.875] [E] [zway] Dropping packet to myself (node 1)


The last few entries in the log are:

[2022-01-10 02:45:44.123] [D] [zway] SENT ACK
[2022-01-10 02:45:44.123] {I} [zway] Job 0x13 (SwitchBinary Get to node 7): Delivered
[2022-01-10 02:45:44.124] [D] [zway] SendData Response with callback 0xac received: received by recipient
[2022-01-10 02:45:44.124] [D] [zway] SETDATA devices.7.data.lastSendInternal = **********
[2022-01-10 02:45:44.124] [D] [zway] SETDATA devices.7.data.lastSend = 3460226 (0x0034cc82)
[2022-01-10 02:45:44.124] [D] [zway] Job 0x13 (SwitchBinary Get to node 7): success
[2022-01-10 02:45:44.124] {I} [zway] Waiting for job reply: SwitchBinary Get from node 7
[2022-01-10 02:45:44.133] [D] [zway] RECEIVED: ( 01 09 00 04 00 07 03 25 03 00 D0 )
[2022-01-10 02:45:44.133] [D] [zway] SENT ACK
[2022-01-10 02:45:44.134] [D] [zway] SETDATA devices.7.data.lastReceived = 0 (0x00000000)
[2022-01-10 02:45:44.134] [D] [zway] Received reply on job (SwitchBinary Get to node 7)
[2022-01-10 02:45:44.134] [D] [zway] SETDATA devices.7.instances.0.commandClasses.37.data.level = False
[2022-01-10 02:45:44.399] [D] [zway] Job 0x13: deleted from queue
[2022-01-10 02:45:45.232] [D] [zway] Job 0x13: deleted from queue
[2022-01-10 02:45:49.883] [D] [zway] Job 0x13: deleted from queue
[2022-01-10 02:45:57.095] [D] [zway] Job 0x13: deleted from queue
[2022-01-10 02:46:04.256] [D] [zway] Job 0x13: deleted from queue
[2022-01-10 02:59:00.258] {I} [zway] Saving configuration data to config/zddx/cf22bda8-DevicesData.xml
[2022-01-10 03:00:00.875] [E] [zway] Dropping packet to myself (node 1)

(Note the {I} had square brackets but I have changed it because of a formatting issue).

The relevant entries around 03:00 in the Syslog are:

Jan 10 02:45:02 raspberrypi-3 CRON[3149]: (pi) CMD (/home/pi/Documents/Script_Run_control_heating.sh)
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: bits received from HRNG source: 360064
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: bits sent to kernel pool: 303072
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: entropy added to kernel pool: 303072
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS 140-2 successes: 18
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS 140-2 failures: 0
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS 140-2(2001-10-10) Monobit: 0
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS 140-2(2001-10-10) Poker: 0
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS 140-2(2001-10-10) Runs: 0
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS 140-2(2001-10-10) Long run: 0
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS 140-2(2001-10-10) Continuous run: 0
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: HRNG source speed: (min=556.970; avg=747.530; max=959.531)Kibits/s
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: FIPS tests speed: (min=25.196; avg=28.357; max=56.430)Mibits/s
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: Lowest ready-buffers level: 2
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: Entropy starvations: 0
Jan 10 02:58:59 raspberrypi-3 rngd[679]: stats: Time spent starving for entropy: (min=0; avg=0.000; max=0)us
Jan 10 03:00:01 raspberrypi-3 CRON[3177]: (pi) CMD (/home/pi/Documents/Script_Run_control_heating.sh)
Jan 10 03:00:04 raspberrypi-3 sSMTP[3180]: Creating SSL connection to host
Jan 10 03:00:04 raspberrypi-3 sSMTP[3180]: SSL connection using ECDHE_RSA_AES_256_GCM_SHA384
Jan 10 03:00:06 raspberrypi-3 sSMTP[3180]: Sent mail for pi@raspberrypi-3 (221 2.0.0 closing connection q3sm5548604wrr.55 - gsmtp) uid=1000 username=pi outbytes=3126
Jan 10 03:10:01 raspberrypi-3 CRON[3202]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jan 10 03:15:01 raspberrypi-3 CRON[3215]: (pi) CMD (/home/pi/Documents/Script_Run_control_heating.sh)
Jan 10 03:15:04 raspberrypi-3 sSMTP[3219]: Creating SSL connection to host
Jan 10 03:15:04 raspberrypi-3 sSMTP[3219]: SSL connection using ECDHE_RSA_AES_256_GCM_SHA384
Jan 10 03:15:06 raspberrypi-3 sSMTP[3219]: Sent mail for pi@raspberrypi-3 (221 2.0.0 closing connection 9sm6759479wrz.90 - gsmtp) uid=1000 username=pi outbytes=3126
Jan 10 03:17:01 raspberrypi-3 CRON[3223]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 10 03:25:01 raspberrypi-3 CRON[3241]: (pi) CMD (/home/pi/Documents/Script_Reboot.sh)
Jan 10 03:30:01 raspberrypi-3 CRON[3253]: (pi) CMD (/home/pi/Documents/Script_Run_control_heating.sh)

The ZWay commands I use in Python are:

#Refresh data store in Pi
responseAction = urllib.request.urlopen('http://localhost:8083/ZWaveAPI/Run/devices[' + Device + '].instances[' + DevInst+ '].commandClasses.SwitchBinary.Get()')

#Test if Device is on
ResponseDeviceStatus = urllib.request.urlopen('http://localhost:8083/ZWaveAPI/Run/devices[' + Device + '].data.isFailed.value')

#Test relay state - get the last update time and the last relay state
responseUpdTime = urllib.request.urlopen('http://localhost:8083/ZWaveAPI/Run/devices[' + Device + '].instances[' + DevInst+ '].commandClasses.SwitchBinary.data.level.updateTime').read()

RelayStatus = urllib.request.urlopen('http://localhost:8083/ZWaveAPI/Run/devices[' + Device + '].instances[' + DevInst+ '].commandClasses.SwitchBinary.data.level.value').read()

#Set relay
responseAction = urllib.request.urlopen('http://localhost:8083/ZWaveAPI/Run/devices[' + Device + '].instances[' + DevInst+ '].commandClasses.SwitchBinary.Set(' + DevState + ')')

Where "Device" is the device number, e.g. '2', "DevInst" is '1' or '2' if multicontact, "DevState" = '0' or '255'.


I am stumped!! Can anybody advise?
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

Re: ZWAY server fails at 03:00 regularly

Post by ScotsDon »

Further information:

My process runs at 00, 15, 30, 45 minutes past the hour (CRON).

I set the time of the Pi manually to 02:59 and waited a minute. At 03:00, my process ran and the ZWay server failed again, as previously.

I rebooted and set the time to 03:14, 03:29 and 03:59. In all 3 cases, a minute later, my process ran without error.

It looks like the time 03:00 has significance, but I have no idea what that might be.
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

Re: ZWAY server fails at 03:00 regularly

Post by ScotsDon »

Further info:

I tried running my process at 03:01 by changing the Cron entry, and setting the Pi clock to 02:58. This time, the first device, Instance 1, on the list did not respond, even after 9 re-tries, but the remaining devices gave no error and appear to be OK. including a change of state to the second instance of the first device. The 9 re-tries are spaced out in time with a few seconds between each one, so 9 re-tries takes around 45 seconds (not timed though). The ZWay server remained working.
seattleneil
Posts: 172
Joined: 02 Mar 2020 22:41

Re: ZWAY server fails at 03:00 regularly

Post by seattleneil »

In addition to checking linux cron jobs that run at 3:00, here's a couple of ideas:

1. Check Z-Way's internal cron jobs:
curl -s -u admin:[ADMIN PASSWORD] 127.0.0.1:8083/ZAutomation/api/v1/instances/Cron

The json output is virtually impossible to interpret, so I use an online json pretty printer to format the output. Is there a Z-Way automation that runs at 3:00?

2. Check if systemd has any timers that run at 3:00:
sudo systemctl list-timers

3. Check the status of the Z-Way TCP connection (port 8083) (when the system is happy and then when the system hangs) with these 2 commands:
sudo lsof -n -i :8083
netstat -ant | grep 8083

Is the z-way-server listening? What (if anything) is connected to TCP port 8083 when Z-Way hangs?

4. Check what happens if you temporarily modify your python script to determine the current time and if it's 3:00 AM, exit the script.

5. You might see a network problem by running strace on the z-way-server process:
sudo strace -f -e trace=network,ipc -p `pidof z-way-server`

The strace command can provide a ton of information besides network-related system calls (with other parameters).
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

Re: ZWAY server fails at 03:00 regularly

Post by ScotsDon »

Thanks! Will try what you suggest tomorrow. As a work-around, I have set the pi to reboot at 03:10, which is one option. Will report back tomorrow.
seattleneil
Posts: 172
Joined: 02 Mar 2020 22:41

Re: ZWAY server fails at 03:00 regularly

Post by seattleneil »

A guess...

The z-way log file shows:
[2022-01-10 02:59:00.258] {I} [zway] Saving configuration data to config/zddx/cf22bda8-DevicesData.xml
[2022-01-10 03:00:00.875] [E] [zway] Dropping packet to myself (node 1)

The file cf22bda8-DevicesData.xml is likely quite large (check: ls -l /opt/z-way-server/config/zddx/cf22bda8-DevicesData.xml)

My guess (and it's really just a wild guess) is that your SD card is slow to write the file and that's causing Z-Way to hang (at least until the write completes). Possibly there's something else going on at 3:00 that's doing a lot of file I/O.

If you want to test this guess, stop the z-way-server process, create a symbolic link for /opt/z-way-server/config/zddx to a RAM-based file system (e.g., a directory under /run), copy cf22bda8-DevicesData.xml to the RAM-based zddx directory, and then start the z-way-server process. Be sure to retain a copy of cf22bda8-DevicesData.xml somewhere safe, like /home/pi.

It will be interesting to learn what you discover.
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

Re: ZWAY server fails at 03:00 regularly

Post by ScotsDon »

A slow SD card is a definite possibilty. It is an old one, and I noticed it was quite slow when backing up the image the other day. It's not Cat 10 (or whatever it's called).
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

Re: ZWAY server fails at 03:00 regularly

Post by ScotsDon »

seattleneil wrote:
10 Jan 2022 22:31
In addition to checking linux cron jobs that run at 3:00, here's a couple of ideas:

No System CRON jobs run at or near to 03:00. Log Rotates etc run at 06:xx

1. Check Z-Way's internal cron jobs:
curl -s -u admin:[ADMIN PASSWORD] 127.0.0.1:8083/ZAutomation/api/v1/instances/Cron

The json output is virtually impossible to interpret, so I use an online json pretty printer to format the output. Is there a Z-Way automation that runs at 3:00?

I ran this using Firefox and this URL:
http://192.168.1.10:8083/ZAutomation/ap ... ances/Cron
After logging in, Firefox decoded the JSON itself and presented it in an easily readable form. The result is that there are several polling requests that could be scheduled, but none are.


2. Check if systemd has any timers that run at 3:00:
sudo systemctl list-timers

None

3. Check the status of the Z-Way TCP connection (port 8083) (when the system is happy and then when the system hangs) with these 2 commands:
sudo lsof -n -i :8083
netstat -ant | grep 8083

When happy (not yet tested when unhappy, see below):
pi@raspberrypi-3:~ $ sudo lsof -n -i :8083
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
z-way-ser 397 root 6u IPv4 13515 0t0 TCP *:8083 (LISTEN)
pi@raspberrypi-3:~ $ netstat -ant | grep 8083
tcp 0 0 0.0.0.0:8083 0.0.0.0:* LISTEN


Is the z-way-server listening? What (if anything) is connected to TCP port 8083 when Z-Way hangs?

4. Check what happens if you temporarily modify your python script to determine the current time and if it's 3:00 AM, exit the script.

Have done a scheduled reboot instead, since this means I can try out things and the 03:00 process will still run, testing the change.

5. You might see a network problem by running strace on the z-way-server process:
sudo strace -f -e trace=network,ipc -p `pidof z-way-server`

The strace command can provide a ton of information besides network-related system calls (with other parameters).

Not done yet.
I can't find anything going on at 03:00. I'm going to try a much faster SD Card as the one I was using (which was an old spare) is only Class 6 (RPi needs Class 10 or better).
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

Re: ZWAY server fails at 03:00 regularly

Post by ScotsDon »

seattleneil wrote:
10 Jan 2022 23:41
A guess...

The z-way log file shows:
[2022-01-10 02:59:00.258] {I} [zway] Saving configuration data to config/zddx/cf22bda8-DevicesData.xml
[2022-01-10 03:00:00.875] [E] [zway] Dropping packet to myself (node 1)

The file cf22bda8-DevicesData.xml is likely quite large (check: ls -l /opt/z-way-server/config/zddx/cf22bda8-DevicesData.xml)

My guess (and it's really just a wild guess) is that your SD card is slow to write the file and that's causing Z-Way to hang (at least until the write completes). Possibly there's something else going on at 3:00 that's doing a lot of file I/O.
I checked today's log file and that Saving.... did not happen near 03:00. So it's not that. The z-way-log showed:

[2022-01-11 02:45:42.062] [D] [zway] SETDATA devices.7.data.lastReceived = 0 (0x00000000)
[2022-01-11 02:45:42.062] [D] [zway] Received reply on job (SwitchBinary Get to node 7)
[2022-01-11 02:45:42.062] [D] [zway] SETDATA devices.7.instances.0.commandClasses.37.data.level = False
[2022-01-11 02:45:43.159] [D] [zway] Job 0x13: deleted from queue
[2022-01-11 02:45:47.821] [D] [zway] Job 0x13: deleted from queue
[2022-01-11 02:45:54.968] [D] [zway] Job 0x13: deleted from queue
[2022-01-11 02:46:02.177] [D] [zway] Job 0x13: deleted from queue
[2022-01-11 03:00:00.606] [E] [zway] Dropping packet to myself (node 1)

Nothing between 02:46 and 03:00. But I am suspicious of my SD Card - new one coming tomorrow.
ScotsDon
Posts: 50
Joined: 28 Jul 2015 20:55

Re: ZWAY server fails at 03:00 regularly

Post by ScotsDon »

seattleneil wrote:
10 Jan 2022 22:31

3. Check the status of the Z-Way TCP connection (port 8083) (when the system is happy and then when the system hangs) with these 2 commands:
sudo lsof -n -i :8083
netstat -ant | grep 8083

Is the z-way-server listening? What (if anything) is connected to TCP port 8083 when Z-Way hangs?

When Zway Server is hanging:

pi@raspberrypi-3:~ $ sudo lsof -n -i :8083
pi@raspberrypi-3:~ $ netstat -ant | grep 8083
pi@raspberrypi-3:~ $

Presumably that means something? No response at all?
Post Reply