job queue runs full after upgrade to v4.1.0

MrRusch · Post by **MrRusch** » 28 May 2024 09:38

DesT wrote: ↑
21 May 2024 13:58
@PoltoS,

Any news about the queue problem? I'm on 4.1.3 and still have to reset almost every day because the queue sits at around 500-600

Same here. System becomes unresponsive if I don't reset a couple of times every day.

Software 4.1.3
SDK Version 7.20.00
Serial API Version 07.40

seattleneil · Post by **seattleneil** » 28 May 2024 19:55

It looks like Silicon Labs (the company that produces the Z-Wave chip) has released updated SDK versions whose release notes describe reliability improvements. Ideally, @PoltoS can share plans for when Z-Way will release firmware that includes an updated SDK version.

Here's a link to Silicon Labs SDK info:

https://www.silabs.com/developers/z-wave-700-series?tab=documentation

From what I can tell, the newest Silicon Labs SDK is available for the Zooz ZST39 USB controller. Here are the relevant links:

https://www.thesmartesthouse.com/products/zooz-800-series-z-wave-long-range-usb-stick-zst39
and
https://www.getzooz.com/firmware/ZST39_SDK_7.21.3_US-LR_V01R30.zip

Note that the Z-Way home automation software only works with Z-Wave controllers running Z-Way's firmware. In other words, switching to a non-Z-Way controller such as the Zooz ZST39 will require switching to different home automation software such as Home Assistant and Z-Wave JS. Pursuing this approach essentially means abandoning Z-Way's hardware and software.

In the meantime, since Z-Wave controller reliability issues are exacerbated by network problems, you may be able to improve reliability using the diagnostics tools on the Z-Way expert UI to fix routing, signal strength and excessive communication issues.

DesT · Post by **DesT** » 30 May 2024 14:12

@PoltoS,

The queue issue is really a pain. For the past couple of days I've needed to restart 2-3x per day because the queue fills up to around 500-600 events... and of course, nothing works on Z-Wave. I can't switch a light on or off, or do anything else.

Post by **PoltoS** » 01 Jun 2024 12:19

Could we have a look at your system when it is in this state?

DesT · Post by **DesT** » 01 Jun 2024 15:27

@PoltoS,

Yeah, sure. Want me to email you when it happens? And how do you want to connect?

DesT · Post by **DesT** » 02 Jun 2024 15:49

@PoltoS,

Would you prefer to have the full log file first?

MrRusch · Post by **MrRusch** » 08 Jun 2024 00:44

PoltoS wrote: ↑
01 Jun 2024 12:19
Could we have a look at your system when it is in this state?

My system clogged up again today with 1000+ queued messages. Resetting the API and rebooting the server brings me back to a stable state, for a while.
I just sent you a PM with logs from the last couple of days at log-level 0.
Let me know if there is something else I can provide.

Thank you for your efforts.

MrRusch · Post by **MrRusch** » 11 Jun 2024 11:12

OK so once again - basically nothing works. I recently completely recreated my Z-Wave network using my RaZberry 7 Pro. A month ago things were pretty stable. I've had some problems with the queue seemingly randomly building up, but nothing a soft reset wouldn't take care of. Now though, I am suddenly unable to communicate with 90% of my devices. All show as working properly, but requests just time out.

[2024-06-11 10:06:59.442] [I] [zway] Using security scheme S0
[2024-06-11 10:06:59.443] [I] [zway] Adding job: SwitchBinary Get to node 2
[2024-06-11 10:06:59.444] [I] [zway] Node 2:0 CC Security: sending Nonce Get
[2024-06-11 10:06:59.444] [I] [zway] Adding job: Nonce Get to node 2
[2024-06-11 10:06:59.476] [I] [zway] Job 0x13 (Nonce Get to node 2): Delivered
[2024-06-11 10:06:59.476] [I] [zway] Waiting for job reply: Nonce Get from node 2
[2024-06-11 10:07:01.654] [I] [zway] Job 0x13 (Nonce Get to node 2): Reply not received before timeout, retrying
[2024-06-11 10:07:01.676] [I] [zway] Job 0x13 (Nonce Get to node 2): Delivered
[2024-06-11 10:07:01.677] [I] [zway] Waiting for job reply: Nonce Get from node 2
[2024-06-11 10:07:03.861] [I] [zway] Job 0x13 (Nonce Get to node 2): Reply not received before timeout, retrying
[2024-06-11 10:07:03.885] [I] [zway] Job 0x13 (Nonce Get to node 2): Delivered
[2024-06-11 10:07:03.885] [I] [zway] Waiting for job reply: Nonce Get from node 2
[2024-06-11 10:07:06.067] [W] [zway] Reply not received before timeout for job (Nonce Get to node 2)
[2024-06-11 10:07:10.145] [I] [zway] Node 2:0 CC Security: sending Nonce Get
[2024-06-11 10:07:10.145] [I] [zway] Adding job: Nonce Get to node 2
[2024-06-11 10:07:10.178] [I] [zway] Job 0x13 (Nonce Get to node 2): Delivered
[2024-06-11 10:07:10.178] [I] [zway] Waiting for job reply: Nonce Get from node 2
[2024-06-11 10:07:12.345] [I] [zway] Job 0x13 (Nonce Get to node 2): Reply not received before timeout, retrying
[2024-06-11 10:07:12.368] [I] [zway] Job 0x13 (Nonce Get to node 2): Delivered
[2024-06-11 10:07:12.368] [I] [zway] Waiting for job reply: Nonce Get from node 2
[2024-06-11 10:07:14.532] [I] [zway] Job 0x13 (Nonce Get to node 2): Reply not received before timeout, retrying
[2024-06-11 10:07:14.555] [I] [zway] Job 0x13 (Nonce Get to node 2): Delivered
[2024-06-11 10:07:14.555] [I] [zway] Waiting for job reply: Nonce Get from node 2
[2024-06-11 10:07:16.712] [W] [zway] Reply not received before timeout for job (Nonce Get to node 2)

And this appears to loop indefinitely, which builds up the queue and eventually leads to a complete clog.
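
To put a number on how long each failing job holds up the queue, here is a rough Python sketch (not part of Z-Way) that pairs each "Adding job" line with the matching final "[W] Reply not received before timeout" line and reports the elapsed time. It assumes the log format is exactly the one in the excerpt above; the function name `failed_job_durations` and the `SAMPLE_LOG` constant are made up for illustration, and the sample lines are copied from the excerpt.

```python
import re
from datetime import datetime

# Matches lines like: [2024-06-11 10:06:59.444] [I] [zway] <message>
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\] \[(\w)\] \[zway\] (.*)$"
)
ADD_RE = re.compile(r"^Adding job: (.+ to node \d+)$")
FAIL_RE = re.compile(r"^Reply not received before timeout for job \((.+)\)$")

# A few lines copied from the excerpt above, used only as sample input.
SAMPLE_LOG = """\
[2024-06-11 10:06:59.443] [I] [zway] Adding job: SwitchBinary Get to node 2
[2024-06-11 10:06:59.444] [I] [zway] Adding job: Nonce Get to node 2
[2024-06-11 10:06:59.476] [I] [zway] Job 0x13 (Nonce Get to node 2): Delivered
[2024-06-11 10:07:01.654] [I] [zway] Job 0x13 (Nonce Get to node 2): Reply not received before timeout, retrying
[2024-06-11 10:07:06.067] [W] [zway] Reply not received before timeout for job (Nonce Get to node 2)
"""


def failed_job_durations(lines):
    """Return (job description, seconds) for every job that finally timed out.

    The duration runs from the most recent 'Adding job' line for that job
    description to the final '[W] Reply not received' line.
    """
    started = {}   # job description -> timestamp of its latest 'Adding job'
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not in the expected format
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        msg = m.group(3)
        add = ADD_RE.match(msg)
        if add:
            started[add.group(1)] = ts
            continue
        fail = FAIL_RE.match(msg)
        if fail and fail.group(1) in started:
            delta = ts - started.pop(fail.group(1))
            results.append((fail.group(1), round(delta.total_seconds(), 3)))
    return results


if __name__ == "__main__":
    for job, seconds in failed_job_durations(SAMPLE_LOG.splitlines()):
        print(f"{job}: gave up after {seconds} s")
```

On the excerpt, each Nonce Get to node 2 occupies the queue for roughly 6.6 seconds (three attempts of about 2.2 seconds each) before it is dropped, and the next one is queued a few seconds later. To run it on a real log, pass an open file object instead of `SAMPLE_LOG.splitlines()`.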

The ones that still work are of different type, brand and security level.
I don't know what to do at this point..

What can I do that might help me avoid recreating everything again? Because let me tell you, if I end up needing to re-include my 60-something devices twice in a quarter - it won't be in z-way next time.
- Reinstalling z-way-server completely? Move the current installation out of place, reinstall new clean package, restore zbk? Should get me back to a stable state right?
- Migrate to old controller? I have an older RaZberry lying around, as well as a Z-stick gen5. Backup, shutdown, swap hardware, restart, factory reset and restore? How would my network react to me going backwards in controller gen?

EDIT: I've already switched from SD-card to SSD storage on the RPI just to make sure it wasn't I/O-related.

DesT · Post by **DesT** » 11 Jun 2024 20:57

@MrRusch I have the same issue, and it's not related to your RPi hardware, since I'm not running z-way-server on an RPi.

On my side it started when I upgraded to 4.1.3 from 4.1.1.

Even a soft reset isn't always enough: sometimes I need to do soft reset + stop + start + soft reset again to make it work for a little while.

@PoltoS, any update on that?

seattleneil · Post by **seattleneil** » 11 Jun 2024 21:34

A few things you may want to consider...

1. The queue problem you're experiencing is not experienced by most/all other Z-Way users. Presumably, there's something unique about your setup. As an experiment, I would remove power from node 2 since that node looks to be clogging up the queue. It could be another node or more than 1 node. If you restart z-way-server, the log should show what device begins the queue build-up. Odds are high that it's going to be a device that's been included with S0 security since S0 security requires extra messages. Since it only takes 1 misbehaving node to corrupt your entire system, don't worry about any devices other than the first device with retransmissions. Solve the retransmissions for the first node and repeat this approach if you're still seeing a lot of retransmissions.

2. Reinstalling z-way-server completely? I doubt this will be a productive path. I would definitely look at the expert UI diagnostic stats and identify devices with low signal strength, multiple hops and retransmissions. I'm guessing the problem you're encountering is related to an RF or Z-Wave routing problem.

3. Migrate to old controller? This could be a lot more difficult than it seems. You are assuming a backup of a Z-Way 700-series chip can be restored to a 500-series chip. You can try to do this or ask @PoltoS to see if this is supported. If a 700->500 series backup/restore isn't supported, you'll need to rebuild your entire Z-Wave network from scratch since including your 500-series chip and doing a controller shift will only be partially successful as the 500-series chip will not be assigned node ID=1 which is what you want. Note that if you upgraded from a 500-series chip to the 700-series chip using the backup/restore method (which is supported by Z-Way), then you may be able to swap the 2 controllers and do a network re-organization (without doing a backup/restore).

4. From my testing, other Z-Wave controller software (e.g., Z-Wave JS UI - https://github.com/zwave-js/zwave-js-ui) will consume more RAM and CPU as compared to Z-Way. If you're using a Pi4 or newer, this won't be an issue. I'm not defending Z-Way software or hardware since it's definitely not bug-free. For example, I experience CPU usage gradually increasing. Before you pursue the draconian approach of switching to a new Z-Wave controller/home automation solution, I suggest spending a bit more time trying to solve the queue issue. I'm guessing it's a matter of re-including a troublesome node without security or adding a manual route.
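
For point 1, finding the first device with retransmissions can be done by eye, but a small script makes it quicker on a long log. The following is a minimal Python sketch, assuming the log lines look like the ones MrRusch posted ("Reply not received before timeout, retrying" for a retry, and a "[W]" line when the job is given up). The function name `summarize_retransmissions` is hypothetical, and this is not a Z-Way tool.

```python
import re
from collections import Counter

# Retry line, e.g.:
#   ... Job 0x13 (Nonce Get to node 2): Reply not received before timeout, retrying
RETRY_RE = re.compile(
    r"Job 0x[0-9a-fA-F]+ \((.+?) to node (\d+)\): "
    r"Reply not received before timeout, retrying"
)
# Give-up line, e.g.:
#   ... [W] [zway] Reply not received before timeout for job (Nonce Get to node 2)
GIVEUP_RE = re.compile(
    r"\[W\] \[zway\] Reply not received before timeout for job "
    r"\((.+?) to node (\d+)\)"
)


def summarize_retransmissions(lines):
    """Scan log lines in order and summarize retries per node.

    Returns (first_node, retries, giveups) where first_node is the node ID
    of the first retry seen (or None), and retries / giveups are Counters
    keyed by node ID.
    """
    retries = Counter()
    giveups = Counter()
    first_node = None
    for line in lines:
        m = RETRY_RE.search(line)
        if m:
            node = int(m.group(2))
            retries[node] += 1
            if first_node is None:
                first_node = node  # earliest retry in the log
            continue
        m = GIVEUP_RE.search(line)
        if m:
            giveups[int(m.group(2))] += 1
    return first_node, retries, giveups
```

A possible way to use it: restart z-way-server, let it run until the queue starts to grow, then call `summarize_retransmissions(open(path))` with `path` pointing at your own log file. `first_node` is the node to look at first, and `retries.most_common()` ranks the remaining offenders.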
