job queue runs full after upgrade to v4.1.0

MrRusch · Post by **MrRusch** » 12 Jun 2024 21:07

DesT wrote: ↑
11 Jun 2024 20:57
@MrRusch I got the same issue and it's not hardware related to your rPI as I'm not using a rPI to run z-way-server.

On my side it started when I upgraded to 4.1.3 from 4.1.1.

I don't know about this. I tried removing all traces of z-way-server from my system and reinstalled an old 4.0.3 package I had saved, since I also first started experiencing this issue when I upgraded to 4.1.3. But the version has no effect for me at this point.

MrRusch · Post by **MrRusch** » 12 Jun 2024 22:04

seattleneil wrote: ↑
11 Jun 2024 21:34
A few things you may want to consider...

1. The queue problem you're experiencing is not experienced by most/all other Z-Way users. Presumably, there's something unique about your setup. As an experiment, I would remove power from node 2 since that node looks to be clogging up the queue. It could be another node or more than 1 node. If you restart z-way-server, the log should show what device begins the queue build-up. Odds are high that it's going to be a device that's been included with S0 security since S0 security requires extra messages. Since it only takes 1 misbehaving node to corrupt your entire system, don't worry about any devices other than the first device with retransmissions. Solve the retransmissions for the first node and repeat this approach if you're still seeing a lot of retransmissions.

Device #2 is just one of many now that my controller is unable to receive a response from. Currently only 7 out of my 60 devices work. I guess I could start reincluding them all one by one again but it just feels pointless. I recently recreated everything, it was working OK, and suddenly 53 devices stop responding. I'm starting to think maybe if I just wait a few weeks everything will be fine again..

seattleneil wrote: ↑
11 Jun 2024 21:34
2. Reinstalling z-way-server completely? I doubt this will be a productive path. I would definitely look at the expert UI diagnostic stats and identify devices with low signal strength, multiple hops and retransmissions. I'm guessing the problem you're encountering is related to an RF or Z-Wave routing problem.

I tried this. Seems to have no effect.

seattleneil wrote: ↑
11 Jun 2024 21:34
3. Migrate to old controller? This could be a lot more difficult than it seems. You are assuming a backup of a Z-Way 700-series chip can be restored to a 500-series chip. You can try to do this or ask @PoltoS to see if this is supported. If a 700->500 series backup/restore isn't supported, you'll need to rebuild your entire Z-Wave network from scratch since including your 500-series chip and doing a controller shift will only be partially successful as the 500-series chip will not be assigned node ID=1 which is what you want. Note that if you upgraded from a 500-series chip to the 700-series chip using the backup/restore method (which is supported by Z-Way), then you may be able to swap the 2 controllers and do a network re-organization (without doing a backup/restore).

Figured this would be a problem. What if I include a secondary controller?

seattleneil wrote: ↑
11 Jun 2024 21:34
4. From my testing, other Z-Wave controller software (e.g., Z-Wave JS UI - https://github.com/zwave-js/zwave-js-ui) will consume more RAM and CPU than Z-Way. If you're using a Pi4 or newer, this won't be an issue. I'm not defending Z-Way software or hardware since it's definitely not bug-free. For example, I experience CPU usage gradually increasing. Before you pursue the draconian approach of switching to a new Z-Wave controller/home automation solution, I suggest spending a bit more time on solving the queue issue. I'm guessing it's a matter of re-including a troublesome node without security or adding a manual route.

I honestly don't want to leave z-way either, but I feel I'm running out of options. Logs show me nothing. I don't understand z-wave communication well enough to grasp what could possibly be happening. It doesn't seem to be the z-way-server software, as I've tried starting clean. I've tried restoring multiple backups from different dates, so it doesn't seem to be due to corrupted configuration data. All devices respond that they are operating, but almost none respond to requests to operate (can't toggle switches or get sensor or meter readings). I guess hardware failure is unlikely as not all devices have stopped working. What could cause this type of semi-functional state? The one thing I can think of that I did recently was enable the Z-Wave Long Range functionality via a license.

Analytics > Statistics shows 60% CRC16 ERR
Analytics > Noise meters are at around -100
Analytics > Signal strength is between -89 and -45 on all devices

Neighbors update completes fine.
Reorganization completes fine.

seattleneil · Post by **seattleneil** » 13 Jun 2024 05:29

The Z-Way developers just released a new RaZberry 7 firmware version today that uses the most recent version of Silicon Labs SDK (Z-Way firmware version 7.42 - SDK 7.21.3). This is a big deal - everyone with a RaZberry 7 should update their controller to firmware version 7.42 ASAP. You can read the SDK 7.21.3 release notes here: https://www.silabs.com/documents/public ... 21.3.0.pdf.

You wrote:

Logs show me nothing.

This has not been my experience - the log file is typically very informative, albeit cryptic. Rather than debate the log file contents, I'll work with the information you provided, which is helpful. For instance, a Z-Wave network with 60 nodes is pretty large and likely has a complicated mesh topology. 60% CRC errors is terrible and needs to be fixed. Perhaps you're accidentally generating a lot of Z-Wave traffic by polling or having devices report status too frequently. Note that Long Range simplifies routing for the nodes that support it (it's point-to-point), but it's important to understand that Long Range nodes are unable to act as mesh repeaters. In other words, enabling Long Range is a double-edged sword: good for Long Range nodes and potentially catastrophic for non-Long Range nodes that relied on a newly-enabled Long Range node for mesh routing.

Here's what I would do:
1. Upgrade firmware to version 7.42, restart z-way-server and then reorganize your network. With luck, the new firmware will solve your CRC and queue issues.
2. If you're still seeing more than 5% CRC errors after a few hours of traffic, disable Long Range support, restart z-way-server and then reorganize your network. With luck, this will get you back to when your network was working without Long Range support, with the added benefit that the new firmware will solve your CRC and queue issues.
3. If you're still seeing more than 5% CRC errors after a few hours of traffic, use the expert UI analytics tools ([IP]:8083/expert/#/installer/packets) to find devices that are troublesome. Also look at [IP]:8083/expert/#/network/timing for useful info. The network map is also very helpful, especially if you go to the trouble of uploading a diagram of your house and move the devices to their proper location.
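As a rough illustration of the 5% check used in steps 2 and 3 above, here is a minimal Python sketch. The function names and the two counters (frames that failed CRC and total frames received) are assumptions made for illustration; the expert UI already shows the percentage under Analytics > Statistics, so this only makes the arithmetic explicit.

```python
def crc_error_rate(crc_errors: int, total_frames: int) -> float:
    """Return CRC errors as a percentage of all received frames."""
    if total_frames <= 0:
        # No traffic observed yet, so there is nothing to judge.
        return 0.0
    return 100.0 * crc_errors / total_frames


def needs_next_step(crc_errors: int, total_frames: int,
                    threshold_pct: float = 5.0) -> bool:
    """True if the error rate is above the 5% threshold from the steps above."""
    return crc_error_rate(crc_errors, total_frames) > threshold_pct


# Hypothetical counters read off the statistics page after a few hours:
# needs_next_step(crc_errors=600, total_frames=1000)
```

With the 60% figure reported earlier in the thread, the check says to move on to the next step; at exactly 5% it does not, since the steps say "more than 5%".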

MrRusch · Post by **MrRusch** » 13 Jun 2024 10:23

seattleneil wrote: ↑
13 Jun 2024 05:29
Here's what I would do:
1. Upgrade firmware to version 7.42, restart z-way-server and then reorganize your network. With luck, the new firmware will solve your CRC and queue issues.

Just did this. Sadly still exactly the same behavior:
1. Controller sends Nonce - Delivered
2. Waits 1 second for response, times out, retries...
..endless loop

seattleneil wrote: ↑
13 Jun 2024 05:29
2. If you're still seeing more than 5% CRC errors after a few hours of traffic, disable Long Range support, restart z-way-server and then reorganize your network. With luck, this will get you back to when your network was working without Long Range support, with the added benefit that the new firmware will solve your CRC and queue issues.

I realized that even though I enabled this capability I shouldn't have been able to use it as I am situated outside of US and the GUI simply shows it as not supported on the Network Control page. Is there some way to see through configuration if this is in fact currently enabled, and also if a specific device has been added with this enabled?

seattleneil wrote: ↑
13 Jun 2024 05:29
3. If you're still seeing more than 5% CRC errors after a few hours of traffic, use the expert UI analytics tools ([IP]:8083/expert/#/installer/packets) to find devices that are troublesome. Also look at [IP]:8083/expert/#/network/timing for useful info. The network map is also very helpful, especially if you go to the trouble of uploading a diagram of your house and move the devices to their proper location.

I will try to understand what these tools are telling me better..

MrRusch · Post by **MrRusch** » 13 Jun 2024 22:21

One thing I am noticing in the Analytics > Route map is that many nodes don't have a route in the "Route FROM the node" (back to the controller). Even more suspiciously, 17 nodes are listed under "No status for", i.e. don't have a route either way. The manual says "The chart visualizes the possible links between the nodes and how they are used." so I thought of this as the routing table that comes from a reorganization. But then I see "Gathering period (in hours)" at the top, so maybe this is more of a visualization of the routes packets have actually taken, not of which routes are actually possible? Either way it seems bad...
And how does this relate to the Network > Neighbors table? Here it seems that all of my nodes have plenty of green neighbors. How come they are not being picked as possible routes in the reorganization? When running a reorganization, it starts by asking for neighbors, so there is no reason to first update all neighbors through this dedicated function and then run a reorganization, right?

EDIT: After running the reorganization a few times with a restart of the server between each run, I managed to get rid of all the "No status for" nodes, except the controller (which I assume should be there). Sadly still same behavior..

lanbrown · Post by **lanbrown** » 14 Jun 2024 00:16

You may need to use the token of all to see the firmware.

seattleneil · Post by **seattleneil** » 14 Jun 2024 05:18

One of my concerns is that you reported your Z-Wave network was stable before you enabled the Long Range feature. By my way of thinking, the first thing I would try is to revert back to this stable state. Do you have an earlier expert UI .zbk backup in this stable state? If so, I suggest you create a new backup and then try to restore the earlier backup. If you don't have a backup, I suggest you create a new .zbk backup. Creating a backup is a safety measure so that you can revert back to your current/broken configuration. I also suggest you create a backup of the storage directory. Run something like this from the command line: tar cvf ~/zway-storage.tar /opt/z-way-server/automation/storage. This is another safety measure so that you can revert back to your current/broken configuration.
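If you'd rather script the storage backup than type the tar command each time, here is a minimal Python sketch of the same idea using the standard tarfile module. The function name and the timestamped archive name are my own choices for illustration; the storage path is the one from the command above and may differ on your install.

```python
import tarfile
import time
from pathlib import Path


def backup_directory(src: str, dest_dir: str) -> Path:
    """Write an uncompressed tar archive of src into dest_dir and return its path."""
    src_path = Path(src)
    if not src_path.is_dir():
        raise FileNotFoundError(f"not a directory: {src}")
    # A timestamp in the name keeps earlier backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = Path(dest_dir).expanduser() / f"zway-storage-{stamp}.tar"
    with tarfile.open(archive, "w") as tar:
        # arcname stores paths relative to the directory name, not the full path.
        tar.add(src_path, arcname=src_path.name)
    return archive


# Example call (path taken from the tar command above, run with enough rights to read it):
# backup_directory("/opt/z-way-server/automation/storage", "~")
```

Unlike the single fixed file name in the tar command, each run produces a new archive, which is handy if you back up before every experiment.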

With your backups done, here are my responses and suggestions on your post:

MrRusch wrote: ↑
13 Jun 2024 22:21
One thing I am noticing in the Analytics > Route map is that many nodes don't have a route in the "Route FROM the node" (back to the controller).

I don't think this is a problem. It simply means the controller and the node haven't had any 2-way communication. One way to create some communication is to use the expert UI and issue a command to the node. For example, go to Configuration->Expert Commands and issue a Get on the ZWave Plus command class as shown below.

[Image: expert command.jpg]

You can now go back to the Network Map where you should be able to see the to/from path for that node by hovering your mouse over the node. In my experience, it's very helpful to have a sketch of your floorplan and place your nodes at their location in your floorplan. Here's what my network map looks like:

[Image: route map.jpg]

If you look carefully at the screenshot of my network map, you can see the route to/from my controller to node 31 based on a network reorganization is the crazy path of 1->41->42->24->32->31 (it appears as a black line). I added a manual route (it appears as a yellow line) of 1->32->31 which is much more direct. After adding the manual route, I tested the route by issuing another Get command. If you use this procedure for all of your nodes, you should be able to define a route map that works reliably. Deciding if/when to add a manual route involves some trial and error. Crazy multi-hop routes should get fixed. Also, if you have routes through nodes with poor signal strength, you'll be well-served to add a route using nodes that have good signal strength. You can see signal strength in installer/packets menu. You'll also need to use the network neighbors table to determine which nodes can see each other (you can't define a route using a node that does not have a green box to the next hop).
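To make the "green box to the next hop" rule concrete, here is a small Python sketch that checks a candidate manual route against a neighbors table. The function name and the neighbor data are hypothetical (made up for illustration, not taken from any real network), and nothing here talks to Z-Way; you would fill in the table by hand from Network > Neighbors.

```python
def route_is_valid(neighbors: dict[int, set[int]], route: list[int]) -> bool:
    """True if every consecutive pair of nodes in route are listed as neighbors."""
    if len(route) < 2:
        # A route needs at least a source and a destination.
        return False
    return all(b in neighbors.get(a, set()) for a, b in zip(route, route[1:]))


# Hypothetical neighbors table: node ID -> nodes it shows a green box for.
neighbors = {
    1: {32, 41},
    32: {1, 31},
    31: {32},
    41: {1, 42},
}

direct = route_is_valid(neighbors, [1, 32, 31])  # every hop is a listed neighbor
skipped = route_is_valid(neighbors, [1, 31])     # node 1 does not list node 31
```

This only checks that each hop is possible according to the table. It says nothing about signal strength, so a route that passes still needs to be tested with a Get command as described above.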

I suggest you tackle the route map 1 node at a time. For example, since you're having a problem with Nonce to a node, then check the to/from route for that node. As an aside, unless this node has a real need for security (e.g., it's a door lock), then I would exclude/re-include the node without security to reduce the number of packets.

MrRusch wrote: ↑
13 Jun 2024 22:21
When running a reorganization, it starts by asking for neighbors, so there is no reason to first update all neighbors through this dedicated function and then run a reorganization, right?

Correct. No need to update the network neighbors. The network re-org should have taken care of that for you. However, if you see traffic getting re-routed for no apparent reason, then the network neighbor table may be outdated.

As you see from this post, maintaining a healthy mesh routing table isn't easy. Nodes can get unplugged or have their antenna obscured. I wish I had a Z-Wave network with nodes that all support Long Range since all of the mesh routing complexity would be eliminated.

Please keep the forum posted on your progress.

piet66 · Post by **piet66** » 14 Jun 2024 09:29

MrRusch wrote: ↑
13 Jun 2024 22:21
One thing I am noticing in the Analytics > Route map is that many nodes don't have a route in the "Route FROM the node" (back to the controller).

I have the same effect. I attribute this to the issue https://forum.z-wave.me/viewtopic.php?f=3419&t=35878. It could also have an impact on a reorg.

Besides this, my system works fine without job queue problem.

MrRusch · Post by **MrRusch** » 14 Jun 2024 10:32

seattleneil wrote: ↑
14 Jun 2024 05:18
One of my concerns is that you reported your Z-Wave network was stable before you enabled the Long Range feature. By my way of thinking, the first thing I would try is to revert back to this stable state. Do you have an earlier expert UI .zbk backup in this stable state? If so, I suggest you create a new backup and then try to restore the earlier backup. If you don't have a backup, I suggest you create a new .zbk backup. Creating a backup is a safety measure so that you can revert back to your current/broken configuration. I also suggest you create a backup of the storage directory. Run something like this from the command line: tar cvf ~/zway-storage.tar /opt/z-way-server/automation/storage. This is another safety measure so that you can revert back to your current/broken configuration.

I have been making regular backups at different stages of my recent recreation of the network. Going through these backups it seems I enabled the long range function before including all the nodes.. At least if I can trust the

Code: Select all

<LongRange>1</LongRange><!-- Set to 1 to enable Z-Wave Long Range capabilities and turn the Z-Wave tranceiver into 16-bits node id mode -->

..attribute in Defaults.xml of the zbk file. It seems odd, because I really thought it was more recently that I enabled this via the license.

EDIT: Never mind.. The correct attribute is in the DevicesData.xml and it really seems to correlate well with the last known contact with the failed devices. I have already tried restoring from this point in time, but I will keep trying and digging into this. I kind of want to change this attribute in my current zbk and load that back.

EDIT2: OK, so even if I go back to a backup where this is disabled in DevicesData.xml, it still shows as enabled in the expert UI when loaded. So it doesn't seem possible to disable the long range function using the zbk alone. I'm guessing something is written to the controller in the enabling process..
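For comparing several backups, a minimal Python sketch like the following could read the LongRange element quoted above from an XML file once it has been extracted from the zbk. The function name and the wrapper elements in the sample are assumptions; only the LongRange element itself comes from the Defaults.xml snippet above, and the layout of DevicesData.xml isn't shown in this thread, so that file is not covered here.

```python
import xml.etree.ElementTree as ET


def long_range_flag(xml_text: str):
    """Return True/False for <LongRange>1</LongRange> / any other value, or None if absent."""
    root = ET.fromstring(xml_text)
    # Handle the element being the root itself or nested anywhere below it.
    elem = root if root.tag == "LongRange" else root.find(".//LongRange")
    if elem is None or elem.text is None:
        return None
    return elem.text.strip() == "1"


# Hypothetical wrapper elements around the element quoted above.
sample = """
<Defaults>
  <Controller>
    <LongRange>1</LongRange><!-- Set to 1 to enable Z-Wave Long Range capabilities -->
  </Controller>
</Defaults>
"""
enabled = long_range_flag(sample)
```

As noted in EDIT2, the value in the backup may not reflect what the controller itself has stored, so treat this as a way to compare backups rather than proof of the current state.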

seattleneil · Post by **seattleneil** » 14 Jun 2024 18:02

MrRusch wrote: ↑
14 Jun 2024 10:32
I'm guessing something is written to the controller in the enabling process..

That seems like a reasonable assumption. I don't have any suggestions on how you can disable the Long Range feature. I'm guessing you'll need support from the Z-Way team. Perhaps @PoltoS will chime in on this discussion.

Now that your RaZberry 7 is running the newest firmware (version 7.42), my best suggestion for solving your queue issue is to look for RF/routing problems. The expert UI has a very easy way to check the overall health of your Z-Wave network using the Network->Timing menu item. As an aside... the main reason I stick with Z-Way is because of their valuable troubleshooting tools. Without these tools, it would be difficult to understand and fix Z-Wave issues. This is what appears on one of my systems:

[Image: timing.jpg]

At the very least, you need to resolve any red-colored communication. It will be worth the effort as you'll have a reliable home automation system.
