Z-way server goes down spontaneously

Discussions about Z-Way software and Z-Wave technology in general
User avatar
PoltoS
Posts: 7571
Joined: 26 Jan 2011 19:36

Re: Z-way server goes down spontaneously

Post by PoltoS »

Looks we have located the source of memory leak - it is XML parsing. It happens in HTTP requests when reply is XML as well as when sending or saving XML data. We are working on a fix to provide you a safe and stable Z-Way.
User avatar
PoltoS
Posts: 7571
Joined: 26 Jan 2011 19:36

Re: Z-way server goes down spontaneously

Post by PoltoS »

And just to confirm, do you have HTTP requests with XML result or other XML handling?

You can also install our very latest build made (less debug output and easier to analyze). Run like before
Once crashes, please provide the log. We are mostly interested in Occupied - in your case it should grow fast
User avatar
GokMasE
Posts: 59
Joined: 13 Mar 2016 01:04
Location: Sweden

Re: Z-way server goes down spontaneously

Post by GokMasE »

PoltoS wrote:
15 Mar 2020 05:50
Looks we have located the source of memory leak - it is XML parsing. It happens in HTTP requests when reply is XML as well as when sending or saving XML data. We are working on a fix to provide you a safe and stable Z-Way.
Ok, this sounds very promising! I am very much looking forward to a test build that targets this, needless to say :-D
And just to confirm, do you have HTTP requests with XML result or other XML handling?

You can also install our very latest build made (less debug output and easier to analyze). Run like before
Once crashes, please provide the log. We are mostly interested in Occupied - in your case it should grow fast
Even if I had thought that the main trigger for the stability problems was related to using the zway module "HTTPGet", I do indeed also use HTTP requests where the server replies with XML.

I have downloaded latest build and will try to launch it same way as the previous one.
Might have to wait until this evening though, but I will post back as soon as there are news.

UPDATE: I have the new build (z-way-3.0.4-58) running since about 15 minutes, and it is logging considerably less than previous version. The z-way.log in /tmp reached 2mb rather quickly, but it is now building up in a more moderate pace. This version also responds normally on http://192.168.1.195:8083/ - I presume the previous one failed to do that because of the extensive logging.



Regards,

/J
User avatar
GokMasE
Posts: 59
Joined: 13 Mar 2016 01:04
Location: Sweden

Re: Z-way server goes down spontaneously

Post by GokMasE »

Ok, test build so far keeps running but I was curious enough to have a peek at the "Occupancy" level development in the z-way.log-file. It seems to go through sequences of both increase and decrease. The top value I could spot manually so far is "Occupancy 5978".

Regards,

/Joakim
User avatar
PoltoS
Posts: 7571
Joined: 26 Jan 2011 19:36

Re: Z-way server goes down spontaneously

Post by PoltoS »

I think it will grow up for you smoothly (even despite it goes down for some values periodically after GC) and if you disable those requests with XML, it should stop growing
User avatar
PoltoS
Posts: 7571
Joined: 26 Jan 2011 19:36

Re: Z-way server goes down spontaneously

Post by PoltoS »

Finally! The issue was fixed in: z-way-3.0.4-60-g112e32c_armhf.deb

This version still produces quite a lot of debug output - we will remove all this once GoKMasE and/or some others will confirm that the issue is fixed.

Please try it
rafale77
Posts: 115
Joined: 23 Jun 2018 19:26

Re: Z-way server goes down spontaneously

Post by rafale77 »

Is it unique the the raspberry?
I am polling the zway API pretty intensively with a a large network and have not observed this problem on ubuntu x64.
User avatar
PoltoS
Posts: 7571
Joined: 26 Jan 2011 19:36

Re: Z-way server goes down spontaneously

Post by PoltoS »

This is not about polling the Z-Way API, but about doing HTTP requests from Z-Way to other sites that replies with XML.

The issue is inside the JS engine and is definitely CPU dependant, but I suppose memory leaks and corruption should happen on x64 too.

Anyway right now we want to confirm it works on RPi and we do a v3.0.5 release immediately.
rafale77
Posts: 115
Joined: 23 Jun 2018 19:26

Re: Z-way server goes down spontaneously

Post by rafale77 »

I am watching for memory leaks very very closely and have not seen any in over a week. I am pounding the zway http but not with XML so this may not help.
User avatar
GokMasE
Posts: 59
Joined: 13 Mar 2016 01:04
Location: Sweden

Re: Z-way server goes down spontaneously

Post by GokMasE »

Ok, I took a loooong break from this one but now I feel a follow-up is due on this topic:

I no longer think the actual problem lies within zway server itself - I think the problem is in fact a chain of events going wrong rather than an isolated single issue.


Background
==========

Around the time for the 3.0.5 release I realized that the XML bugfix was not the cure of the problem I was seeing, so I decided to take a break and do other things for a while - and aimed to return to investigating the crash problem later on, looking upon things with a rested mind. And lately I did find the time and energy to get back trying again..


Back to testing again!
===================

So I started over with the stability tests again and I could again quickly conclude that z-way crashing problem did not go away with the 3.0.5 release (and is was still very present in the 3.1.1 build as well).
When starting new test sessions with my program as well as zway server, things would work ok for a good while, but within the period of 2-5 days the zway server would eventually end up crashing.

And after lots of tedious tests and analyzing my log files I started noticing something weird - at some point in every session, I nocited how suddenly that although the RPi (zway) would keep on replying when called upon, there would be no HTTPGet requests (reflecting status changes and sensor data leven changes) coming through to my program from the RPi at all. Like if there was all of a sudden no z-wave device status changes or sensor level data changes to report? Well, since I was at the time sending (from my prg to zway) repeated status ON/OFF change requests every 30s (and could hear the corresponding switches actually responding/switching!) I knew something was very wrong if there were no status changes coming through - and I could see in zway log that HTTPGet was trying to issue corresponding requests to reflect the status/data changes! It seemed the listening socket in my program had suddenly become a "deaf" listening socket. By looking at my own programs log as well as the zway log, I could tell that after this had happened, the zway server would still keep running for another 20-25 minutes - but eventually the zway server would ALWAYS crash once the HTTPGet requests stopped getting through to my program.

I started zooming in on some of the details in the zway log file and suddenly noticed that around the time that my program would stop receiving HTTPGet requests, there had shortly before been a number of occasions where the HTTPget module was sending out several requests in one single go - I think I have seen as many as 6 such HTTPGet requests, one after the other, but here is at least a logged example with 3 requests within 0.1s:

[2020-10-14 17:55:22.373] [core] http://192.168.1.3:1025/ZWave?Cmd=CBck& ... Lev=%VALUE% : ZWayVDev_zway_6-1-50-2 : 0.2 : http://192.168.1.3:1025/ZWave?Cmd=CBck& ... -2&Lev=0.2
[2020-10-14 17:55:22.402] [core] http://192.168.1.3:1025/ZWave?Cmd=CBck& ... Lev=%VALUE% : ZWayVDev_zway_6-1-50-2 : 0.2 : http://192.168.1.3:1025/ZWave?Cmd=CBck& ... -2&Lev=0.2
[2020-10-14 17:55:22.456] [core] http://192.168.1.3:1025/ZWave?Cmd=CBck& ... Lev=%VALUE% : ZWayVDev_zway_6-1-50-2 : 0.2 : http://192.168.1.3:1025/ZWave?Cmd=CBck& ... -2&Lev=0.2

Ok, so I could conclude that there were defintely series of HTTPGet callbacks sent in bursts at times. I had to ask myself if that could be a problem? And if so, why?

I then started turning my attention to the listening sockets on my end, and started to wonder if they could be sensitive to being put under stress? I made a very basic test case, where I was merely listening for incoming http requests from a browser (the browser was run from another computer in the same LAN) and in reply to those incoming requests, I sent back a simple example string with the current timestamp. Then I simply held CTRL+R (reload) down in the browser in order to loop the test over and over very rapidly. While that seemed to work without problems for a while (800-1000 interations), suddenly the listening socket stopped reporting incoming data. It had gone bad/deaf! Very similar to what I saw happening in my program that is interacting with zway. And it seems obvious that the listening sockets on my end are not happy about a hefty pace of incoming requests, to put it mildly :/

To sum it up
==========

Basically this is my theory on what rougly happens between my program and zway server:

* HTTPGet requests from RPi are (at rare occasions) sent either too fast or too many at a time (for my system to cope with). The effect is similar to an unintentional miniature DoS-attack it seems.
* The listening sockets in my program then abruptly stops reporting the incoming data - I assume some kind of overflow/corruption in the network stack, but tracing that is beyond the scope of my skills by (very) far.. :/
* If my program + zway server are kept running in this state, the "bad" listening socket on my end seems to be *very* bad news for the zway server too. This seems to be 100% fatal for zway eventually. (I suppose HTTPGet module is not a separate process, or else the actual zway server would not have crashed?)
* I think it is quite safe to say that without HTTPGet involved, there is no issue - atm I am running a test session with my program + zway server (with HTTPGet module inactive) and the uptime has just passed 16 days without any incidents at all.


What about solutions?
==================

Of course, I realize my problem and situation is fairly exotic and that if devs can't reproduce my issues (and no other users are suffering from the same), it will of course limit the motivation as well as the possibilities to find a cure. OTOH - maybe a high frequency of outgoing HTTPGet requests from HTTPGet module could theoretically introduce problems elsewhere too?

One thing I could suggest (and hope for the best), is for someone to add some kind of limiting/flow control/outgoing queue to the HTTPGet module - that could perhaps be of benefit to more users than just me?

At this point in time - I would be VERY interested in testing what would happen if HTTPGet could be equipped with a limit on how fast it is allowed to send the next HTTPGet requests. It could ideally be made configurable I think, with maybe some basic filtering too. On top off my head I can imagine something like this: (rough example!)

1) Send max 1 request per: [0.2/0.5s/1s/2s]
2) Wait until ACK before sending more requests [ON/OFF]
3) Don't send same data twice (filter out request, if sensor state/level is same as last time) [ON/OFF]

I don't know who is maintaining the HTTPGet module today, but I notice in the configuration view that it says its version number is 2 and that the name of the author is "H Plato". Is that a member of the zway dev team or a standalone developer? And is he still active?


Sorry for the long post. My brain is tired from testing and log reading, I hope at least some of this made any sense..


Regards,

/Joakim
Post Reply