Pachube Feed Freezing
I'm using the 302 firmware and seem to be having problems with the Pachube uploads freezing. Checking the device this evening, I found it had frozen earlier today.
Looking at the /var/log/xap-pachube.log -
# cat xap-pachube.log
Pachube Connector for xAP v12
Copyright (C) DBzoo 2009-2010
[err][pachulib.c:137:connect_server] errno 145 (Connection timed out)
[err][pachulib.c:137:connect_server] Err!! connect
[err][pachulib.c:137:connect_server] errno 145 (Connection timed out)
[err][pachulib.c:137:connect_server] Err!! connect
[err][pachulib.c:137:connect_server] errno 145 (Connection timed out)
[err][pachulib.c:137:connect_server] Err!! connect
[err][pachulib.c:185:read_data] errno 131 (Connection reset by peer)
[err][pachulib.c:1#
# ps
PID USER VSZ STAT COMMAND
1 root 1796 S init
2 root 0 SW [keventd]
3 root 0 RWN [ksoftirqd_CPU0]
4 root 0 SW [kswapd]
5 root 0 SW [bdflush]
6 root 0 SW [kupdated]
7 root 0 SW [mtdblockd]
8 root 0 SW [khubd]
34 root 0 SWN [jffs2_gcd_mtd2]
111 root 1668 S dropbear -p 22
133 root 1028 S /usr/bin/xap-hub -i br0
137 root 1068 S /usr/bin/xap-livebox -s /dev/ttyS0 -i br0
144 root 1044 S /usr/bin/xap-serial -i br0
146 root 4072 S /usr/bin/kloned
148 root 2616 S /usr/bin/xap-currentcost -s /dev/ttyUSB0 -i br0
152 root 4168 S lua /etc_ro_fs/plugboard/plugboard.lua
179 root 1080 S /usr/bin/xap-pachube -i br0
187 root 1740 S dropbear -p 22
188 root 1812 S -ash
193 root 1784 R ps
#
Those errors are just telling you there was a connectivity problem at some point - this would result in a frozen feed for that period. However, they are benign and the process will recover automatically without having to be restarted.
If you do experience an issue, make sure you check the logs BEFORE you reboot, as any error messages will be lost on reboot.
Brett
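Since the logs do not survive a reboot, one option is to copy them somewhere persistent first. The following is only a minimal sketch: the destination directory `/mnt/usb/hah-logs` and the function name `save_logs` are assumptions for illustration, not something the firmware provides.

```shell
#!/bin/sh
# Copy the current log files somewhere persistent before rebooting.
# DEST is a hypothetical mount point for a USB stick; adjust to your setup.
LOG_DIR="${LOG_DIR:-/var/log}"
DEST="${DEST:-/mnt/usb/hah-logs}"

save_logs() {
    src="$1"
    dest="$2"
    # One sub-directory per run, named after the current date and time.
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dest/$stamp" || return 1
    # Copy each log that exists; a directory with no logs is not an error.
    for f in "$src"/*.log; do
        [ -f "$f" ] && cp "$f" "$dest/$stamp/"
    done
    # Print where the copies went.
    echo "$dest/$stamp"
}

# Typical use, followed by a manual reboot:
# save_logs "$LOG_DIR" "$DEST"
```

If the clock has not been synchronised, the directory name will carry a 1970 date, which is harmless but worth knowing when you look for the copy later.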
They had some problems about 12 hours ago with their server and it came back up an hour later. Apparently there were problems with the hosting provider.
So that would have been down between 11 and 12 on the 13th Jan and back up between 12 and 1 ish.
kevint
I've created issue 35 to track the pachube freezing problem - This happened to me too.
UPDATE: I found the problem: it was a bug in a 3rd party library I was using. Grrrr
Pachube doesn't go down too often so we should be ok until I get the next release out. If it does, you'll need to reboot to fix this, as the issue is a file descriptor leak which affects every process on the HAH. Yeah, it's a really ugly bug I inherited in this library; see what happens when you don't write it yourself!
Brett
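A file descriptor leak of this kind can be watched from the shell by counting the entries under `/proc/<pid>/fd`. This is a rough sketch, assuming the firmware's kernel exposes `/proc/<pid>/fd` and `/proc/<pid>/cmdline` in the usual Linux way; the function names and the `PROC_ROOT` variable are made up for the example.

```shell
#!/bin/sh
# Count the open file descriptors of a process by listing /proc/<pid>/fd.
# PROC_ROOT can be overridden, which is only there to make testing easy.
PROC_ROOT="${PROC_ROOT:-/proc}"

count_fds() {
    if [ -d "$PROC_ROOT/$1/fd" ]; then
        ls "$PROC_ROOT/$1/fd" | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Print "<pid> <fd count> <command line>" for every running xap-* process.
report_xap_fds() {
    for dir in "$PROC_ROOT"/[0-9]*; do
        [ -r "$dir/cmdline" ] || continue
        p="${dir##*/}"
        # cmdline is NUL-separated; turn the NULs into spaces.
        cmd=$(tr '\0' ' ' < "$dir/cmdline")
        case "$cmd" in
            */xap-*) echo "$p $(count_fds "$p") $cmd" ;;
        esac
    done
}

# Example: run this a few times during an outage and compare the counts.
# report_xap_fds
```

A count that keeps climbing between runs while the connection is down would point to a leak; a steady count would suggest the problem lies elsewhere.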
Hi Brett
I'm on 302.8 (- thanks for the ability to post to datastream '0'......)
Last night I had a brief service interruption - unfortunately I didn't check till this morning and found:
1) Non HAH post to Pachube all OK
2) HAH post to Pachube frozen (I have two HAH - both FROZEN)
I've tried a hard re-boot on both HAH - no fix
xFx viewer shows xAP activity with heartbeat on Pachube and CurrentCost
I can see CurrentCost values being posted by both HAH on Joggler-xAP Flash GUI screen
I can't log onto either HAH with WinSCP, and PuTTY takes 10 times longer than normal to respond, but eventually does get to the login screen.
I know I've re-booted but I get the following from CAT
# cat xap-pachube.log
cat: can't open 'xap-pachube.log': No such file or directory
I'm going to take down the whole LAN and bring each element on-line to see if I can sort it...
Any ideas?
EJ
OK - nothing improved until I physically disconnected the ADSL Router - powered up and everything fell back into place.....weird....EJ
If your internet connection suddenly drops, DNS resolution fails and each lookup takes a long time to time out. So when you attempt to SSH/SCP into your live box, various networking stacks have a bit of an issue and these delays become VERY noticeable, as you have experienced. If you rebooted your HAH and it then locked up, that would be the NTP daemon trying to reach out to set the time before bringing up the SSH daemon; again this takes about 5-10 min to time out if no internet is found. So it appears your HAH is dead, but it's not. You should have noticed the message 'NTP sync' on the LCD. I added this so people would know why it's freezing at this point. Perhaps you didn't look ?!
I suspect what might have happened is that your ADSL connection did a DHCP renew and perhaps got a new IP, which screwed everything over, as the connections that were established internally now all get dropped. AND/OR your modem just fell over and DIED and needed resetting (the likely scenario), as a DHCP RENEW should not have locked up the HAH like I mentioned above.
When you reset your modem the HAH would now resolve everything and boot up fine.
Brett
Hi Brett
thanks for that......I didn't know that a message would be shown on the LCD - must read ALL the wiki....anyway as I mentioned I re-booted the ADSL and everything came back up as normal!
I've recently downloaded the BETA with the feed to datastream zero element added - which topic should I log into to post updates, etc......as I'm not able to get a feed to datastream zero from a HAHHub which has been instanced....
cold weather is abating - now a balmy 4.3C outside (9:00pm)
regards....EJ
It is VERY important that if you run more than 1 HAH on a network you MUST make sure they all have different MAC addresses. Remember the HAH comes with 2 network ports; they default to 00:07:3A:11:22:00 and 00:07:3A:11:22:01, and can be found on the Management tab of the web GUI. I would suggest you add 1 to the last part of the MAC for additional HAHs, ie 00:07:3A:11:22:02 and 00:07:3A:11:22:03 for a second HAH.
Duplicate MAC addresses will also cause problems for your internet router or any other network device you might have. You may even find your router losing its internet connection when you have duplicate MAC addresses.
Karl
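For anyone with several units, the numbering scheme above can be worked out mechanically. This is just a sketch of the arithmetic; the function name `hah_macs` is made up, and it assumes you keep the default prefix and vary only the last octet.

```shell
#!/bin/sh
# Print the two MAC addresses for the Nth HAH on a network (N starts at 1),
# following the scheme suggested above: HAH 1 keeps ...:00 and ...:01,
# HAH 2 uses ...:02 and ...:03, and so on.
MAC_PREFIX="${MAC_PREFIX:-00:07:3A:11:22}"

hah_macs() {
    n="$1"
    first=$(( (n - 1) * 2 ))
    second=$(( first + 1 ))
    # Only the last octet is varied, so at most 128 units fit this scheme.
    if [ "$n" -lt 1 ] || [ "$second" -gt 255 ]; then
        echo "unit number out of range: $n" >&2
        return 1
    fi
    printf '%s:%02X %s:%02X\n' "$MAC_PREFIX" "$first" "$MAC_PREFIX" "$second"
}

# Example: addresses for a third HAH
# hah_macs 3
```

The addresses still have to be entered by hand on the Management tab; the script only does the counting.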
I am also experiencing a similar freeze problem, which I have tracked down to occurring when my router goes down (froze today at 15:50; router uptime aligns, give or take a few minutes).
From issue 35, noted in an earlier post by Brett, it was a known issue but has now been fixed (I am running version 306). Details from xap-pachube.log are:
# cat xap-pachube.log
Pachube Connector for xAP v12
Copyright (C) DBzoo 2009-2010
[err][pachulib.c:101:resolve_host] Err!! gethostbyname
[err][pachulib.c:101:resolve_host] errno 22 (Invalid argument)
Resets ok when I reboot the HAH.
Any ideas ?
Dave
I have no ideas this time, as I puzzled over this for a while the last time before I spotted a file descriptor leak. Now it gets even harder...
I recently purchased 2 boxes with 306 on them. I've also noticed the Pachube upload freezing. So much so, I've stopped using them to upload 'important' data for PV generation (and have gone back to the dreaded CCBridge). I still have it running from a clamp and another envi just to keep pushing data through it to see if it keeps breaking. When I say freezing, it's still sending data, just the same value over and over again.
Can anyone give me any pointers or leads to follow to try to understand what is going wrong? I had a quick look at the xAP traffic with the free ipad app. I can see values moving around and changing, problem is its hard to follow as you have to catch it when it sticks which will be either when I'm at work or asleep! Are there any logs / debug stuff on the box to record if its having a problem sending data to COSM?
Could it be that your whole house hysteresis is set a bit high, so that the same value is uploaded until a significant change (over the setting) is detected?
I still suffer freezing of the feed, I can normally associate this with my network connection going down.
It can also freeze the feed at a value, but it is worth checking the hysteresis setting to ensure this is not also the issue.
I work around the issue by scheduling a reboot of the HAH every 12 hours, which is ok for now (see Pvoutput link below) as all I am using Pachube for is my house consumption. My PV is still collected by a script on my PC that connects via Bluetooth to the inverter every hour through the day. I would like to switch from the PC, but results using a CC clamp are less accurate, so I have decided to stick with the current setup for now, although I am looking into using the CC optismart sensor for a friend's PV installation.
I have heard reports of the feed not recovering after a network outage. Unfortunately it's going to be a couple of months before I get a chance to look at this code and test for this, as I'm 2 weeks from relocating countries. Having said that, with a stable network connection I've never experienced an outage. Maybe it's time to change providers :)
Perhaps somebody else can take a swing at this issue and see if they can find a bug.
Brett
A couple of questions:
- How do I schedule a reboot?
- do I need to do something to turn on logging? The files in my logs directory are empty?
Brett will probably answer this before me, but you could either:
call reboot via crontab, or write a script for PBv2 to call reboot, or listen for a BSC message via PBv2 and reboot - lots of possibilities. Another thought - look at the web admin page; perhaps you could even post/get to the reboot button if you want to do it from wget on another system, although I've not checked this and perhaps Brett does a referrer check. I'm not near the HAH source now so can't check myself.
As for logging, I think unless you look at Brett's new betas you may have to start processes manually to enable enough logging. Guess this is where a USB stick comes in handy...
--andy
Try this link for the reboot script, all the details are on the plugboard (PB) Wiki to explain how to get the Applets working.
https://docs.google.com/leaf?id=0BwzJbOYgkNcVZGU3NDZhZjEtZWI4Mi00MjkxLTh...
Don't worry, it's easy (I managed it !)
Dave
Brett - did you manage to get this resolved? I'm on 306 and believe it's still happening.
Every day or so, my feeds stop; I'm not sure it's directly related to router/network issues though, as I can't do an exact correlation to my router going offline. As everyone says though, the feed process doesn't recover when the network is back. A simple click of the restart button on the Pachube page starts it working.
In the meantime, other coders, I wonder is there a better way than doing a full reboot? Maybe a check if the service has stopped, or just restarting it every 12 hours or so?
Thoughts?
Edit #1
I noticed that, presumably because I'd restarted the Pachube service, there were several instances of it running (as shown by ps) - I would have thought they should have disappeared, and could be hanging?
I've rebooted and will obtain screenshots as they become relevant.
I have a really flaky Internet connection and since Brett's alterations my Pachube recovers fine. In fact I've had no Pachube failures for as long as I can remember. So it certainly isn't a universal issue.
As for rebooting, you could just schedule the cron scheduler to issue a reboot command as often as you like.
More intelligently you could also write a script that monitors Pachube heartbeats and reboots if they stop. If you need help with this I could knock one up for you.
In fact have you checked the heartbeats do actually stop when the posting stops? Just wondering as you say you are seeing multiple instances.
Garry.
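As a lighter alternative to a full reboot, something like the following could be run from cron. It is only a sketch and rests on several assumptions not confirmed in this thread: that new `[err]` lines in the log are a usable trigger (it does not listen for xAP heartbeats), that `killall` is available on the box, that the process can simply be relaunched with the arguments shown in the `ps` listing, and that appending its output to the log file matches how the firmware logs. The state file path and function names are made up.

```shell
#!/bin/sh
# Restart xap-pachube when its log has gained new [err] lines since the
# previous run. Meant to be run periodically, for example from cron:
#   */10 * * * * /path/to/this-script      (hypothetical path)
# A plain timed reboot would instead be a line such as:
#   0 */12 * * * reboot
LOG="${LOG:-/var/log/xap-pachube.log}"
STATE="${STATE:-/tmp/xap-pachube.errcount}"   # hypothetical state file

# Number of error lines currently in the log (0 if the log is missing).
count_errors() {
    if [ -f "$1" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches.
        grep -c '^\[err\]' "$1" || true
    else
        echo 0
    fi
}

# Succeeds if the log holds more error lines than on the previous run.
# The new count is recorded either way.
errors_grew() {
    cur=$(count_errors "$1")
    prev=0
    [ -f "$2" ] && prev=$(cat "$2")
    echo "$cur" > "$2"
    [ "$cur" -gt "$prev" ]
}

restart_pachube() {
    # Stop every running copy first so that instances do not pile up,
    # then start one again with the arguments seen in the ps listing.
    killall xap-pachube 2>/dev/null
    # Assumption: output is appended to the same log file.
    /usr/bin/xap-pachube -i br0 >> "$LOG" 2>&1 &
}

# Uncomment to activate:
# errors_grew "$LOG" "$STATE" && restart_pachube
```

Because errors are also logged during a genuine outage, this will restart the process repeatedly while the connection is down; that is wasteful but should be harmless, and it means a fresh process is running once the network returns.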
Ok, stopped again this morning. This is the pachube log
# cat xap-pachube.log
Pachube Connector for xAP v12
Copyright (C) DBzoo 2009-2010
[err][pachulib.c:101:resolve_host] errno 22 (Invalid argument)
[err][pachulib.c:101:resolve_host] Err!! gethostbyname
[err][pachulib.c:101:resolve_host] errno 22 (Invalid argument)
[err][pachulib.c:101:resolve_host] Err!! gethostbyname
etc
I can't see that my network has dropped out, but I wonder if the hostname resolution is failing (as above) ?
The xAP Pachube heartbeat is still running.
Prior to restarting, there is only one copy of
145 root 1080 S /usr/bin/xap-pachube -i br0
after, restarting, there are two:
145 root 1080 S /usr/bin/xap-pachube -i br0
204 root 1080 S /usr/bin/xap-pachube -i br0
So I'm guessing the first one is hanging, although that's challengeable because the heartbeat is still running. Also, XFX Viewer now shows both heartbeats.
Can someone try a restart on their (working) version, and see if it creates another instance (as shown in PS) without closing the first?
You said you are using firmware 302 - do you mean 306, as this is the latest?
After release 302 there were some Pachube buffer overrun issues that were resolved; these could cause it to misbehave if you have many data feeds you are updating. That is why I ask if you are really on 302. You want to be on 306.
Brett
Hi Brett
Just to stick my 2 cents worth in - my main home HUB (not instanced) freezes whenever my (Sky) internet hiccups, however the remote HUB (instanced) recovers!!!!
The main HUB has many CC and weatherstation items being posted via the HUB page, plus I'm posting a dozen more to ID's on a couple of Feeds via Plugboard...... the remote HUB only has three items being posted via the HUB Pachube page.
BUT - it's not consistent.... I just had a 20 second loss of service and neither HUB recovered gracefully.....had to re-start Pachube on both (waited 15mins to make sure it wasn't auto recovering)!!!
I don't always get anything in the Pachube log, but occasionally there are err 22 and err 146 in the log.
It's become more noticeable recently as Sky have a very flakey service in my area - which they blame on the 'unseasonable wet weather' !!!!!!
Anyway - over the past 14 days I've had 26 freezes, 18 reboots and 8 restart Pachube's. The Pachube heartbeat doesn't always cease so I can't use that as a trigger..... can anyone point me to a Twitter applet?
hope you can get to the bottom of this......cheers for all your good work......EJ
Exceedingly - to the extent that I don't dare touch the tab (add, amend or delete) anymore! I lost several ID's, Feeds, etc whenever I tried to change anything so I resorted to my Plugboard applets to post additional ID's to Pachube.... Brett wants/was going to rehash the ini system at one stage, so I will wait for that....EJ
Sometimes the web interface makes a bit of a dog's breakfast of the .ini file. Once I get settled (one more house move to go) I'll be able to spend some time investigating these problems. The Pachube one is curious.
The only thing that comes to mind is that if you have a really prolonged outage the amount of log that is produced can fill up the /var/log area and break things due to out of "disk". As this is a ram disk a reboot clears this and then we are good.
Brett
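To see whether a long outage really has filled the log area, the usage can be checked with `df`. A minimal sketch, assuming the box's `df` accepts `-P` (POSIX output format); the 90% threshold and the function names are arbitrary choices for illustration.

```shell
#!/bin/sh
# Warn when the filesystem holding the logs is nearly full.
LOG_DIR="${LOG_DIR:-/var/log}"
LIMIT="${LIMIT:-90}"    # percent; an arbitrary threshold for illustration

# Read `df -P` output on stdin and print the use percentage of the last
# filesystem listed, without the trailing % sign.
parse_df_pct() {
    awk 'NR > 1 { pct = $5 } END { sub(/%/, "", pct); print pct }'
}

log_usage_pct() {
    df -P "$1" | parse_df_pct
}

# Prints the usage and returns non-zero when it is at or above LIMIT.
check_log_space() {
    pct=$(log_usage_pct "$LOG_DIR")
    if [ "$pct" -ge "$LIMIT" ]; then
        echo "WARNING: $LOG_DIR is ${pct}% full"
        return 1
    fi
    echo "$LOG_DIR is ${pct}% full"
}

# Example:
# check_log_space
```

Running this during or just after an outage, before rebooting, would show whether the ram disk is the culprit.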
So it happened again - while I was watching the xfx viewer and the HubGUI......
1) GoogleCal went dead.... (orange box in the view list)
2) Pachube went dead .... ( ... ditto ... )
3) COSM/Pachube froze on the last transmitted numbers....
4) Nothing in any log file!!!!
So I looked at the HUB - all appeared to be functional. The PC on the same LAN was able to access the internet so I decided to try a Pachube refresh from the Hub GUI..... the Pachube icon in xfx viewer showed OK and it listed the heartbeat stop and start so I left it for a while.... the heartbeats were not every 60 - they came about 90, then 75, then 80 then 70...all over the place.... and the COSM site was not updating!!!
This time I rebooted from the GUI..... everything looked OK (all heartbeats OK), but the COSM site was still not updating..... then I glanced at the Hub time stamp.... Jan 01 1970..... the Hub had rebooted but hadn't got a good NTP time sync.... rebooted again... bad time/heartbeat.... and again.... bad time/heartbeat.....left for 30 mins then rebooted..... bad time/heartbeat.... left an hour then rebooted....all OK, time in sync - everything heartbeating OK and the COSM site was being updated...!!!!!
I'm now constantly looking at my COSM twitter feed to see if the ID has frozen, but as it can take up to 30 mins for COSM to respond to a freeze, it's not exactly rapid response....
Hope this helps....
That is certainly interesting.
On a reboot what can happen is that if the network is down, or at least DNS resolution can't occur, then the script (/usr/share/udhcpc/default.script) that executes when a dhcp IP request is received may timeout too leaving your system with the wrong date/time.
So it's possible that your router is up and handing out DHCP IP addresses while the WAN (internet) is still down, which causes the ntpclient to hang before timing out. At this point the system will come up, BUT Pachube will continue to fail as the internet is still down (ie gethostbyname will throw that error that you see in the logs) because DNS resolution is still breaking.
What it does not explain is why none of this recovers - certainly anything that needs SSL (ie SSH & HTTPS) will not work until the time is synced up, that is xap-googlecal and xap-twitter will both be hosed.
I'm not sure of the behaviour of the ntpclient once it times out. I don't believe it will try again, which would require a further reboot (as you mentioned).
The irregular heartbeats are unusual but I'm not sure what that is indicative of - perhaps some sort of network load?
You can try using a different NTP host for syncing the time if uk.pool.ntp.org is playing up (I'd find that very unusual). Perhaps your service provider, as a closer source, has one: ntp.sky.co.uk or ntp1.isp.sky.com - you might want to manually check those work before changing your box to use them. They resolve for me but I could not sync to them - but then again I'm not on a Sky network and perhaps they lock them down.
# ntpdate -s -h <NTPSOURCE>
Brett
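The "Jan 01 1970" symptom can be detected from the shell, which saves rebooting blind. A small sketch, assuming only that `date +%Y` works on the box; the cut-off year and function names are arbitrary.

```shell
#!/bin/sh
# Detect a clock that was never synchronised (it starts at Jan 01 1970).
MIN_YEAR="${MIN_YEAR:-2010}"   # any year safely after 1970 will do

# Succeeds if the given year is plausible for a synchronised clock.
year_is_sane() {
    [ "$1" -ge "$MIN_YEAR" ]
}

# Succeeds if the system clock currently shows a plausible year.
clock_is_sane() {
    year_is_sane "$(date +%Y)"
}

# Example: only report, leaving the choice of resync command to the reader.
# clock_is_sane || echo "clock not set; time sync has not completed"
```

If the check fails once the internet is back, re-running the time sync by hand with the command above may be enough, rather than a further reboot.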
My Pachube feed (36589) froze this morning at about 11 am. I noticed at about 11:45 and rebooted the box. It's been fine ever since. I checked that xap-pachube was still running and everything seemed OK. So it may just be coincidence. I'm still running 301 at the moment.
/var/log/xap-pachube.log is clean by the way.