[Solved] Hah xAP messages stopped.
Hi All,
I have an odd issue where my xAP messages have all but stopped.
Only when I enable the plugboard do I get a few heartbeats, and then that seems to stop too.
I'm using 314/3.4 on Livebox. Not sure when this happened, as most of my functionality has stopped, but I had just decided to get RF plugs and Jogglers back up and running and was trying to debug when I noticed it.
Both my Jogglers are showing black screens, which I know can be caused by many things, but when checking iServer I noticed they weren't even connecting. I can connect to iServer using telnet. It was then I thought to check the xFX Viewer...and saw nothing.
Anyone else seen this?
Outgoing messages go direct to the wire from each process.
Incoming messages all enter via the xap-hub and then are distributed to each process.
If each was sending but not receiving I'd blame the hub, but you are saying "sending" does not happen.
If all the C programs and the Lua code (plugboard) suffer the same fate then it's not a code issue, as each uses separate xAP sending code.
That would imply something going wrong at the OS and/or network level with your hardware.
Check the kernel messages in the "dmesg" output and in /var/log/messages.
My livebox hasn't gone down for a while and it runs 24x7 also on build 314.
# uptime
04:27:07 up 58 days, 55 min, load average: 0.00, 0.01, 0.00
Brett
The most recent version of xFX Viewer requires that an xAP hub is running on the PC (earlier versions had an inbuilt hub fallback). However, as you say you do see some occasional messages (?) in xFX Viewer, I don't think this is the issue.
Is xAP Flash on the Joggler running, and do you see heartbeats from it?
K
Hey Bodge, good to see you back. I've been plodding along fixing bugs and pushing a release out every now and then. What release are you running?
Depending on the answer, quite a lot could have changed with regard to how the configuration files are maintained and where they now live. On the latest release each process has its own configuration file in the directory /etc/xap.d/, which means /etc/xap-livebox.ini is never consulted and should be removed to avoid confusion.
I think if your xFX Viewer is not seeing messages this would be a good place to start, as it is your main diagnostic tool. Do remember that there is a BIG difference between xFX Viewer v3 and v4.
- v3 does not require a hub to work.
- v4 requires that you run a hub before it will work.
Brett
Bodge,
There is a program on the livebox called "xap-snoop". Try running it without any arguments; it will display every xAP packet that is on the wire. It should go a little crazy with output if things are working, and at least lets you verify things are flowing. If you are seeing packets using this program but not using xFX Viewer then you know which one is at fault.
The xap-*.log files in /var/log can often be ZERO length because Linux buffers the process output until the process exits. There is no cause for alarm on this.
Brett
Port 3639 can only be listened on by a single process. This should be the xap-hub. This is where ALL xAP traffic is broadcast to.
Each other process that starts up after the hub will take the next port in sequence and will tell the HUB where they are listening. The HUB when it receives an xAP packet will forward it onto each process that has registered its listening port so they too can receive the message. That's how the hub works.
It's a little like a USB hub: one connection allows multiple devices to all communicate down the same wire to the computer. In our case the wire is ONE port, port 3639, which is where ALL xAP traffic must occur.
# netstat -aup
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 0.0.0.0:3639 0.0.0.0:* 143/xap-hub
udp 0 0 localhost:3640 0.0.0.0:* 147/xap-livebox
udp 0 0 localhost:3641 0.0.0.0:* 146/xap-xively
udp 0 0 localhost:3642 0.0.0.0:* 149/iServer
udp 0 0 localhost:3643 0.0.0.0:* 165/xap-mail
udp 0 0 localhost:3644 0.0.0.0:* 162/lua
The message you get from xap-livebox that something is already listening on port 3639 is normal. This is simply telling you 'HEY I FOUND WHAT I THINK IS A HUB, I'LL GO SEE IF I CAN FIND ANOTHER PORT TO LISTEN ON'. What you did not paste was the additional debug output that shows it doing that.
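The "port in use, try the next one" behaviour described above can be sketched roughly as follows. This is an illustrative Python sketch, not the xaplib2 C code: bind_first_free and its parameters are hypothetical names, and the step where a process tells the hub which port it took is not shown.

```python
import socket

XAP_PORT = 3639  # the port the hub owns, per the explanation above


def bind_first_free(start=XAP_PORT, attempts=10, host="127.0.0.1"):
    """Return (socket, port) for the first UDP port that can be bound.

    Hypothetical helper: walks upward from `start` until bind() succeeds.
    """
    for port in range(start, start + attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.bind((host, port))
            return sock, port
        except OSError:
            # Port already taken (for example by the hub); try the next one.
            sock.close()
    raise RuntimeError("no free port found in range")
```

With a hub already bound to 3639, a call such as bind_first_free() would be expected to come back with 3640 or later, which matches the shape of the netstat listing above.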
One of the major changes I made to allow the portable distribution to evolve was to change the triple dbzoo.livebox.<thing> to dbzoo.<hostname>.<thing>.
I stopped hardcoding the word 'livebox' as part of the address and instead it now uses your hostname.
This also allows you to easily have more than one 'livebox' on your network and avoid xAP path collisions.
The Portable distribution is a branch of code that you can compile on your RaspPi, beaglebone, or any embedded / desktop Linux system.
It works just like the livebox distro (with some caveats).
Brett
It's fine, you are just not reading the output correctly.
udp 0 0 0.0.0.0:3073 0.0.0.0:* 129/xap-hub
udp 0 0 0.0.0.0:3639 0.0.0.0:* 129/xap-hub
You will note that the PID is 129 for each connection. So it's the same process.
One socket is for receiving; one socket is for transmitting.
I guess the transmit socket could be opened and closed when it's needed, but I decided that xaplib2 should hold it open, and that is why you see two ports. Ports 3639 through 3644 and onwards are RX ports.
What it does show is that you have a set of processes running and they have all allocated a port and bolted into the xap-hub as they should. I see no reason why xAP packets would not be sent out of your livebox.
You do know that in xFX Viewer all your messages will be under dbzoo.hal.*, as that is your hostname?
I had a thought about this problem.
The plugboard uses the broadcast address 255.255.255.255 for its messages.
/usr/share/lua/5.1/xap/init.lua
local function getTxPort()
  local udp = socket.udp()
  udp:setoption('broadcast',true)
  udp:setpeername('255.255.255.255', 3639)
  return udp
end
However all the other C programs use a more localized broadcast address according to the subnet they are hosted on based on their netmask.
# xap-xively -d 6
Xively Connector for xAP v12
Copyright (C) DBzoo 2009-2013
[inf][init.c:98:discoverBroadcastNetwork] 2 interfaces found
[inf][init.c:102:discoverBroadcastNetwork] 1) interface: lo
[inf][init.c:102:discoverBroadcastNetwork] 2) interface: br0
[inf][init.c:109:discoverBroadcastNetwork] address: 192.168.1.5
[inf][init.c:114:discoverBroadcastNetwork] broadcast: 192.168.1.255
[ntc][init.c:55:discoverHub] Broadcast socket port 3639 in use
Which means that if you have segmented your network, with your livebox on one network segment and your PC on another, the only messages that will come through will be from Lua, as it uses the GLOBAL broadcast address.
So if you have two LAN networks, 192.168.1.0/24 and 192.168.2.0/24,
xAP packets from network one will not travel to network two if you have set the netmask on your livebox to 255.255.255.0.
# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
127.0.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 lo
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 br0
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 br0
If you look at my BR0 configuration you see its netmask.
# ifconfig br0
br0 Link encap:Ethernet HWaddr 00:07:3A:11:22:00
inet addr:192.168.1.5 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2447157 errors:0 dropped:0 overruns:0 frame:0
TX packets:1434022 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:257977076 (246.0 MiB) TX bytes:158241951 (150.9 MiB)
If you widen this to encompass both networks:
# ifconfig br0 netmask 255.255.0.0 broadcast 192.168.255.255
Now when we start our xAP processes their payload will traverse both subnets, as the broadcast covers both and the switch will forward it:
# xap-xively -d 6
Xively Connector for xAP v12
Copyright (C) DBzoo 2009-2013
[inf][init.c:98:discoverBroadcastNetwork] 2 interfaces found
[inf][init.c:102:discoverBroadcastNetwork] 1) interface: lo
[inf][init.c:102:discoverBroadcastNetwork] 2) interface: br0
[inf][init.c:109:discoverBroadcastNetwork] address: 192.168.1.5
[inf][init.c:114:discoverBroadcastNetwork] broadcast: 192.168.255.255
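The effect of the two netmasks can be checked with a few lines of Python. This is only a sketch using the standard ipaddress module; broadcast_for is a hypothetical helper name, and the PC address 192.168.2.10 is an assumed example, not one taken from this thread.

```python
import ipaddress


def broadcast_for(ip, netmask):
    """Broadcast address of the subnet that ip/netmask belongs to."""
    return ipaddress.ip_interface(f"{ip}/{netmask}").network.broadcast_address


narrow = ipaddress.ip_interface("192.168.1.5/255.255.255.0").network
wide = ipaddress.ip_interface("192.168.1.5/255.255.0.0").network

print(broadcast_for("192.168.1.5", "255.255.255.0"))  # 192.168.1.255
print(broadcast_for("192.168.1.5", "255.255.0.0"))    # 192.168.255.255

pc = ipaddress.ip_address("192.168.2.10")  # assumed PC on the second LAN
print(pc in narrow)  # False: the /24 broadcast does not cover this PC
print(pc in wide)    # True: the /16 broadcast covers both LANs
```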
Does all this make sense? Is this happening to you?
Theoretically xap-plugboard has a defect: it should be using a LAN-based broadcast address, and then you would not be seeing its packets either. :)
Brett
Your broadcast does not match your mask.
If you have a mask of 255.255.255.0 on the IP 172.16.10.100 you should have a bcast of 172.16.10.255.
But instead you have a bcast of 172.16.255.255, which does not make sense. To me at least.
If that is your broadcast and your Windows box also has a netmask of 255.255.255.0 then it won't see it (which is what you are saying is happening). Your Windows box would need a mask of 255.255.0.0 to see that broadcast frame.
Can you reconfigure the livebox's bcast address manually?
# ifconfig br0 broadcast 172.16.10.255
Does that take? If so, do you start seeing xAP packets on your Windows box?
- What subnets have you set up?
- Do you use just the one 172.16.10.0/24 space?
- Do you route between subnets?
- We can't debug this seeing only one IP address, given this is a network issue. How is your Windows box set up?
We are starting to figure out the cause here.
Brett
OK, here is what we are going to do. I'm going to add an option so that you can override the broadcast and force the system to use the all-ones (think binary) broadcast, 255.255.255.255,
given we know this is working from the xap-plugboard Lua sub-system.
What you need to do is edit /etc/xap.d/system.ini and in the [network] section add the item bcast_ones=1
I will set the livebox beta 316.1 to upload to the site in 1 hour from this post.
(The wifey is watching something on YouTube and if I do it now she will kill me when the network gets trashed by my 5Mb upload.)
The change is pretty straightforward: https://code.google.com/p/livebox-hah/source/detail?r=628
Brett
The BETA is up there now. Sorry, I left it scheduled and went to bed; I guess I should have said that.
Before you update there is one more test I would like to try though. You reset the broadcast but you did not restart all the xAP components, so they did not pick it up. Let's redo that test.
# ifconfig br0 broadcast 172.16.10.255
# /etc/init.d/xap restart
To pick up the beta 316.1
# /etc_ro_fs/update-dev hah-beta.dbzoo.com
To summarise this thread here is what was happening.
You are using an address from the private 172.16.0.0/12 range, which is CLASS B space (default mask 255.255.0.0), however you are subnetting this down to a CLASS C sized network for your own usage as 172.16.10.0/24.
When the broadcast address is determined on the livebox it uses the Class B broadcast 172.16.255.255 even though you had a Class C mask. Arguably that is a bug in the livebox OS.
When you modified br0 to use the Class C broadcast for your subnet, 172.16.10.255, things started working once you restarted the xAP components to pick up this new broadcast address.
To work around this there would be two options.
- Have some shell script modify the br0 broadcast before all the xAP components start up.
- Or use the local broadcast 255.255.255.255 and ignore any broadcast set on your network interface.
We went for the 2nd option, which can be enabled with a setting in /etc/xap.d/system.ini
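As a rough illustration of that override, the sketch below reads a [network] section with bcast_ones and picks the broadcast address accordingly. It is a Python sketch only: the real xAP components are C programs and their parsing code is not shown in this thread, so choose_broadcast and the SAMPLE text are hypothetical; only the section and item names come from the posts above.

```python
import configparser

# Sample contents modelled on the setting described for /etc/xap.d/system.ini
SAMPLE = """
[network]
bcast_ones=1
"""


def choose_broadcast(ini_text, interface_broadcast):
    """Return 255.255.255.255 when bcast_ones is enabled, else the interface value."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    # A missing section or item is treated as "override disabled".
    force = cfg.getboolean("network", "bcast_ones", fallback=False)
    return "255.255.255.255" if force else interface_broadcast


print(choose_broadcast(SAMPLE, "172.16.255.255"))  # 255.255.255.255
```

Setting bcast_ones=0, or leaving the item out, falls back to whatever broadcast address the interface reports.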
Glad it's working for you.
-- RCA --
I've been trying to think why this happened, and now I know. It was during this change, which was required to support the portable distribution correctly so that the right ethernet interface was picked up.
http://code.google.com/p/livebox-hah/source/detail?spec=svn629&r=563
I used to calculate the broadcast myself based on the IP address and mask. Then I decided it would be easier to just query the interface and ask for its broadcast.
When I was calculating it myself I correctly derived 172.16.10.255 from the address 172.16.10.100 and mask 255.255.255.0.
But when I queried the NIC I would get back 172.16.255.255, which is clearly wrong.
As it turns out, calculating it myself was hiding the fact that the NIC had the wrong broadcast set up on it by the OS. I was covering up an OS problem.
Damn.....
What I'm going to do is put back this broadcast calculation logic so that you don't need to use the 255.255.255.255 override. However I'll leave that option in the code just in case.
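For reference, the arithmetic behind that calculation is: broadcast = address OR (NOT mask). The snippet below is a Python sketch of that formula only, not the C code from the repository; calc_broadcast is a hypothetical name.

```python
import socket
import struct


def calc_broadcast(ip, netmask):
    """Compute the subnet broadcast as ip OR (bitwise NOT of netmask)."""
    ip_n = struct.unpack("!I", socket.inet_aton(ip))[0]
    mask_n = struct.unpack("!I", socket.inet_aton(netmask))[0]
    # Mask to 32 bits because Python integers are unbounded.
    bcast = ip_n | (~mask_n & 0xFFFFFFFF)
    return socket.inet_ntoa(struct.pack("!I", bcast))


print(calc_broadcast("172.16.10.100", "255.255.255.0"))  # 172.16.10.255
print(calc_broadcast("172.16.10.100", "255.255.0.0"))    # 172.16.255.255
```

The second call shows that the value reported by the NIC is what you would get from a 255.255.0.0 mask rather than the 255.255.255.0 mask actually configured.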
Thx ... this bug had hair on it.
Brett
Bodge,
If you remove that system.ini setting that forces the broadcast to 255.255.255.255 and update to 316.2, where I recompute the broadcast from your IP/mask settings, you should find things still work for you.
That bcast_ones=1 setting should not be necessary anymore; you can change it to 0 to disable it.
Brett
ps: I've left the ability to override the broadcast in the system just in case.
Not seen this myself but have you checked your logs?
cd /var/log
See if anything in there gives you a clue.