This might sound strange, but when we connect a DTH ATCA board in one of our development setups to the GPN, the IPMC firmware crashes within seconds. Our network expert is looking into it, but so far we think we can exclude every cause other than the IPMC being upset by packets received over Ethernet. This is what our experts have reported:
DTH powercycles shortly after RJ45 uplink is connected
After extensive testing in the (Tracker crate?), any electrical issues (such as short circuits) were ruled out. The problem was traced to GPN traffic and the IPMC.
When an Ethernet cable (RJ45) is connected to the uplink of the DTH board, the IPMC hangs within a few seconds, reporting “UNHANDLED EXCEPTION: HARD FAULT,” and is subsequently reset by the watchdog. This results in a power cycle of the DTH.
We could not reproduce the issue anywhere other than this setup, but in this Tracker dev setup it happens every time, on both ATCA blades in the crate. Network misconfiguration could also be excluded. We had a look at the network traffic and found nothing unexpected.
Question:
was such a behavior seen anywhere else?
any hint what we can test?
are there maybe newer framework versions available with upgraded ip libraries?
An interesting observation that you have! I have not heard of anyone else reporting anything similar.
I would be surprised if the IPMC could be swamped by IP packets that are not addressed to it. So I would think it is rather its network configuration that may be incorrect. Do you use a static IP address or DHCP? In the latter case, do you know whether it has obtained its IP address?
Is the debug port of the IPMC connected, and can you read what it prints when booting? Or can you read the LAN configuration using IPMI, e.g.
ipmitool -H <ipmc_address> -P "" lan print 5
or
ipmitool -H <shmm_address> -P "" -t <ipmc_hw_address> lan print 5
I do not know the internal network structure of the DTH, so I am wondering whether the IPMC is connected to GPN directly or via the DTH internal network. Is there any way to know the status of the internal network?
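If the `lan print` check has to be repeated on several blades, its output can be screened automatically. The following is only a minimal sketch: it assumes the usual `Key : Value` layout that `ipmitool lan print` produces, and the sample text is made up for illustration, not taken from the DTH.

```python
import ipaddress

def parse_lan_print(text):
    """Parse 'Key : Value' lines, as printed by `ipmitool lan print`, into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so MAC addresses stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key:  # skip continuation lines that have no key
            fields[key] = value.strip()
    return fields

def has_usable_ip(fields):
    """Return True if 'IP Address' is a valid address other than 0.0.0.0."""
    try:
        addr = ipaddress.ip_address(fields.get("IP Address", ""))
    except ValueError:
        return False
    return not addr.is_unspecified

# Illustrative sample only (hypothetical values, not a real DTH readout).
SAMPLE = """\
IP Address Source       : DHCP Address
IP Address              : 0.0.0.0
Subnet Mask             : 0.0.0.0
MAC Address             : 00:11:22:33:44:55
"""

if __name__ == "__main__":
    lan = parse_lan_print(SAMPLE)
    print("IP source:", lan.get("IP Address Source"))
    print("usable IP:", has_usable_ip(lan))
```

An address of 0.0.0.0 together with a DHCP source would suggest that the IPMC never completed its DHCP exchange.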
Can we include our network expert Petr here? He has done all the detailed tests and can better explain all the configurations of the internal switch he has tried.
Sorry to hear that you are having problems; however, we will need additional information in order to be able to help:
Are you using the Xilinx virtual cable server running on the IPMC?
Do you use serial-over-LAN?
Can you please send the complete output from the IPMC serial debug UART?
Can you give us access to the IPMC project that you use to build the firmware?
To answer your questions:
was such a behavior seen anywhere else?
No. For instance the ATCA blades already in ATLAS that all use the CERN IPMC have been operating without any issues throughout Run-3.
any hint what we can test?
As Ralf already said, we need more details about your network architecture, since you are suspecting a networking issue.
are there maybe newer framework versions available with upgraded ip libraries?
No, there are no recent updates from PigeonPoint. When was the IPMC firmware that is being used built?
I would also normally advise against connecting the IPMC (or any ATCA equipment, for that matter) directly to GPN if it can be avoided. However, in case this is required, I would at least define the LAN control sets to protect the ATCA endpoints.
At the moment, the problem is reproducible in multiple ATCA crates in
B186, which is our CMS Tracker test installation. It does not happen,
for example, in B40 in the CMS DAQ test system.
The IPMC hangs with the message “UNHANDLED EXCEPTION: HARD FAULT”,
followed by a register dump. The full log is attached. Afterwards, the
board is reset by the internal watchdog.
Arguably, an unhandled exception should never happen, so this appears to
be something critical. Since it seems to be related to network, I can
imagine there may be a bug in the packet processing function, where the
software receives something unexpected and is unable to handle it
correctly. At the moment, it is just a speculation that needs to be
verified.
The fact that we are seeing this for the first time is probably just a
statistical coincidence. The number of people using it on GPN is slowly
increasing, so we may see this more often in the future.
Ulf and I will try to reproduce the issue on the IPMC development board,
i.e. outside of the ATCA crate. The only potential downside is that the
IPMC is without the connection to the shelf manager, so let’s see if
that is reproducible.
In the meantime, it would be helpful if somebody could check where in
the code the unhandled exception occurred based on the register dump.
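As a starting point for that lookup, here is a minimal sketch of how register values from the dump could be turned into addresses to resolve against the firmware ELF. It assumes that the IPMC core is an ARM Cortex-M running Thumb code (which the "HARD FAULT" wording suggests) and that the firmware was built with a GNU toolchain with debug information; the ELF name `ipmc.elf` and the example LR value are hypothetical.

```python
def is_exc_return(lr):
    """On Cortex-M, LR = 0xFFFFFFxx inside a handler is an EXC_RETURN code, not a code address."""
    return (lr & 0xFFFFFF00) == 0xFFFFFF00

def thumb_call_site_candidates(lr):
    """Candidate addresses of the call instruction for a Thumb return address."""
    ret = lr & ~1  # bit 0 only marks Thumb state
    # 32-bit BL/BLX <imm> sits 4 bytes before the return address,
    # 16-bit BLX <reg> sits 2 bytes before it.
    return [ret - 4, ret - 2]

def addr2line_command(elf_path, addresses):
    """Build an addr2line invocation that prints function and file:line."""
    return ["arm-none-eabi-addr2line", "-f", "-e", elf_path] + [hex(a) for a in addresses]

if __name__ == "__main__":
    lr = 0x08001235  # hypothetical value; use the LR from the register dump
    if is_exc_return(lr):
        print("LR is an EXC_RETURN code; use the stacked LR from the dump instead")
    else:
        print(" ".join(addr2line_command("ipmc.elf", thumb_call_site_candidates(lr))))
```

If the dump shows the stacked exception frame, the stacked LR is the value to feed in; the caller it resolves to is the last function known to have run before control was lost.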
I don’t know what the Xilinx virtual cable server is; if it is not part of the standard setup, we are not using it.
Serial-over-LAN is connected to the console of the RTM Zynq.
This is our repository: https://gitlab.cern.ch/cms-daq/ipmc (I guess you don’t have access, as the rights are set at an upper level). We can provide you with the directory we upload for compilation.
The serial debug output is currently full of sensor readout debug messages; we are planning to reduce this.
Depending on the frequency, printing out the full sensor readout debug messages can slow down the IPMC considerably and deprive it of the ability to look after other, potentially essential, tasks …
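To get a feeling for the numbers, a rough estimate can be sketched as follows. The assumptions are mine and not taken from the IPMC: a debug UART at 115200 baud with 8N1 framing (10 bits on the wire per byte) and prints that block until the bytes are sent.

```python
def uart_busy_fraction(bytes_per_second, baud=115200, bits_per_byte=10):
    """Fraction of wall-clock time the UART needs to transmit the given byte rate.

    bits_per_byte=10 corresponds to 8N1 framing (start bit + 8 data bits + stop bit).
    """
    return bytes_per_second * bits_per_byte / baud

if __name__ == "__main__":
    # Hypothetical load: 20 debug lines of 80 characters per second.
    load = 20 * 80
    print(f"UART busy {uart_busy_fraction(load):.1%} of the time")
```

With blocking output, that fraction of every second is not available to the rest of the main loop, and any value approaching 1 means that the prints alone saturate the link.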
Uff. That Discourse system is a disaster. It already refused the attachment when I sent it by email, and when I pasted the content directly into the email, it removed the most important part.
Sending it again with the email addresses explicitly mentioned. Please see the attached file.
Thanks, this time I got the full file. I agree about Discourse being hot garbage; however, this is apparently the solution that CERN IT advocates.
Nothing in the log appears to relate obviously to the crash; however, I have a few observations:
After the reboot, I do not see the message about the IPMC receiving its IP address. Is your DHCP server running? There should be a message similar to this at some point:
<_>: LAN iface 0: IP = x.x.x.x
Ralf is correct: if you are continuously printing debug output from the sensor drivers, this could potentially slow down the main loop so much that it eventually crashes. Once the sensor drivers are tested and working, the debug output should be disabled.
I have never seen the following error message before:
: Deprecated non-NULL member ‘first’ of group structure detected.
This could indicate some problem in your project.
BTW, do you use LAN control sets to protect the IPMC and the SoC on your ATCA cards?
Thanks for the initial check. Please see my reply inline:
I see. I can see this line in the log. The worrisome part is that it is 0.0.0.0. That being said, on GPN there is always a DHCP server assigning some IP address, either the default CERN IT server or a custom one. I can check with the CMS Tracker group whether they are running their own DHCP server, if necessary. However, how can that cause “UNHANDLED EXCEPTION: HARD FAULT”? The only explanation I can think of is that something goes wrong in the lightweight TCP/IP stack used by the IPMC software, or somewhere around it. Looking at the register dump, the PC register is zero, so this appears to be a classic jump to an invalid address. If I am not mistaken, the LR register should contain the return address from the current function. Can we check where it points in the source code? I guess we would need the firmware binary from Ulf for that?
The problem does not appear in our lab; there the IPMC works correctly with the same level of debug output. If the main loop is slow, can it crash exactly with “UNHANDLED EXCEPTION: HARD FAULT”?
OK, this is probably something for Ulf to investigate. However, since this message is also present in logs where the IPMC works correctly in our lab, I currently consider it unrelated to the problem we are debugging.
I am not entirely sure what is meant here. However, there are currently no explicit QoS or ACL protections configured for the network.
Best regards, Petr
Yes, this is expected and is printed immediately at boot. However, the IP address assignment via DHCP comes later, after a few seconds.
Well, I don’t know; we never saw this error before, so I guess it could be anything …
I’m guessing in this case it gets assigned a proper IP address?
Again, I don’t know, but it can certainly create problems. In any case, it’s better to remove the debug printout if you don’t need it anymore.
Maybe, but it could also be a combination of causes. In any case, it would be good to fix this.