Ethernet traffic is crashing IPMC

Hi all,

This might sound strange, but when we connect a DTH ATCA board in one of our development setups to the GPN, the IPMC firmware crashes within seconds. Our network expert is still looking into it, but we think we can now exclude every cause other than the IPMC being upset by packets received over Ethernet. This is what our experts have reported:

  • DTH powercycles shortly after RJ45 uplink is connected

  • After extensive testing in the (Tracker crate?), any electrical issues (such as short circuits) were ruled out. The problem was traced to GPN traffic and the IPMC.

  • When an Ethernet cable (RJ45) is connected to the uplink of the DTH board, the IPMC hangs within a few seconds, reporting “UNHANDLED EXCEPTION: HARD FAULT,” and is subsequently reset by the watchdog. This results in a power cycle of the DTH.

We could not reproduce the issue anywhere other than this setup, but in this TRACKER dev setup it happens every time, on both ATCA blades in the crate. Network misconfigurations could also be excluded. We had a look at the network traffic and found nothing unexpected.

Questions:

  • was such a behavior seen anywhere else?
  • any hint what we can test?
  • are there maybe newer framework versions available with upgraded ip libraries?

thanks for any help

Ulf

Hi Ulf,

An interesting observation that you have! I have not heard of anyone else reporting anything similar.

I would be surprised if the IPMC could be swamped by IP packets that are not addressed to it. So I would think it is rather its network configuration that may be incorrect. Do you use a static IP address or DHCP? In the latter case, do you know whether it has obtained its IP address?

Is the debug port of the IPMC connected, and can you read what it prints when booting? Or can you read the LAN configuration using IPMI, e.g.

ipmitool -H <ipmc_address> -P "" lan print 5

or

ipmitool -H <shmm_address> -P "" -t <ipmc_hw_address> lan print 5
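If these checks are to be scripted, the two invocations above could be assembled as in the following minimal Python sketch. The helper name `lan_print_command` and the example addresses are hypothetical; only the `ipmitool` options `-H`, `-P`, `-t` and the `lan print <channel>` subcommand are taken from the commands above.

```python
import subprocess


def lan_print_command(host, channel=5, password="", bridge_target=None):
    """Build the argument list for 'ipmitool ... lan print <channel>'.

    host          -- address of the IPMC, or of the shelf manager when bridging
    bridge_target -- hardware address of the IPMC to bridge to (adds -t), or None
    """
    cmd = ["ipmitool", "-H", host, "-P", password]
    if bridge_target is not None:
        cmd += ["-t", hex(bridge_target)]
    cmd += ["lan", "print", str(channel)]
    return cmd


def run_lan_print(*args, **kwargs):
    """Run the command and return its text output (needs ipmitool installed)."""
    result = subprocess.run(lan_print_command(*args, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical example addresses, for illustration only.
direct = lan_print_command("192.0.2.10")
bridged = lan_print_command("192.0.2.20", bridge_target=0x82)
```

Passing the empty password as its own list element avoids any shell quoting issues with `-P ""`.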

I do not know the internal network structure of the DTH, so I am wondering whether the IPMC is connected directly to the GPN or via the DTH internal network. Is there any way to know the status of the internal network?

Cheers,
Ralf

Hi Ralf,

can we include our network expert Petr here? He has done all the detailed tests and he can explain better all the configurations of the internal switch he has tried.

cheers

Ulf

Hello Ulf,

sorry to hear that you are having problems. However, we will need additional information in order to be able to help:

  • Are you using the Xilinx virtual cable server running on the IPMC?

  • Do you use serial-over-LAN?

  • Can you please send the complete output from the IPMC serial debug UART?

  • Can you give us access to the IPMC project that you use to build the firmware?

To answer your questions:

  • was such a behavior seen anywhere else?

No. For instance, the ATCA blades already in ATLAS, which all use the CERN IPMC, have been operating without any issues throughout Run-3.

  • any hint what we can test?

As Ralf already said, we need more details about your network architecture, since you are suspecting a networking issue.

  • are there maybe newer framework versions available with upgraded ip libraries?

No, there are no recent updates from PigeonPoint. When was the IPMC firmware that is being used built?

I would also normally advise against connecting the IPMC (or any ATCA equipment for that matter) directly to GPN if it can be avoided. However, if this is required, I would at least define the LAN control sets to protect the ATCA endpoints.

cheers,

Stefan

Hi Ralf, Ulf,

At the moment, the problem is reproducible in multiple ATCA crates in
B186, which is our CMS Tracker test installation. It does not happen,
for example, in B40 in the CMS DAQ test system.

The IPMC hangs with the message “UNHANDLED EXCEPTION: HARD FAULT”,
followed by a register dump. The full log is attached. Afterwards, the
board is reset by the internal watchdog.

Arguably, an unhandled exception should never happen, so this appears to
be something critical. Since it seems to be related to the network, I can
imagine there may be a bug in the packet processing function, where the
software receives something unexpected and is unable to handle it
correctly. At the moment, this is just speculation that needs to be
verified.

The fact that we are seeing this for the first time is probably just a
statistical coincidence. The number of people using it on GPN is slowly
increasing, so we may see this more often in the future.

Ulf and I will try to reproduce the issue on the IPMC development board,
i.e. outside of the ATCA crate. The only potential downside is that the
IPMC will have no connection to the shelf manager, so let’s see whether
the issue is reproducible there.

In the meantime, it would be helpful if somebody could check where in
the code the unhandled exception occurred based on the register dump.
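
As a rough illustration of how such a lookup could be done: the sketch below turns register values from a dump into an `addr2line` command line. It assumes an ARM Cortex-M core (where bit 0 of a code pointer is the Thumb flag and an LR value of the form 0xFFxxxxxx inside a handler is an exception-return marker rather than a code address), a GNU toolchain providing `arm-none-eabi-addr2line`, and an ELF file with debug symbols. The file name `ipmc.elf` and the register values are hypothetical, not taken from the actual dump.

```python
def code_address(reg_value):
    """Clear bit 0 (the Thumb state flag) to get the instruction address."""
    return reg_value & ~1


def is_exc_return(value):
    """Heuristic: 0xFFxxxxxx in LR marks an exception return, not code."""
    return (value & 0xFF000000) == 0xFF000000


def addr2line_command(elf_path, register_values, tool="arm-none-eabi-addr2line"):
    """Build an addr2line call for every value that can be a code address.

    Zero values and exception-return markers are skipped, since they do
    not point into the firmware image.
    """
    addrs = [hex(code_address(v)) for v in register_values
             if v != 0 and not is_exc_return(v)]
    # -f: print function names, -C: demangle, -i: show inlined frames
    return [tool, "-e", elf_path, "-f", "-C", "-i", *addrs]


# Hypothetical values: PC = 0, a return address, and an exception-return marker.
cmd = addr2line_command("ipmc.elf", [0x00000000, 0x08001235, 0xFFFFFFF9])
print(" ".join(cmd))
```

If the fault handler prints the stacked exception frame, the stacked LR and PC from that frame are usually the more useful inputs.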

Best regards,
Petr

(Attachment IPMC.txt is missing)

The file IPMC.txt was rejected with: [CERN IPMC] Email issue –
Attachment Rejected.

Adding it here:

ltc2991 mux:23/i2c:49 V34 6494 mA
ltc out: 3808
72637600 nV
ltc2991 mux:25/i2c:49 V12 2504 mA
ltc out: 1128
21516600 nV
ltc2991 mux:25/i2c:49 V78 1501 mA
ltc out: 1132
21592900 nV
ltc2991 mux:0/i2c:48 V12 222 mA
ltc out: 1391
26533325 nV
ltc2991 mux:24/i2c:48 V12 265 mA
ltc out: 1902
36280650 nV
ltc2991 mux:24/i2c:48 V34 362 mA
ltc out: 1348
25713100 nV
ltc2991 mux:24/i2c:48 V56 257 mA
ltc out: 1132
21592900 nV
ltc2991 mux:24/i2c:48 V78 863 mA
I2C dev read error, I2C address: 0250
READ ERROR read_sensor_fpga_temp_mux mux channel# 0x83
I2C dev read error, I2C address: 0250
READ ERROR read_sensor_fpga_temp_mux mux channel# 0x85
ltc out: 1081
20620075 nV
ltc2991 mux:23/i2c:48 V12 206 mA
ltc out: 1189
22680175 nV
ltc2991 mux:23/i2c:48 V34 226 mA
ltc out: 1275
24320625 nV
ltc2991 mux:23/i2c:48 V56 838 mA
ltc out: 1322
25217150 nV
ltc2991 mux:23/i2c:48 V78 869 mA
ltc out: 1084
20677300 nV
ltc2991 mux:23/i2c:49 V12 1442 mA
ltc out: 9862
188117650 nV
ltc2991 mux:23/i2c:49 V34 6486 mA
ltc out: 3810
72675750 nV
ltc2991 mux:25/i2c:49 V12 2506 mA
ltc out: 1128
21516600 nV
ltc2991 mux:25/i2c:49 V78 1501 mA
ltc out: 1133
21611975 nV
ltc2991 mux:0/i2c:48 V12 222 mA
ltc out: 1497
28555275 nV
ltc2991 mux:24/i2c:48 V12 285 mA
ltc out: 2064
39370800 nV
ltc2991 mux:24/i2c:48 V34 393 mA
ltc out: 1620
30901500 nV
ltc2991 mux:24/i2c:48 V56 309 mA
ltc out: 1141
21764575 nV
ltc2991 mux:24/i2c:48 V78 870 mA
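
A note on reading the log above: each `ltc out` raw count and the `nV` line that follows it are consistent with a scale factor of 19075 nV per count (19.075 µV, which matches the differential-voltage LSB given in the LTC2991 datasheet). The short sketch below only checks that arithmetic against a few pairs copied from the log. The function name is hypothetical, and the conversion from nV to mA is not covered because it depends on shunt values that are not stated here.

```python
NV_PER_COUNT = 19075  # 19.075 uV per count, inferred from the log pairs


def ltc_counts_to_nv(raw_count):
    """Convert an 'ltc out' raw count to nanovolts."""
    return raw_count * NV_PER_COUNT


# (raw count, nV) pairs copied from the log above
log_pairs = [
    (3808, 72637600),
    (1128, 21516600),
    (1391, 26533325),
    (1902, 36280650),
    (9862, 188117650),
]

for raw, expected_nv in log_pairs:
    assert ltc_counts_to_nv(raw) == expected_nv
```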

Hello Petr,

the file appears to be truncated; at least I can’t see the error message you mentioned. Can you please send the file directly by e-mail to us?

cheers,

Stefan

Hi Stefan,

I don’t know what the Xilinx virtual cable server is; if it’s not part of the standard setup, we are not using it.
Serial-over-LAN is connected to the console of the RTM Zynq.
This is our repository: https://gitlab.cern.ch/cms-daq/ipmc (I guess you don’t have access, and the rights are set at an upper level). We can provide you with the directory we are uploading for compilation.
The serial debug output is right now full of sensor readout debug messages; we are planning to reduce this.

cheers

Ulf

Hi Ulf,

Depending on the frequency, printing out the full sensor readout debug messages can slow down the IPMC considerably and deprive it of the ability to look after other, potentially essential, tasks …
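
To get a feeling for the cost, here is a rough back-of-the-envelope sketch. It assumes a debug UART at 115200 baud with 8N1 framing (10 bits on the wire per byte), CR LF line endings and blocking writes; none of these are confirmed for this firmware. The three sample lines are copied from the log posted earlier in this thread.

```python
def uart_seconds(num_bytes, baud=115200, bits_per_byte=10):
    """Time needed to push num_bytes through a UART (8N1 = 10 bits/byte)."""
    return num_bytes * bits_per_byte / baud


# One sensor reading as it appears in the posted log
sample = [
    "ltc out: 3808",
    "72637600 nV",
    "ltc2991 mux:25/i2c:49 V12 2504 mA",
]

# +2 per line for CR LF (assumed line ending)
bytes_per_reading = sum(len(line) + 2 for line in sample)

ms_per_reading = 1000 * uart_seconds(bytes_per_reading)
print(f"{bytes_per_reading} bytes, about {ms_per_reading:.1f} ms per reading")
```

Multiplied by the number of channels and the readout rate, this gives an estimate of the fraction of time the main loop would spend just printing, if the writes really are blocking.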

Cheers,
Ralf

Hi Stefan,

Uff. That Discourse system is a disaster. It already refused the attachment when I sent it by email. And when I pasted the content directly into the email, it removed the most important part.

Sending it again with the email addresses explicitly mentioned. Please see the attached file.

Best regards,
Petr

(Attachment IPMC.txt is missing)

Hello Petr,

thanks, this time I got the full file. I agree about Discourse being hot garbage; however, this is apparently the solution that CERN IT advocates.

There appears to be nothing obviously related to the crash in the log; however, I have a few observations:

  • After the reboot, I do not see the message about the IPMC receiving its IP address. Is your DHCP server running? There should be a message similar to this at some point:

<_>: LAN iface 0: IP = x.x.x.x

  • Ralf is correct: if you are continuously printing debug output from the sensor drivers, this could potentially slow down the main loop so much that it eventually crashes. Once the sensor drivers are tested and working, the debug output should be disabled.

  • I have never seen the following error message before:

: Deprecated non-NULL member ‘first’ of group structure detected.

This could indicate some problem in your project.

BTW, do you use LAN control sets to protect the IPMC and the SoC on your ATCA cards?

cheers,

Stefan

Hi Stefan,

Thanks for the initial check. Please see my reply inline:

I see. I can see this line in the log: The worrisome part is that it is 0.0.0.0. That being said, on GPN there is always a DHCP server assigning some IP address, either the default CERN IT server or a custom one. I can check with the CMS Tracker group whether they are running their own DHCP server, if necessary. However, how can that cause “UNHANDLED EXCEPTION: HARD FAULT”? The only explanation I can think of is that something goes wrong in the lightweight TCP/IP stack used by the IPMC software, or somewhere around it. Looking at the register dump, the PC register is zero, so this appears to be a classic jump to an invalid address. If I am not mistaken, the LR register should contain the return address from the current function. Can we check where it points in the source code? I guess we would need the firmware binary from Ulf for that?

The problem does not appear in our lab; there the IPMC works correctly with the same level of debug output. If the main loop is slow, can it crash exactly with “UNHANDLED EXCEPTION: HARD FAULT”?

Ok, this is probably something for Ulf to investigate. However, since this message is also present in logs where the IPMC works correctly in our lab, I currently consider it unrelated to the problem we are debugging.

I am not entirely sure what is meant here. However, there are currently no explicit QoS or ACL protections configured for the network.

Best regards,
Petr

Hello Petr,

Yes, this is expected and is printed immediately at boot. However, the IP address assignment via DHCP comes later, after a few seconds.

Well, I don’t know, we never saw this error before, so I guess it could be anything …

I’m guessing in this case it gets assigned a proper IP address?

Again, I don’t know, but it can certainly create problems. In any case, it’s better to remove the debug printout if you don’t need it anymore.

Maybe, but it could also be a combination of causes. In any case, it would be good to fix this.