Ethernet traffic is crashing IPMC

behrens · May 7, 2026, 4:16pm

Hi all,

This might sound strange, but when we connect a DTH ATCA board in one of our development setups to the GPN, the IPMC firmware crashes within seconds. Our network expert is still looking into it, but we think we can exclude every cause other than the IPMC being upset by packets received over Ethernet. This is what our experts have reported:

DTH powercycles shortly after RJ45 uplink is connected
After extensive testing in the (Tracker crate?), any electrical issues (such as short circuits) were ruled out. The problem was traced to GPN traffic and the IPMC.
When an Ethernet cable (RJ45) is connected to the uplink of the DTH board, the IPMC hangs within a few seconds, reporting “UNHANDLED EXCEPTION: HARD FAULT,” and is subsequently reset by the watchdog. This results in a power cycle of the DTH.

We could not reproduce the issue anywhere other than in this setup, but in this Tracker dev setup it happens every time, on both ATCA blades in the crate. Network misconfigurations could also be excluded. We had a look at the network traffic and found nothing unexpected.

Question:

was such a behavior seen anywhere else?
any hint what we can test?
are there maybe newer framework versions available with upgraded ip libraries?

thanks for any help

Ulf

spiwoks · May 8, 2026, 7:23am

Hi Ulf,

An interesting observation that you have! I have not heard of anyone else reporting anything similar.

I would be surprised if the IPMC could be swamped by IP packets that are not addressed to it. So I would think it is rather its network configuration that may be incorrect. Do you use a static IP address or DHCP? In the latter case, do you know whether it has obtained its IP address?

Is the debug port of the IPMC connected, and can you read what it prints when booting? Or can you read the LAN configuration using IPMI, e.g.

ipmitool -H <ipmc_address> -P "" lan print 5

or

ipmitool -H <shmm_address> -P "" -t <ipmc_hw_address> lan print 5

I do not know the internal network structure of the DTH, so I am wondering whether the IPMC is directly connected to GPN or via the DTH internal network. Is there any way to know the status of the internal network?

Cheers,
Ralf

behrens · May 8, 2026, 7:40am

Hi Ralf,

can we include our network expert Petr here? He has done all the detailed tests and he can explain better all the configurations of the internal switch he has tried.

cheers

Ulf

haass · May 8, 2026, 8:19am

Hello Ulf,

sorry to hear that you are having problems. However, we will need additional information in order to be able to help:

Are you using the Xilinx virtual cable server running on the IPMC?
Do you use serial-over-LAN?
Can you please send the complete output from the IPMC serial debug UART?
Can you give us access to the IPMC project that you use to build the firmware?

To answer your questions:

was such a behavior seen anywhere else?

No. For instance the ATCA blades already in ATLAS that all use the CERN IPMC have been operating without any issues throughout Run-3.

any hint what we can test?

As Ralf already said, we need more details about your network architecture, since you are suspecting a networking issue.

are there maybe newer framework versions available with upgraded ip libraries?

No, there are no recent updates from PigeonPoint. When was the IPMC firmware that is being used built?

I would also normally advise against connecting the IPMC (or any ATCA equipment for that matter) directly to GPN if it can be avoided. However in case this is required, I would at least define the LAN control sets to protect the ATCA endpoints.

cheers,

Stefan

pzejdl · May 8, 2026, 10:54am

Hi Ralf, Ulf,

At the moment, the problem is reproducible in multiple ATCA crates in
B186, which is our CMS Tracker test installation. It does not happen,
for example, in B40 in the CMS DAQ test system.

The IPMC hangs with the message “UNHANDLED EXCEPTION: HARD FAULT”,
followed by a register dump. The full log is attached. Afterwards, the
board is reset by the internal watchdog.

Arguably, an unhandled exception should never happen, so this appears to
be something critical. Since it seems to be related to the network, I can
imagine there may be a bug in the packet processing function, where the
software receives something unexpected and is unable to handle it
correctly. At the moment, this is just speculation that needs to be
verified.

The fact that we are seeing this for the first time is probably just a
statistical coincidence. The number of people using it on GPN is slowly
increasing, so we may see this more often in the future.

Ulf and I will try to reproduce the issue on the IPMC development board,
i.e. outside of the ATCA crate. The only potential downside is that the
IPMC is without the connection to the shelf manager, so let’s see if
that is reproducible.

In the meantime, it would be helpful if somebody could check where in
the code the unhandled exception occurred based on the register dump.

Best regards,
Petr

pzejdl · May 8, 2026, 11:02am

The file IPMC.txt was rejected with: [CERN IPMC] Email issue –
Attachment Rejected.

Adding it here:

ltc2991 mux:23/i2c:49 V34 6494 mA
ltc out: 3808
72637600 nV
ltc2991 mux:25/i2c:49 V12 2504 mA
ltc out: 1128
21516600 nV
ltc2991 mux:25/i2c:49 V78 1501 mA
ltc out: 1132
21592900 nV
ltc2991 mux:0/i2c:48 V12 222 mA
ltc out: 1391
26533325 nV
ltc2991 mux:24/i2c:48 V12 265 mA
ltc out: 1902
36280650 nV
ltc2991 mux:24/i2c:48 V34 362 mA
ltc out: 1348
25713100 nV
ltc2991 mux:24/i2c:48 V56 257 mA
ltc out: 1132
21592900 nV
ltc2991 mux:24/i2c:48 V78 863 mA
I2C dev read error, I2C address: 0250
READ ERROR read_sensor_fpga_temp_mux mux channel# 0x83
I2C dev read error, I2C address: 0250
READ ERROR read_sensor_fpga_temp_mux mux channel# 0x85
ltc out: 1081
20620075 nV
ltc2991 mux:23/i2c:48 V12 206 mA
ltc out: 1189
22680175 nV
ltc2991 mux:23/i2c:48 V34 226 mA
ltc out: 1275
24320625 nV
ltc2991 mux:23/i2c:48 V56 838 mA
ltc out: 1322
25217150 nV
ltc2991 mux:23/i2c:48 V78 869 mA
ltc out: 1084
20677300 nV
ltc2991 mux:23/i2c:49 V12 1442 mA
ltc out: 9862
188117650 nV
ltc2991 mux:23/i2c:49 V34 6486 mA
ltc out: 3810
72675750 nV
ltc2991 mux:25/i2c:49 V12 2506 mA
ltc out: 1128
21516600 nV
ltc2991 mux:25/i2c:49 V78 1501 mA
ltc out: 1133
21611975 nV
ltc2991 mux:0/i2c:48 V12 222 mA
ltc out: 1497
28555275 nV
ltc2991 mux:24/i2c:48 V12 285 mA
ltc out: 2064
39370800 nV
ltc2991 mux:24/i2c:48 V34 393 mA
ltc out: 1620
30901500 nV
ltc2991 mux:24/i2c:48 V56 309 mA
ltc out: 1141
21764575 nV
ltc2991 mux:24/i2c:48 V78 870 mA

haass · May 8, 2026, 12:18pm

Hello Petr,

the file appears to be truncated, at least I can’t see the error message you mentioned. Can you please send the file directly by e-mail to us?

cheers,

Stefan

behrens · May 8, 2026, 12:25pm

Hi Stefan,

I don’t know what the Xilinx virtual cable server is, if it’s not part of the standard setup we are not using it.
Serial-over-LAN is connected to the console of the RTM Zynq.
This is our repository: https://gitlab.cern.ch/cms-daq/ipmc I guess you don’t have access and the rights are set at an upper level. We can provide you with the directory we are uploading for compilation.
The serial debug output is right now full of sensor readout debug messages; we are planning to reduce this.
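
Until the debug prints are reduced, a captured console log could be filtered offline so that fault messages are easier to spot. A minimal Python sketch, assuming the sensor lines look like the ones pasted earlier in this thread (`ltc2991 ...`, `ltc out: ...`, `<number> nV`); the helper name `strip_sensor_noise` is hypothetical and the patterns are not an exhaustive list:

```python
import re

# Patterns for the periodic sensor readout lines seen in the pasted log.
# Assumption: these three shapes cover the readout noise; extend as needed.
SENSOR_NOISE = (
    re.compile(r"^ltc2991 mux:\d+/i2c:\d+ V\d+ \d+ mA$"),
    re.compile(r"^ltc out: \d+$"),
    re.compile(r"^\d+ nV$"),
)


def strip_sensor_noise(lines):
    """Return the non-empty log lines that match no sensor readout pattern."""
    kept = []
    for line in lines:
        text = line.strip()
        if text and not any(p.match(text) for p in SENSOR_NOISE):
            kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "ltc out: 1128",
        "21516600 nV",
        "ltc2991 mux:25/i2c:49 V78 1501 mA",
        "I2C dev read error, I2C address: 0250",
        "UNHANDLED EXCEPTION: HARD FAULT",
    ]
    for line in strip_sensor_noise(sample):
        print(line)
```

Error lines such as the I2C read errors are deliberately kept, since they do not match any of the readout patterns.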

cheers

Ulf

spiwoks · May 8, 2026, 12:35pm

Hi Ulf,

Depending on the frequency, printing out the full sensor readout debug messages can slow down the IPMC considerably and deprive it of the ability to look after other, potentially essential, tasks …

Cheers,
Ralf

pzejdl · May 8, 2026, 12:49pm

Hi Stefan,

Uff. That Discourse system is a disaster. It already refused the attachment when I sent it by email. And when I pasted the content directly into the email, it removed the most important part.

Sending it again with the email addresses explicitly mentioned. Please see the attached file.

Best regards,
Petr

haass · May 8, 2026, 1:07pm

Hello Petr,

thanks, this time I got the full file. I agree about Discourse being hot garbage; however, this is apparently the solution that CERN IT advocates.

There appears to be nothing obviously related to the crash in the log; however, I have a few observations:

After the reboot, I do not see the message about the IPMC receiving its IP address. Is your DHCP server running? There should be a message similar to this at some point:

<_>: LAN iface 0: IP = x.x.x.x

Ralf is correct: if you are continuously printing debug output from the sensor drivers, this could potentially slow down the main loop so much that it eventually crashes. Once the sensor drivers are tested and working, the debug output should be disabled.
I have never seen the following error message before:

: Deprecated non-NULL member ‘first’ of group structure detected.

This could indicate some problem in your project.

BTW, do you use LAN control sets to protect the IPMC and the SoC on your ATCA cards?

cheers,

Stefan

pzejdl · May 8, 2026, 2:49pm

Hi Stefan,

Thanks for the initial check. Please see my reply inline:

I see. I can see this line in the log:

The worrisome part is that it is 0.0.0.0. That being said, on GPN there is always a DHCP server assigning some IP address, either the default CERN IT server or a custom one. I can check with the CMS Tracker group whether they are running their own DHCP server, if necessary.

However, how can that cause “UNHANDLED EXCEPTION: HARD FAULT”? The only explanation I can think of is that something goes wrong in the lightweight TCP/IP stack used by the IPMC software, or somewhere around it. Looking at the register dump, the PC register is zero, so this appears to be a classic jump to an invalid address. If I am not mistaken, the LR register should contain the return address from the current function. Can we check where it points in the source code? I guess we would need the firmware binary from Ulf for that?

The problem does not appear in our lab; there the IPMC works correctly with the same level of debug output. If the main loop is slow, can it crash exactly with “UNHANDLED EXCEPTION: HARD FAULT”?

OK, this is probably something for Ulf to investigate. However, since this message is also present in logs where the IPMC works correctly in our lab, I currently consider it unrelated to the problem we are debugging.

I am not entirely sure what is meant here. However, there are currently no explicit QoS or ACL protections configured for the network.

Best regards,
Petr

haass · May 8, 2026, 3:33pm

Hello Petr,

Yes, this is expected and is printed immediately at boot. However, the IP address assignment via DHCP comes later, after a few seconds.

Well, I don’t know, we never saw this error before, so I guess it could be anything …

I’m guessing in this case it gets assigned a proper IP address? Again, I don’t know, but it can certainly create problems. In any case it’s better to remove the debug printout if you don’t need it anymore.

Maybe, but it could also be a combination of causes. In any case, it would be good to fix this.

pzejdl · May 11, 2026, 4:08pm

Hi Stefan,

I’m trying to understand the root cause of the problem so that we can prevent it from happening in the future. At the moment, we do not know which packet is triggering the fault or where it originates, so it is unclear what exactly we need to protect the IPMC against. My suspicion is broadcast traffic, as this is the only type of traffic that can easily reach the IPMC. However, broadcast traffic remains within the local subnet, so configuring ACLs would not help against it.
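
To check the broadcast suspicion against a packet capture, the frames could first be sorted by destination MAC address. A small Python sketch, assuming the destination MACs have already been exported from the capture as strings (the helper name `classify_dest_mac` is hypothetical). It relies on two standard Ethernet facts: ff:ff:ff:ff:ff:ff is the broadcast address, and any other address with the least significant bit of the first octet set is a group (multicast) address.

```python
from collections import Counter


def classify_dest_mac(mac):
    """Classify an Ethernet destination MAC as broadcast, multicast or unicast."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"not a MAC address: {mac!r}")
    if all(o == 0xFF for o in octets):
        return "broadcast"
    # The I/G bit (least significant bit of the first octet) marks group addresses.
    if octets[0] & 0x01:
        return "multicast"
    return "unicast"


def summarise(dest_macs):
    """Count how many frames fall into each destination class."""
    return Counter(classify_dest_mac(mac) for mac in dest_macs)
```

Comparing such a summary between a capture from B186 and one from B40 might show whether the two networks differ in the kind of traffic that reaches the IPMC.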

It would be very helpful if we could try to identify the function in which the fault occurs. This might indicate whether the issue is related to broadcast traffic, DHCP, ARP, or something entirely different.

From the register dump, the PC register is zero. However, the LR register should contain the return address of the current function. Can we check where this address points in the source code?
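
As a first step, the LR value from the dump can be sanity-checked before looking anything up. The sketch below is in Python and assumes a Cortex-M class core, which the “HARD FAULT” wording suggests but which is not confirmed in this thread. On such cores, an LR whose top four bits are all ones is an EXC_RETURN code rather than a code address, and real return addresses carry the Thumb state bit in bit 0. Whether the dump prints the live LR or the stacked one is also an assumption to verify. The helper name `describe_lr` is hypothetical.

```python
def describe_lr(lr):
    """Interpret an LR value from a fault register dump (Cortex-M assumed)."""
    lr &= 0xFFFFFFFF
    # Bits [31:28] all set marks an EXC_RETURN code, not a code address.
    if (lr >> 28) == 0xF:
        return {"kind": "exc_return", "address": None}
    # Code addresses carry the Thumb state bit in bit 0; clear it before lookup.
    return {"kind": "return_address", "address": lr & ~1}


if __name__ == "__main__":
    # Made-up example value, not taken from the actual register dump.
    info = describe_lr(0x00012345)
    if info["address"] is not None:
        print(f"look up 0x{info['address']:08x} in the firmware ELF")
```

If the result is a return address and an ELF file with symbols from the same build is available, the address could then be resolved with the `addr2line` tool of the matching toolchain (`-e` for the ELF file, `-f` to print the function name).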

A summary of what we know:

The IPMC works fine when it is not connected to the network, or when it is connected to the GPN network in B40. In that case, it obtains an IP address through the DHCP Client ID mechanism from a local DHCP server.
When connected to the network in B186, the IPMC hangs with the message “UNHANDLED EXCEPTION: HARD FAULT” and is then reset by the watchdog.
It continues to hang and reset repeatedly.
According to the logs, it never obtains an IP address.

Best regards,
Petr

haass · May 11, 2026, 4:46pm

Hello Petr,

I believe the problem could be related to the fact that the IPMC is not assigned an IP address. This is clearly a mistake, so you should start by fixing that. Either by registering the IPMC in LanDB or by assigning a fixed IP address in the firmware. You may also be able to test this temporarily by assigning the IP address of the IPMC via the shelf manager.

Can you also please let us know the network architecture, e.g. is the IPMC connected to the network directly or via an on-board switch? If the latter, what other devices are on the on-board switch, etc.?

cheers,

Stefan

haass · May 11, 2026, 4:49pm

Hello Ulf,

indeed, this link does not work, I get the following:

404: Page not found

Can you please grant us read access to the repository?

cheers,

Stefan

spiwoks · May 12, 2026, 11:07am

Hi Petr,

I can identify with your wish to understand the root cause of the problem. But the IPMC runs a bare-metal program, and the exceptions in question are ARM hardware exceptions, not C++ exceptions. The cause of the exception could be anywhere in the code, and you would probably not easily find it by just reading the code. I would guess you are much better off solving the problem of the IP address not being assigned, reducing the user code debug messages, and potentially adding some debug messages for the network if that is what you expect to be problematic.

Cheers,
Ralf

spiwoks · July 1, 2026, 7:04am

Hi Petr,

Any news on your observations of IPMC crashes? Did you manage to see useful output from the IPMC console, or to connect a network sniffer?

Cheers,
Ralf

pzejdl · July 24, 2026, 3:51pm

Hi Ralf,

I’m very sorry, I only just saw your message on the forum. I didn’t receive any email notification. I hope I will receive one when you reply to this message.

In short, we continued our testing and managed to reproduce the issue using the barebone firmware (without any customizations) and on the IPMC Development Board, i.e. completely outside the ATCA infrastructure.

This confirms that the issue is present in the base software and should therefore be followed up with the software vendor. At the moment, it represents a potential denial-of-service vulnerability.

Please find the detailed description and logs here: https://gitlab.cern.ch/hardware/phase2/ipmc_tests .

Best regards,
Petr

spiwoks · July 27, 2026, 8:12am

Hi Petr,

Thanks for your reply and for the detailed logs. However, I am still not quite sure what is really going on. What is your network setup? And what are the first messages that you see as soon as you connect the IPMC to the network?

From the logs I can see that the IPMC uses IP address 0.0.0.0, which in my understanding means that it will listen to ANY message passing on the network! I can understand that this will completely overwhelm it.
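
Whether and when the address changes from 0.0.0.0 can be checked mechanically in a captured console log. A minimal Python sketch, assuming the log lines follow the `LAN iface 0: IP = x.x.x.x` form quoted earlier in this thread; `lan_addresses` is a hypothetical helper name:

```python
import ipaddress
import re

# Assumed line shape, taken from the example message quoted in this thread.
LAN_IP_LINE = re.compile(r"LAN iface (\d+): IP = (\d{1,3}(?:\.\d{1,3}){3})")


def lan_addresses(log_lines):
    """Yield (iface, address, is_unspecified) for each LAN IP line found."""
    for line in log_lines:
        match = LAN_IP_LINE.search(line)
        if not match:
            continue
        try:
            addr = ipaddress.ip_address(match.group(2))
        except ValueError:
            # Skip lines whose address field is not a valid IPv4 address.
            continue
        yield int(match.group(1)), str(addr), addr.is_unspecified


if __name__ == "__main__":
    sample = [
        "<_>: LAN iface 0: IP = 0.0.0.0",
        "ltc out: 1128",
    ]
    for iface, addr, unspecified in lan_addresses(sample):
        state = "no address assigned" if unspecified else "address assigned"
        print(f"iface {iface}: {addr} ({state})")
```

A log in which every entry is flagged as unspecified would be consistent with the IPMC never completing DHCP before the fault.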

I am further wondering about the “IPMB-A error: IPMC 20: not ready” (same for IPMB-B) messages. IPMB are two separate networks, and should be up and working. Of course, the CPU is shared with the control of the IP network - so, when the IP network swamps the CPU with messages, the IPMB networks could be starving. In that case it could be a consequence of the IP network problem, not another problem in itself …

The IPMC software/firmware is developed for an embedded system and works as a tight loop acting on messages (from all networks and from internal events). Its resources are limited. So, a correct IP network setup is essential.

Cheers,
Ralf