We’ve been having problems programming our MMC through the IPMC via the shelf manager. The issue was raised by the company engineers who work with us. Below is their description of what they are experiencing.
I'm providing details on a failure we have observed while using the
CERN IPMC. We were trying to use the /sdr fill file/ command of
ipmitool to remotely update the SDRs of an MMC connected to the IPMC
via the IPMB-L bus.
We were able to apply the SDR file using a single-bridge command:
ipmitool -H 192.168.0.200 -P "" -t 0x72 -b 7 sdr fill file mezz_thermal_1_revc_3.sdr
However, we failed when we tried a double-bridge request through the ShMC:
ipmitool -H 192.168.88.251 -P "" -T 0x8C -t 0x72 -B 0 -b 7 sdr fill file mezz_thermal_1_revc_3.sdr
The output of ipmitool for the double bridge commands shows that
the partial add requests fail:
...................................
ipmitool: partial add failed
Cannot add SDR ID 0x0001 to repository...
ipmitool: partial add failed
Cannot add SDR ID 0x0002 to repository...
ipmitool: partial add failed
Cannot add SDR ID 0x0003 to repository...
ipmitool: partial add failed
Cannot add SDR ID 0x0004 to repository...
......................
By monitoring the IPMI traffic on the MMC side we have noticed that
only the first bytes of each sdr are sent to the MMC:
................................................
Reserve SDR Repository:
R-10-1-72 28 66 20 A8 22 16
T- 0-1-20 2C B4 72 A8 22 00 97 00 2D
Partial ADD SDR
R-11-1-72 28 66 20 AC 25 97 00 00 00 00 00 02 00 51 01 3A EA
T- 0-1-20 2C B4 72 AC 25 00 00 00 BD
Get Sensor Reading:
R-12-1-72 10 7E 20 B8 2D 78 83
T- 0-1-20 14 CC 72 B8 2D CB DE
Get Sensor Reading:
R-13-1-72 10 7E 20 BC 2D 78 7F
T- 0-1-20 14 CC 72 BC 2D CB DA
............................................
The MMC acknowledges the first bytes, but the IPMC never forwards
the subsequent bridged messages it gets from ipmitool via the ShMC. On the
other side, ipmitool times out and moves on to the next SDR. The
pattern repeats, with the MMC receiving only a Reserve
SDR Repository request and the first bytes of each SDR, until the file has
been read completely.
Another thing that can be noticed during this update is that,
besides forwarding the commands from ipmitool, the IPMC
continues to poll the sensors of the MMC. As the SDRs of the MMC were
erased at the beginning of the SDR update process, there is nothing to
be polled until the update is finished. This is why every Get Sensor
Reading request receives 0xCB (Sensor Not Present) until the SDR
file has been received completely.
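For readers following the traces, the generic completion codes that show up in this thread can be looked up with a small table. This is only a sketch: the dictionary holds a subset of the generic codes defined by the IPMI specification, and `describe_completion_code` is an illustrative helper name.

```python
# Subset of the generic IPMI completion codes.
COMPLETION_CODES = {
    0x00: "Command completed normally",
    0xC0: "Node busy",
    0xC1: "Invalid command",
    0xC3: "Timeout while processing command",
    0xC6: "Request data truncated",
    0xC7: "Request data length invalid",
    0xC9: "Parameter out of range",
    0xCB: "Requested sensor, data, or record not present",
    0xCC: "Invalid data field in request",
    0xFF: "Unspecified error",
}


def describe_completion_code(code: int) -> str:
    """Return a readable description for a completion code byte."""
    if code in COMPLETION_CODES:
        return COMPLETION_CODES[code]
    if 0x80 <= code <= 0xBE:
        return "Command-specific code (see the command's own definition)"
    if 0x01 <= code <= 0x7E:
        return "Device-specific (OEM) code"
    return "Code not in this table"


if __name__ == "__main__":
    for code in (0x00, 0xCB, 0xC6):
        print(f"0x{code:02X}: {describe_completion_code(code)}")
```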
In case of updating the sdr using a bridge command and having
ipmitool communicate directly to the IPMC instead of routing through the
ShMC as a double bridge, the process is a lot faster and all the
required IPMI requests fit between the sensor polling requests. In this
case there are no Get Sensor Reading requests that get the 0xCB
completion code between the requests of the sdr update process. Maybe
this is the reason the process is successful for bridge commands and not
for double bridge, or maybe it is just a coincidence.
Sorry for the late reply, I missed your question on Discourse. I tried to understand your setup, and if I understand correctly, you suspect an issue with forwarding the MMC response to the ShMM. Is that right?
Therefore, I would need additional details:
Which MMC?
Which version of the IPMC software?
What is the intent of the SDR fill file command?
If you send a get device id command to the MMC, does it work?
Thanks for getting back to us. I will try to answer your questions below:
Which MMC?
It is an MMC prepared by SAMWAY, the company that designed the mezzanines for the NSW Trigger Processor project. We have two of them on our blade.
Which version of the IPMC software?
I am not sure about the versioning of the IPMC firmware. How do I see that? I was not able to find any tag in the ipmc-dev repository. Our repository is not up-to-date with the main repository since they diverged when I was instructed to clone it. There was no change in this part of the code, I must say.
What is the intent of the SDR fill file command?
The SDR fill command updates the sensor information in the SDR repository. But our main interest is in the hpm upgrade command bridged to the MMC through the IPMC.
If you send a get device id command to the MMC, does it work?
I am not sure how to send a Get Device ID command specifically, but I am able to perform other bridged commands, like asking for the SDR directly: ipmitool -H $SHMIP -P "" -T $IPMB -t 0x72 -B 0 -b 7 sdr
This tells me that the message exchange between the IPMC and the MMC is not the problem in a generic way. The problem seems to happen specifically with the hpm upgrade command.
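For reference, Get Device ID is NetFn App (0x06), command 0x01, so it can be sent through ipmitool's raw subcommand with the same bridging options as above (ipmitool's mc info subcommand issues the same request). The sketch below only assembles the command line; `bridged_raw_command` is an illustrative helper, and the host and addresses in the usage lines are the ones from the failing double-bridge example earlier in this thread.

```python
import shlex

# NetFn App (0x06), command Get Device ID (0x01).
GET_DEVICE_ID = ("0x06", "0x01")


def bridged_raw_command(host, target, channel,
                        transit=None, transit_channel=None,
                        raw=GET_DEVICE_ID):
    """Build an ipmitool argv for a raw request, optionally double-bridged.

    -t/-b select the final target address and channel; -T/-B select the
    transit address and channel used for the first hop of a double bridge.
    """
    argv = ["ipmitool", "-H", host, "-P", ""]
    if transit is not None:
        argv += ["-T", f"0x{transit:02X}", "-B", str(transit_channel)]
    argv += ["-t", f"0x{target:02X}", "-b", str(channel), "raw", *raw]
    return argv


if __name__ == "__main__":
    argv = bridged_raw_command("192.168.88.251", target=0x72, channel=7,
                               transit=0x8C, transit_channel=0)
    print(shlex.join(argv))
    # To actually send it, pass argv to subprocess.run(argv, check=True).
```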
I have debugged the failure and it looks like it is caused by the
IPMC software. I have tried updating the firmware in a double-bridge
scenario (ipmitool -> shmc -> ipmc -> mmc_mezzanine) and also in a
single-bridge scenario (ipmitool -> ipmc -> mmc_mezzanine). As both types of
requests failed and I was able to successfully perform an update using a
different IPMC, the conclusion would be that there is an issue in the
CERN IPMC firmware.
For the double-bridge requests, the syntax I used was:
.\ipmitool.exe -H 192.168.1.103 -P "" -T 0x82 -t 0x76 -B 0 -b 7 hpm upgrade .\MMC_Mezanine_Rev1.6_b33.hpm -vvv
For the single-bridge requests, the syntax I used was:
.\ipmitool.exe -H 192.168.0.200 -P "" -t 0x76 -b 7 hpm upgrade .\MMC_Mezanine_Rev1.6_b33.hpm -vvv
Looking at the ipmitool output in -vvv mode, I was able to
see the actual messages that are sent by the software. On that side,
everything looks OK (I have attached a log of the communication). In the
double-bridge attempt, the last message that goes through as part of the
hpm upgrade is:
send_packet (70 bytes)
06 00 ff 07 02 28 06 00 00 0a 00 00 00 68 fc 3e
37 13 94 84 a5 76 7a 8e b4 47 61 fc 49 28 20 18
c8 81 a8 34 40 8c 18 5c 10 a8 34 47 72 b0 de 10
a8 32 00 14 ea 44 30 e0 0a 4c ea e4 a0 50 71 92
12 38 62 01 cd 63
This is a double bridge message going through the ShMC and the IPMC and
getting to the MMC, where:
ShMC: 20 18 c8 81 a8 34 40
IPMC: 8c 18 5c 10 a8 34 47
MMC: 72 b0 de 10 a8 32 00 14 ea 44 30 e0 0a 4c ea e4 a0 50 71 92 12 38 62 01
The response of the MMC is 0xc6: Data truncated. That means that
the final message (the MMC part) did not reach the MMC correctly. Even
though the message that reached the MMC is a valid IPMI message
(the checksum is correct, otherwise the response code would have
been different), its length is not 24 bytes. To
return 0xc6, the MMC has to receive a request with a length between 7
and 10 bytes.
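The layering described above can be reproduced offline from the captured packet. The following is a sketch that assumes IPMI v1.5 LAN framing (4-byte RMCP header, then authentication type, session sequence number, session ID, a 16-byte authentication code when the authentication type is not "none", and a one-byte payload length); `ipmi_payload` and `peel` are illustrative names. It strips the session wrapper and then unwraps each nested Send Message (NetFn App 0x06, command 0x34) request, checking both checksums at every level.

```python
PACKET = bytes.fromhex("""
    06 00 ff 07 02 28 06 00 00 0a 00 00 00 68 fc 3e
    37 13 94 84 a5 76 7a 8e b4 47 61 fc 49 28 20 18
    c8 81 a8 34 40 8c 18 5c 10 a8 34 47 72 b0 de 10
    a8 32 00 14 ea 44 30 e0 0a 4c ea e4 a0 50 71 92
    12 38 62 01 cd 63
""")

SEND_MESSAGE = (0x06, 0x34)  # NetFn App, command Send Message


def checksum(data: bytes) -> int:
    """Byte that makes the covered bytes sum to 0 mod 256."""
    return (-sum(data)) & 0xFF


def ipmi_payload(packet: bytes) -> bytes:
    """Strip the RMCP header and the (assumed) IPMI v1.5 session header."""
    if packet[0] != 0x06 or packet[3] != 0x07:
        raise ValueError("not an RMCP packet of class IPMI")
    auth_type = packet[4]
    # auth type (1) + session sequence (4) + session ID (4) + optional auth code
    offset = 4 + 1 + 4 + 4 + (16 if auth_type != 0 else 0)
    length = packet[offset]
    return packet[offset + 1:offset + 1 + length]


def peel(message: bytes) -> list:
    """Return every nested message, outermost first."""
    layers = []
    while True:
        if checksum(message[:2]) != message[2]:
            raise ValueError("bad header checksum")
        if checksum(message[3:-1]) != message[-1]:
            raise ValueError("bad data checksum")
        layers.append(message)
        if (message[1] >> 2, message[5]) != SEND_MESSAGE:
            return layers
        # Byte 6 is the channel byte; the embedded message sits between it
        # and the enclosing message's trailing checksum.
        message = message[7:-1]


if __name__ == "__main__":
    for depth, layer in enumerate(peel(ipmi_payload(PACKET))):
        print(f"layer {depth}: dest=0x{layer[0]:02x} netFn=0x{layer[1] >> 2:02x} "
              f"cmd=0x{layer[5]:02x} length={len(layer)}")
```

On this capture the three layers are 40, 32 and 24 bytes long and all checksums verify, which matches the statement that the packet is well formed when it leaves ipmitool. The innermost request (NetFn 0x2C, command 0x32, the HPM.1 Upload Firmware Block command) is the 24-byte message the MMC should have received.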
Sorry for the late reply. Julian is currently on holiday. He will be back on Monday, November 1st.
I am not at all sure that I understand what you are trying to do and what error you are observing. But I do remember that when I was developing software for the serial interface using the bridging capability in a “Send Request” command, there was a problem with the checksum of the response, which was calculated wrongly. That was with version 1.2, and it was entirely fixed in version 1.3 - and there were many other fixes in 1.3. Now, what I saw and what you are observing might not be the same problem. But I would strongly advise you to check which version you are using and, if it is not 1.3, to update to it.
This is probably not the answer you expected, but I hope it could help. On the other hand, if you are still using 1.2, it probably wouldn’t make sense to try to fix the problem anyway, because it might already have been fixed in 1.3.
Thanks for getting back to us. I was expecting this suggestion. That’s why I was so reluctant in the past to fork the repository when I was instructed to. A quick look at the situation now:
git ls-files | wc -l tells me that I have 633 files in my repository
git diff --name-only HEAD..upstream/master | wc -l tells me that 602 files differ between my fork and ipmc-dev.
I certainly did not change so many files. What happened to the upstream repository? Was it fully replaced?
I am having trouble getting started with the current IPMC code. I tried many different approaches without luck. In the end, I made the simplest test I could think of, but even that seems to be crashing for me.
Here is what I did:
got a fresh clone of the ipmc-project repository
removed the file ipmc-user/user_mainfile.c
edited the config.xml file to be similar (very basic changes, I left all the details and sensors out, please see attached) to what I will need
executed compile.py which performed the compilation without errors and returned to me some files, including hpm1all.img that I use to program the IPMC
programmed our IPMC with that file and activated it
the IPMC immediately reverted to the firmware that was previously in use, which suggests to me that the new code is crashing (a segfault or something similar).
I am probably missing something obvious. I would appreciate some feedback.
Thanks
PS: Well, it seems I cannot attach anything here. I will just add the config.xml file below.
<?xml version="1.0" encoding="UTF-8"?>
<IPMC>
<GeneralConfig>
<DeviceID>0x12</DeviceID>
<DeviceRevision>0x00</DeviceRevision>
<ManufacturerID>0x000060</ManufacturerID>
<ProductID>0x1236</ProductID>
<ManufacturingDate>06/01/2017</ManufacturingDate>
<BoardManuf>Cirly/Addax</BoardManuf>
<BoardName>TEST_FRUFROLLBACK</BoardName>
<BoardSN>00001</BoardSN>
<BoardPN>P580050995</BoardPN>
<ProductManuf>CERN</ProductManuf>
<ProductName>IPMC-TestPAD</ProductName>
<ProductPN>PN00001</ProductPN>
<ProductSN>0000001</ProductSN>
<ProductVersion type="major">1</ProductVersion>
<ProductVersion type="minor">20</ProductVersion>
<MaxCurrent>30.0</MaxCurrent>
<MaxInternalCurrent>2.0</MaxInternalCurrent>
<!-- Hardware -->
<HandleSwitch active="LOW" inactive="HIGH" />
<!-- <ResetOnWrongHAEn /> -->
<!-- <PowerMonitoringEn /> -->
<!-- <AlertMonitoringEn />-->
<!-- Shutdown timeout in tens of ms (optional - if not defined: 10s) -->
<shutdownTimeout>0</shutdownTimeout>
<nonVolatileParams forced="false" />
</GeneralConfig>
<SerialInterfaces>
<!--
This part allows connecting the UART port to interfaces.
The ports 0 to 2 are linked to the hardware:
port 0: Edge connector (Tx: 57 / Rx: 60)
port 1: Edge connector (Tx: 58 / Rx: 61)
port 2: Optional UART (Tx: 75 / Rx: 76)
Warning: Enabling port 2 will automatically set the GPIOs in UART mode!
For each port, one of the following names can be used:
"SOL": Serial Over Lan
"SDI": Serial Debug Interface
"PI": Payload Interface
The baudrate can be set using the baudrate param. By default,
it is configured to 115200b/s.
-->
<Connect port="0" name="SDI" baudrate="115200"/>
<!-- <Connect port="1" name="PI" baudrate="115200"/> -->
<!-- <Connect port="2" name="SOL" baudrate="115200" extended="true" /> -->
<RedirectSDItoSOL/>
</SerialInterfaces>
<PowerManagement>
<PowerONSeq>
<step>PSQ_ENABLE_SIGNAL(CFG_PAYLOAD_DCDC_EN_SIGNAL)</step>
<step>PSQ_END</step>
</PowerONSeq>
<PowerOFFSeq>
<step>PSQ_DISABLE_SIGNAL(CFG_PAYLOAD_DCDC_EN_SIGNAL)</step>
<step>PSQ_END</step>
</PowerOFFSeq>
</PowerManagement>
<LANConfig>
<MACAddr>0A:0A:0A:0A:0A:86</MACAddr>
<NetMask>255.255.255.0</NetMask>
<GatewayIP>192.138.1.3</GatewayIP>
<UseFlashedMAC />
<EnableDHCP />
<IPAddrList> <!-- Default IP Addresses (used if DHCP is not active) -->
<IPAddr slot_addr="default">192.168.1.34</IPAddr>
<IPAddr slot_addr="0x41">192.168.1.20</IPAddr>
<IPAddr slot_addr="0x42">192.168.1.21</IPAddr>
<IPAddr slot_addr="0x43">192.168.1.22</IPAddr>
<IPAddr slot_addr="0x44">192.168.1.23</IPAddr>
<IPAddr slot_addr="0x45">192.168.1.24</IPAddr>
<IPAddr slot_addr="0x46">192.168.1.25</IPAddr>
<IPAddr slot_addr="0x47">192.168.1.26</IPAddr>
<IPAddr slot_addr="0x48">192.168.1.27</IPAddr>
<IPAddr slot_addr="0x49">192.168.1.28</IPAddr>
<IPAddr slot_addr="0x4a">192.168.1.29</IPAddr>
<IPAddr slot_addr="0x4b">192.168.1.30</IPAddr>
<IPAddr slot_addr="0x4c">192.168.1.31</IPAddr>
<IPAddr slot_addr="0x4d">192.168.1.32</IPAddr>
<IPAddr slot_addr="0x4e">192.168.1.33</IPAddr>
<IPAddr slot_addr="0x4f">192.168.1.34</IPAddr>
<IPAddr slot_addr="0x50">192.168.1.35</IPAddr>
</IPAddrList>
</LANConfig>
<AMCSlots>
<AMC site="1">
<PhysicalPort>1</PhysicalPort>
<MaxCurrent>6.0</MaxCurrent>
<PowerGoodTimeout>300</PowerGoodTimeout>
<DCDCEfficiency>85</DCDCEfficiency>
</AMC>
<AMC site="2">
<PhysicalPort>2</PhysicalPort>
<MaxCurrent>6.0</MaxCurrent>
<PowerGoodTimeout>300</PowerGoodTimeout>
<DCDCEfficiency>85</DCDCEfficiency>
</AMC>
<AMC site="3">
<PhysicalPort>3</PhysicalPort>
<MaxCurrent>6.0</MaxCurrent>
<PowerGoodTimeout>300</PowerGoodTimeout>
<DCDCEfficiency>85</DCDCEfficiency>
</AMC>
</AMCSlots>
<SensorList>
<Sensors type="raw" global_define="CFG_SENSOR_MCP9801" function_name="SENSOR_MCP9801" rawType="MCP9801">
<Sensor>
<Name>Internal temp.</Name>
<Type>Temperature</Type>
<Units>degrees C</Units>
<NominalReading>25</NominalReading>
<NormalMaximum>60</NormalMaximum>
<NormalMinimum>-10</NormalMinimum>
<Point id="0" x="0" y="0" />
<Point id="1" x="5" y="5" />
<Thresholds>
<UpperNonRecovery>80</UpperNonRecovery>
<UpperCritical>60</UpperCritical>
<UpperNonCritical>40</UpperNonCritical>
<LowerNonRecovery>-20</LowerNonRecovery>
<LowerCritical>-10</LowerCritical>
<LowerNonCritical>0</LowerNonCritical>
</Thresholds>
<Params>
<p type="record_id"></p> <!-- mandatory -->
<p type="user">0x090</p>
<p type="user">UCGH | LCGL</p>
</Params>
<AssertEvMask>0x0A80</AssertEvMask>
<DeassertEvMask>0x7A80</DeassertEvMask>
<DiscreteRdMask>0x3838</DiscreteRdMask>
<AnalogDataFmt>2S_COMPL</AnalogDataFmt>
<PosHysteresis>2</PosHysteresis>
<NegHysteresis>2</NegHysteresis>
<MaxReading>127</MaxReading>
<MinReading>-128</MinReading>
</Sensor>
</Sensors>
<!-- Example for GPIO sensors:
<Sensors type="raw" global_define="CFG_SENSOR_GPIO " function_name="SENSOR_GPIO" rawType="GPIOSENSOR">
<Sensor>
<Name>GPIOSens Ex.</Name>
<Type>Processor</Type>
<Params>
<p type="record_id"></p>
<p type="user">0x1</p>
<p type="user">POWER_GOOD_12V</p>
</Params>
<DiscreteRdMask>0x000F</DiscreteRdMask>
</Sensor>
</Sensors>
-->
<!-- Example for payload sensors:
<Sensors type="raw" global_define="CFG_SENSOR_PAYLOAD_THRESHOLD" function_name="SENSOR_PAYLOAD_THRESHOLD" rawType="PAYLOADSENSOR_THRESH">
<Sensor>
<Name>PayloadSens Ex.</Name>
<Type>Temperature</Type>
<Units>degrees C</Units>
<NominalReading>25</NominalReading>
<NormalMaximum>60</NormalMaximum>
<NormalMinimum>-10</NormalMinimum>
<Point id="0" x="0" y="0" />
<Point id="1" x="5" y="5" />
<Thresholds>
<UpperNonRecovery>80</UpperNonRecovery>
<UpperCritical>60</UpperCritical>
<UpperNonCritical>40</UpperNonCritical>
<LowerNonRecovery>-20</LowerNonRecovery>
<LowerCritical>-10</LowerCritical>
<LowerNonCritical>0</LowerNonCritical>
</Thresholds>
<Params>
<p type="record_id"></p>
</Params>
<AssertEvMask>0x0A80</AssertEvMask>
<DeassertEvMask>0x7A80</DeassertEvMask>
<DiscreteRdMask>0x3838</DiscreteRdMask>
<AnalogDataFmt>2S_COMPL</AnalogDataFmt>
<PosHysteresis>2</PosHysteresis>
<NegHysteresis>2</NegHysteresis>
<MaxReading>127</MaxReading>
<MinReading>-128</MinReading>
</Sensor>
</Sensors>
-->
</SensorList>
</IPMC>
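A file like the one above can be sanity-checked before compiling. The sketch below is only an illustration and is not part of the ipmc-project tooling: `check_lan_config` is a hypothetical helper that reads the `<LANConfig>` section and reports a gateway that lies outside the subnet of the default addresses, as well as slot addresses that share an IP. Whether the firmware cares about either condition is not something this thread establishes, and with `<EnableDHCP />` set the defaults may never be used.

```python
import ipaddress
import xml.etree.ElementTree as ET


def check_lan_config(xml_text):
    """Return a list of warnings about the <LANConfig> section."""
    lan = ET.fromstring(xml_text).find("LANConfig")
    if lan is None:
        return ["no <LANConfig> section found"]

    netmask = lan.findtext("NetMask", "").strip()
    gateway = ipaddress.ip_address(lan.findtext("GatewayIP", "").strip())

    networks = set()
    slots_by_address = {}
    for entry in lan.iter("IPAddr"):
        address = ipaddress.ip_address((entry.text or "").strip())
        networks.add(ipaddress.ip_network(f"{address}/{netmask}", strict=False))
        slot = entry.get("slot_addr")
        if slot != "default":  # the default entry is a fallback, not a slot
            slots_by_address.setdefault(address, []).append(slot)

    warnings = []
    for network in sorted(networks):
        if gateway not in network:
            warnings.append(f"gateway {gateway} is outside {network}")
    for address, slots in slots_by_address.items():
        if len(slots) > 1:
            warnings.append(f"{address} is assigned to slots {', '.join(slots)}")
    return warnings


if __name__ == "__main__":
    with open("config.xml", encoding="utf-8") as handle:
        for warning in check_lan_config(handle.read()):
            print("warning:", warning)
```

Run against the file above, this reports that the gateway 192.138.1.3 is outside 192.168.1.0/24.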
Sorry for the delay, I have just come back from vacation. As Ralf told you, some problems were found on the I2C bus with version 1.2 and were fixed in version 1.3. As the command might send a lot of data over the I2C bus, this could be the source of your issue.
Concerning your problem with going to 1.3, it is really weird, I’ve just tried again to compile and force and it looks ok. However, it could be because of a version checking issue that the rollback is automatically performed. Could you check what is printed on the serial debug interface when you activate the new version?
Thanks for answering promptly after your return. I hope you enjoyed your time off.
Concerning your problem with going to 1.3, it is really weird, I’ve just tried again to compile and force and it looks ok.
Just to confirm, have you used the same config.xml I sent above?
However, it could be because of a version checking issue that the rollback is automatically performed. Could you check what is printed on the serial debug interface when you activate the new version?
I am afraid I cannot get this. The IPMC serial port is connected to a Zynq device on the same blade, and I would only have access to the serial messages after the Zynq itself finishes booting. Can you explain a bit more about this version checking issue?
Also, these IPMCs are an old revision. Could this be the problem? Can you try with a similar one as well?
The XML looks good and you should not have any problem using your binary file. When you move from 1.2 to 1.3, the firmware tries to recover the “non-volatile parameters”. However, the parameters changed between 1.2 and 1.3 and, at the first boot, as the activate function cannot convert all of the parameters, it automatically performs a rollback. The upgrade can be forced by adding the “norollback” instruction when you activate the new firmware.
Connecting to the debug interface gives additional details that would confirm that the rollback is caused by the non-volatile parameter issue. But if you flash with norollback and there really is a problem with the firmware, you could end up needing to flash your IPMC again via JTAG.
Do you have a JTAG cable that you could use, or even a Raspberry Pi, which we now support for resetting the system? If you have a way to flash the module back, you can try adding the “norollback” instruction to the activate command.
I bought the JTAG cable some time ago because I thought we could end up in this kind of situation, but I have never actually used it. Do you have instructions on how to use it in the CERN IPMC context? (A Linux solution, please; I don't have any computer with another operating system around.)
Unfortunately, Pigeon Point only provides a solution for Windows, and I am not aware of an equivalent for Linux. But maybe there is a way to run it on Linux using the STAPL file.
Thanks, Julian. Can you add this somewhere in the documentation? This way people are able to get the correct items from start. I will try to get my hands on a RPi then, but it will stall the debug process for some more days.
That is not what I meant. I think it would be helpful for developers to know what they need in order to recover an IPMC. In my case, I just learned that I need an RPi for that. If I had had this information earlier, I would have gotten one beforehand in case of unexpected results.