Issue upgrading MMC

Hi, all.

We’ve been having problems programming our MMC through the IPMC via the shelf manager. This issue was raised by the engineers at the company we work with. Below you can find their description of what they are experiencing.

    I'm providing details on a failure we have observed while using the CERN IPMC. We were trying to use the ipmitool sdr fill file command to remotely update the SDRs of an MMC connected to the IPMC via the IPMB-L bus.

    We were successful applying the SDR file using a bridged command:

    ipmitool -H 192.168.0.200 -P "" -t 0x72 -b 7 sdr fill file mezz_thermal_1_revc_3.sdr

    However, we failed when we tried to use a double-bridged request through the ShMC:

    ipmitool -H 192.168.88.251 -P "" -T 0x8C -t 0x72 -B 0 -b 7 sdr fill file mezz_thermal_1_revc_3.sdr

    The output of ipmitool for the double bridge commands shows that 
the partial add requests fail:
...................................
ipmitool: partial add failed
Cannot add SDR ID 0x0001 to repository...
ipmitool: partial add failed
Cannot add SDR ID 0x0002 to repository...
ipmitool: partial add failed
Cannot add SDR ID 0x0003 to repository...
ipmitool: partial add failed
Cannot add SDR ID 0x0004 to repository...
......................

    By monitoring the IPMI traffic on the MMC side we have noticed that 
only the first bytes of each sdr are sent to the MMC:
................................................
Reserve SDR Repository:
R-10-1-72 28 66 20 A8 22 16
T- 0-1-20 2C B4 72 A8 22 00 97 00 2D

 Partial ADD SDR
R-11-1-72 28 66 20 AC 25 97 00 00 00 00 00 02 00 51 01 3A EA
T- 0-1-20 2C B4 72 AC 25 00 00 00 BD

 Get Sensor Reading:
R-12-1-72 10 7E 20 B8 2D 78 83
T- 0-1-20 14 CC 72 B8 2D CB DE

 Get Sensor Reading:
R-13-1-72 10 7E 20 BC 2D 78 7F
T- 0-1-20 14 CC 72 BC 2D CB DA
 ............................................
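The frames in the trace above can be checked mechanically. The following is only a minimal sketch, assuming the standard IPMB layout (destination address, NetFn/LUN, header checksum, source address, sequence/LUN, command, data, final checksum); the names FRAMES, checksum and decode are made up for this illustration and are not part of any tool mentioned in this thread.

```python
# Frames copied from the trace above, without the "R-10-1-" / "T- 0-1-" prefixes.
FRAMES = {
    "Reserve SDR Repository request": "72 28 66 20 A8 22 16",
    "Reserve SDR Repository response": "20 2C B4 72 A8 22 00 97 00 2D",
    "Partial Add SDR request": "72 28 66 20 AC 25 97 00 00 00 00 00 02 00 51 01 3A EA",
    "Partial Add SDR response": "20 2C B4 72 AC 25 00 00 00 BD",
    "Get Sensor Reading request": "72 10 7E 20 B8 2D 78 83",
    "Get Sensor Reading response": "20 14 CC 72 B8 2D CB DE",
}


def checksum(data):
    """Two's-complement checksum: the byte that makes the sum 0 modulo 256."""
    return (-sum(data)) & 0xFF


def decode(frame_hex):
    """Split one IPMB frame into its fields and validate both checksums."""
    raw = bytes.fromhex(frame_hex)
    net_fn = raw[1] >> 2
    return {
        "dest": raw[0],
        "net_fn": net_fn,
        "is_response": bool(net_fn & 1),  # odd NetFn values are responses
        "cmd": raw[5],
        "data": raw[6:-1],  # for responses, data[0] is the completion code
        "checksums_ok": checksum(raw[:2]) == raw[2]
        and checksum(raw[3:-1]) == raw[-1],
    }


for name, frame in FRAMES.items():
    fields = decode(frame)
    print(
        f"{name}: NetFn 0x{fields['net_fn']:02X} cmd 0x{fields['cmd']:02X} "
        f"data [{fields['data'].hex(' ')}] checksums_ok={fields['checksums_ok']}"
    )
```

If the field layout is read correctly, the Partial Add SDR request carries only five bytes of record data (02 00 51 01 3A) after the reservation ID, record ID, offset and in-progress bytes. That would be just the SDR header, whose last byte (0x3A) announces 58 more bytes that never appear in the trace.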

    The MMC acknowledges the first bytes, but the IPMC never forwards the subsequent bridged messages it gets from ipmitool via the ShMC. On the other side, ipmitool times out and moves on to the next SDR. The story repeats itself over and over, with the MMC receiving only a Reserve SDR Repository request and the first bytes of each SDR, until the file has been read completely.

    Another thing that can be noticed during this update is that, besides the commands it needs to forward from ipmitool, the IPMC continues to poll the sensors of the MMC. As the SDRs of the MMC were erased at the beginning of the SDR update process, there is nothing to be polled until the update is finished. This is why every Get Sensor Reading request receives 0xCB (Sensor Not Present) until the SDR file has been received completely.

    When updating the SDRs with a single bridged command, with ipmitool communicating directly with the IPMC instead of routing through the ShMC as a double bridge, the process is a lot faster and all the required IPMI requests fit between the sensor polling requests. In this case no Get Sensor Reading request receives the 0xCB completion code in between the requests of the SDR update process. Maybe this is the reason the process succeeds for bridged commands and not for double-bridged ones, or maybe it is just a coincidence.

Have you seen this before?

@smartoiu @mvasile, here is the post I mentioned to you.

Dear Thiago,

Sorry for the late reply, I missed your question on discourse. I tried to understand your setup. If I understand correctly, you suspect an issue with forwarding the MMC response to the ShMM, is that right?

Therefore, I would need additional details:

  • Which MMC?
  • Which version of the IPMC software?
  • What is the intent of the SDR fill file command?
  • If you send a get device id command to the MMC, does it work?

Thanks and regards,
Julian

Hi, Julian.

Thanks for getting back to us. I will try to answer your questions below:

  • Which MMC?

    It is an MMC prepared by SAMWAY, the company that designed the mezzanines for the NSW Trigger Processor project. We have two of them on our blade.

  • Which version of the IPMC software?

    I am not sure about the version of the IPMC firmware. How can I check it? I was not able to find any tag in the ipmc-dev repository. Our repository is not up to date with the main repository, since the two diverged after I was instructed to clone it. I must say, though, that nothing changed in this part of the code.

  • What is the intent of the SDR fill file command?

    The SDR fill command updates the information for the sensors in the SDR. But our main interest is in the hpm upgrade command bridged to the MMC through the IPMC.

  • If you send a get device id command to the MMC, does it work?

    I am not sure how to send a Get Device ID command specifically, but I am able to perform other bridged commands, like asking for the SDR directly:
    ipmitool -H $SHMIP -P "" -T $IPMB -t 0x72 -B 0 -b 7 sdr
    This tells me that the message exchange between the IPMC and the MMC is not a problem in general. The problem seems to happen specifically with the hpm upgrade command.
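For reference, Get Device ID is NetFn App (0x06), command 0x01, so it can be sent through ipmitool's raw subcommand (ipmitool's mc info subcommand issues the same request). The sketch below only assembles the command line for the single-bridge and double-bridge cases, reusing the addresses quoted earlier in this thread; the helper name bridged_command is hypothetical, and nothing is executed.

```python
import shlex

# Get Device ID: NetFn App (0x06), command 0x01, sent via "ipmitool raw".
GET_DEVICE_ID = ["raw", "0x06", "0x01"]


def bridged_command(host, target, channel, subcommand,
                    transit=None, transit_channel=0, password=""):
    """Build an ipmitool argument list for a single- or double-bridged request.

    target / channel                 -> ipmitool -t / -b (final destination, the MMC)
    transit / transit_channel        -> ipmitool -T / -B (intermediate hop, the IPMC)
    """
    argv = ["ipmitool", "-H", host, "-P", password]
    if transit is not None:
        argv += ["-T", f"0x{transit:02X}", "-B", str(transit_channel)]
    argv += ["-t", f"0x{target:02X}", "-b", str(channel)]
    return argv + list(subcommand)


# Single bridge: ipmitool -> IPMC -> MMC (addresses taken from the first post).
single = bridged_command("192.168.0.200", 0x72, 7, GET_DEVICE_ID)

# Double bridge: ipmitool -> ShMC -> IPMC -> MMC.
double = bridged_command("192.168.88.251", 0x72, 7, GET_DEVICE_ID, transit=0x8C)

print(shlex.join(single))
print(shlex.join(double))
```

Comparing the responses of the two printed commands would show whether a short request survives both paths, independently of the longer sdr fill and hpm upgrade transfers.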

Thiago

Hi, Julian.

Adding some more info sent by SAMWAY:

      I have debugged the failure and it looks like it is caused by the IPMC software. I have tried updating the firmware in a double-bridge scenario (ipmitool -> shmc -> ipmc -> mmc_mezzanine) and also in a bridge scenario (ipmitool -> ipmc -> mmc_mezzanine). As both types of requests failed, and I was able to successfully perform an update using a different IPMC, the conclusion would be that there is an issue in the CERN IPMC firmware.

     For the double bridge requests the syntax I used was:

.\ipmitool.exe -H 192.168.1.103 -P "" -T 0x82 -t 0x76 -B 0 -b 7 hpm upgrade .\MMC_Mezanine_Rev1.6_b33.hpm -vvv

    For the bridge requests the syntax I used was:

.\ipmitool.exe -H 192.168.0.200 -P "" -t 0x76 -b 7 hpm upgrade .\MMC_Mezanine_Rev1.6_b33.hpm -vvv

Looking at the output of ipmitool in -vvv mode, I was able to see the actual messages that are sent by the software. On that side, everything looks ok (I have attached a log of the communication). In the double bridge attempt, the last message going through as part of the hpm upgrade is:

send_packet (70 bytes)
  06 00 ff 07 02 28 06 00 00 0a 00 00 00 68 fc 3e
  37 13 94 84 a5 76 7a 8e b4 47 61 fc 49 28 20 18
  c8 81 a8 34 40 8c 18 5c 10 a8 34 47 72 b0 de 10
  a8 32 00 14 ea 44 30 e0 0a 4c ea e4 a0 50 71 92
  12 38 62 01 cd 63

This is a double bridge message going through the ShMC  and the IPMC and 
getting to the MMC, where:

ShMC: 20 18 c8 81 a8 34 40
IPMC: 8c 18 5c 10 a8 34 47 
MMC: 72 b0 de 10 a8 32 00 14 ea 44 30 e0 0a 4c ea e4 a0 50 71 92 12 38 62 01

     The response of the MMC is 0xc6: Data truncated. That means that the final message (the MMC part) did not reach the MMC correctly. Even though the message that reached the IPMC is a correct IPMI message (the checksum provided is correct, otherwise the response code would have been different), the length of the message is not 24 bytes. To return 0xc6, the MMC has to receive a request with a length between 7 and 10 bytes.
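The ShMC / IPMC / MMC split described above can be reproduced by peeling off the two Send Message layers. This is a rough sketch, assuming each layer is a Send Message request (NetFn App 0x06, command 0x34) whose data is a channel byte followed by the embedded message; PAYLOAD is the last 40 bytes of the 70-byte packet quoted above, and the helper names are made up for this illustration.

```python
# Last 40 bytes of the send_packet dump: the nested IPMI message itself.
PAYLOAD = bytes.fromhex(
    "20 18 c8 81 a8 34 40"
    " 8c 18 5c 10 a8 34 47"
    " 72 b0 de 10 a8 32 00 14 ea 44 30 e0 0a 4c ea e4 a0 50 71 92 12 38 62 01"
    " cd 63"
)


def checksum(data):
    """Two's-complement checksum: the byte that makes the sum 0 modulo 256."""
    return (-sum(data)) & 0xFF


def checksums_ok(msg):
    """True if both the header checksum and the final checksum are valid."""
    return checksum(msg[:2]) == msg[2] and checksum(msg[3:-1]) == msg[-1]


def unwrap_send_message(msg):
    """Strip one Send Message layer and return (channel, embedded message)."""
    net_fn, cmd = msg[1] >> 2, msg[5]
    if (net_fn, cmd) != (0x06, 0x34):
        raise ValueError("not a Send Message request")
    # msg[6] is the channel byte (low nibble = channel number);
    # the embedded message sits between it and the final checksum.
    return msg[6] & 0x0F, msg[7:-1]


shmc_channel, ipmc_msg = unwrap_send_message(PAYLOAD)   # layer handled by the ShMC
ipmc_channel, mmc_msg = unwrap_send_message(ipmc_msg)   # layer handled by the IPMC

for label, msg in (("ShMC", PAYLOAD), ("IPMC", ipmc_msg), ("MMC", mmc_msg)):
    print(f"{label}: {len(msg)} bytes, checksums_ok={checksums_ok(msg)}")
print(f"channels: {shmc_channel} then {ipmc_channel}")
print(f"MMC request: NetFn 0x{mmc_msg[1] >> 2:02X}, cmd 0x{mmc_msg[5]:02X}")
```

Under these assumptions the innermost message is the 24-byte request addressed to 0x72, and the two trailing bytes of the packet (cd 63) are the closing checksums of the IPMC and ShMC layers. That is consistent with the message leaving ipmitool intact, as stated above.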

Hi Thiago,

Sorry for the late reply. Julian is currently on holiday. He will be back on Monday, November 1st.

I am not at all sure that I understand what you are trying to do and what error you are observing. But I do remember that when I was developing software for the serial interface using the bridging capability in a “Send Request” command, there was a problem with the checksum of the response, which was calculated wrongly. That was with version 1.2, and it was entirely fixed in version 1.3, along with many other fixes. Now, what I saw and what you are observing might not be related to the same problem. But I would strongly advise you to check which version you are using and, if it is not 1.3, to update to it.

This is probably not the answer you expected, but I hope it could help. On the other hand, if you are still using 1.2, it probably wouldn’t make sense to try to fix the problem anyway, because it might already have been fixed in 1.3.

Cheers,
Ralf.

Hi, Ralf.

Thanks for getting back to us. I was expecting this suggestion. That’s why I was so reluctant in the past to fork the repository when I was instructed to. A quick look at the situation now:

  • git ls-files | wc -l tells me that I have 633 files in my repository

  • git diff --name-only HEAD..upstream/master | wc -l tells me that 602 files differ between my fork and ipmc-dev.

I certainly did not change so many files. What happened to the upstream repository? Was it fully replaced?

Tracking this can be a nightmare.

Thiago

Hi again, Ralf.

I will try to make a minimal port of version 1.3 to the NSWTP blade in the next few days to check if it solves the issue.

Thanks,

Thiago

Hi

I am having trouble getting started with the current IPMC code. I tried many different approaches without luck. In the end, I made the simplest test I could think of, but even that seems to be crashing for me.

Here is what I did:

  • got a fresh clone of the ipmc-project repository
  • removed the file ipmc-user/user_mainfile.c
  • edited the config.xml file to be similar to what I will need (very basic changes; I left all the details and sensors out, please see attached)
  • executed compile.py which performed the compilation without errors and returned to me some files, including hpm1all.img that I use to program the IPMC
  • programmed our IPMC with that file and activated it
  • the IPMC system immediately reverted to the firmware that was previously used, which suggests to me that the code is seg faulting or something.

I am probably missing something obvious. I would appreciate some feedback.

Thanks

PS: Well, it seems I cannot attach anything here. I will just add the config.xml file below.

<?xml version="1.0" encoding="UTF-8"?>

<IPMC>
	<GeneralConfig>

		<DeviceID>0x12</DeviceID>
		<DeviceRevision>0x00</DeviceRevision>
		<ManufacturerID>0x000060</ManufacturerID>
		<ProductID>0x1236</ProductID>

		<ManufacturingDate>06/01/2017</ManufacturingDate>

		<BoardManuf>Cirly/Addax</BoardManuf>
		<BoardName>TEST_FRUFROLLBACK</BoardName>
		<BoardSN>00001</BoardSN>
		<BoardPN>P580050995</BoardPN>

		<ProductManuf>CERN</ProductManuf>
		<ProductName>IPMC-TestPAD</ProductName>
		<ProductPN>PN00001</ProductPN>
		<ProductSN>0000001</ProductSN>
		<ProductVersion type="major">1</ProductVersion>
		<ProductVersion type="minor">20</ProductVersion>

		<MaxCurrent>30.0</MaxCurrent>
		<MaxInternalCurrent>2.0</MaxInternalCurrent>

		<!-- Hardware -->
		<HandleSwitch active="LOW" inactive="HIGH" />

        <!-- <ResetOnWrongHAEn /> -->
        <!-- <PowerMonitoringEn /> -->
        <!-- <AlertMonitoringEn />-->

        <!-- Shutdown timeout in tens of ms (optional - if not defined: 10s) -->
        <shutdownTimeout>0</shutdownTimeout>

        <nonVolatileParams forced="false" />
	</GeneralConfig>

    <SerialInterfaces>
        <!--
            This part allows connecting the UART port to interfaces.

            The ports 0 to 2 are linked to the hardware:
                port 0: Edge connector (Tx: 57 / Rx: 60)
                port 1: Edge connector (Tx: 58 / Rx: 61)
                port 2: Optional UART (Tx: 75 / Rx: 76)

            Warning: Enabling port 2 will automatically set the GPIOs in UART mode!

            For each port, the following names can be used:
                "SOL": Serial Over Lan
                "SDI": Serial Debug Interface
                "PI": Payload Interface

            The baudrate can be set using the baudrate param. By default,
            it is configured to 115200b/s.
        -->
        <Connect port="0" name="SDI" baudrate="115200"/>
        <!-- <Connect port="1" name="PI"  baudrate="115200"/> -->
        <!-- <Connect port="2" name="SOL"  baudrate="115200" extended="true" /> -->

        <RedirectSDItoSOL/>

    </SerialInterfaces>

	<PowerManagement>

		<PowerONSeq>
			<step>PSQ_ENABLE_SIGNAL(CFG_PAYLOAD_DCDC_EN_SIGNAL)</step>
			<step>PSQ_END</step>
		</PowerONSeq>

		<PowerOFFSeq>
			<step>PSQ_DISABLE_SIGNAL(CFG_PAYLOAD_DCDC_EN_SIGNAL)</step>
			<step>PSQ_END</step>
		</PowerOFFSeq>

	</PowerManagement>

	<LANConfig>

		<MACAddr>0A:0A:0A:0A:0A:86</MACAddr>
		<NetMask>255.255.255.0</NetMask>
		<GatewayIP>192.138.1.3</GatewayIP>

        <UseFlashedMAC />
        <EnableDHCP />

		<IPAddrList> <!-- Default IP Addresses (used if DHCP is not active) -->
			<IPAddr slot_addr="default">192.168.1.34</IPAddr>
			<IPAddr slot_addr="0x41">192.168.1.20</IPAddr>
			<IPAddr slot_addr="0x42">192.168.1.21</IPAddr>
			<IPAddr slot_addr="0x43">192.168.1.22</IPAddr>
			<IPAddr slot_addr="0x44">192.168.1.23</IPAddr>
			<IPAddr slot_addr="0x45">192.168.1.24</IPAddr>
			<IPAddr slot_addr="0x46">192.168.1.25</IPAddr>
			<IPAddr slot_addr="0x47">192.168.1.26</IPAddr>
			<IPAddr slot_addr="0x48">192.168.1.27</IPAddr>
			<IPAddr slot_addr="0x49">192.168.1.28</IPAddr>
			<IPAddr slot_addr="0x4a">192.168.1.29</IPAddr>
			<IPAddr slot_addr="0x4b">192.168.1.30</IPAddr>
			<IPAddr slot_addr="0x4c">192.168.1.31</IPAddr>
			<IPAddr slot_addr="0x4d">192.168.1.32</IPAddr>
			<IPAddr slot_addr="0x4e">192.168.1.33</IPAddr>
			<IPAddr slot_addr="0x4f">192.168.1.34</IPAddr>
			<IPAddr slot_addr="0x50">192.168.1.35</IPAddr>
		</IPAddrList>

	</LANConfig>

	<AMCSlots>

		<AMC site="1">
			<PhysicalPort>1</PhysicalPort>
			<MaxCurrent>6.0</MaxCurrent>
			<PowerGoodTimeout>300</PowerGoodTimeout>
			<DCDCEfficiency>85</DCDCEfficiency>
		</AMC>

		<AMC site="2">
			<PhysicalPort>2</PhysicalPort>
			<MaxCurrent>6.0</MaxCurrent>
			<PowerGoodTimeout>300</PowerGoodTimeout>
			<DCDCEfficiency>85</DCDCEfficiency>
		</AMC>

		<AMC site="3">
			<PhysicalPort>3</PhysicalPort>
			<MaxCurrent>6.0</MaxCurrent>
			<PowerGoodTimeout>300</PowerGoodTimeout>
			<DCDCEfficiency>85</DCDCEfficiency>
		</AMC>

	</AMCSlots>

	<SensorList>
        <Sensors type="raw" global_define="CFG_SENSOR_MCP9801" function_name="SENSOR_MCP9801" rawType="MCP9801">
            <Sensor>
                <Name>Internal temp.</Name>

                <Type>Temperature</Type>
                <Units>degrees C</Units>

                <NominalReading>25</NominalReading>
                <NormalMaximum>60</NormalMaximum>
                <NormalMinimum>-10</NormalMinimum>

                <Point id="0" x="0" y="0" />
                <Point id="1" x="5" y="5" />

                <Thresholds>
                    <UpperNonRecovery>80</UpperNonRecovery>
                    <UpperCritical>60</UpperCritical>
                    <UpperNonCritical>40</UpperNonCritical>
                    <LowerNonRecovery>-20</LowerNonRecovery>
                    <LowerCritical>-10</LowerCritical>
                    <LowerNonCritical>0</LowerNonCritical>
                </Thresholds>

                <Params>
                    <p type="record_id"></p>  <!-- mandatory -->
                    <p type="user">0x090</p>
                    <p type="user">UCGH | LCGL</p>
                </Params>

                <AssertEvMask>0x0A80</AssertEvMask>
                <DeassertEvMask>0x7A80</DeassertEvMask>
                <DiscreteRdMask>0x3838</DiscreteRdMask>
                <AnalogDataFmt>2S_COMPL</AnalogDataFmt>
                <PosHysteresis>2</PosHysteresis>
                <NegHysteresis>2</NegHysteresis>
                <MaxReading>127</MaxReading>
                <MinReading>-128</MinReading>
            </Sensor>
        </Sensors>

        <!-- Example for GPIO sensors:
        <Sensors type="raw" global_define="CFG_SENSOR_GPIO " function_name="SENSOR_GPIO" rawType="GPIOSENSOR">
            <Sensor>
                <Name>GPIOSens Ex.</Name>

                <Type>Processor</Type>

                <Params>
                    <p type="record_id"></p>
                    <p type="user">0x1</p>
                    <p type="user">POWER_GOOD_12V</p>
                </Params>

                <DiscreteRdMask>0x000F</DiscreteRdMask>
            </Sensor>
        </Sensors>
        -->

        <!-- Example for payload sensors:
        <Sensors type="raw" global_define="CFG_SENSOR_PAYLOAD_THRESHOLD" function_name="SENSOR_PAYLOAD_THRESHOLD" rawType="PAYLOADSENSOR_THRESH">
            <Sensor>
                <Name>PayloadSens Ex.</Name>

                <Type>Temperature</Type>
                <Units>degrees C</Units>

                <NominalReading>25</NominalReading>
                <NormalMaximum>60</NormalMaximum>
                <NormalMinimum>-10</NormalMinimum>

                <Point id="0" x="0" y="0" />
                <Point id="1" x="5" y="5" />

                <Thresholds>
                    <UpperNonRecovery>80</UpperNonRecovery>
                    <UpperCritical>60</UpperCritical>
                    <UpperNonCritical>40</UpperNonCritical>
                    <LowerNonRecovery>-20</LowerNonRecovery>
                    <LowerCritical>-10</LowerCritical>
                    <LowerNonCritical>0</LowerNonCritical>
                </Thresholds>

                <Params>
                    <p type="record_id"></p>
                </Params>

                <AssertEvMask>0x0A80</AssertEvMask>
                <DeassertEvMask>0x7A80</DeassertEvMask>
                <DiscreteRdMask>0x3838</DiscreteRdMask>
                <AnalogDataFmt>2S_COMPL</AnalogDataFmt>
                <PosHysteresis>2</PosHysteresis>
                <NegHysteresis>2</NegHysteresis>
                <MaxReading>127</MaxReading>
                <MinReading>-128</MinReading>
            </Sensor>
        </Sensors>
        -->

	</SensorList>
</IPMC>

Dear Thiago,

Sorry for the delay, I have just come back from vacation. As Ralf told you, some problems were found on the I2C bus with version 1.2 and are fixed in version 1.3. As the command might send a lot of data over the I2C bus, this could be the source of your issue.

Concerning your problem with going to 1.3, it is really weird, I’ve just tried again to compile and force and it looks ok. However, it could be because of a version checking issue that the rollback is automatically performed. Could you check what is printed on the serial debug interface when you activate the new version?

Thank you,
best,
Julian

Dear Julian,

Thanks for answering promptly after your return. I hope you enjoyed your time off.

Concerning your problem with going to 1.3, it is really weird, I’ve just tried again to compile and force and it looks ok.

Just to confirm, have you used the same config.xml I sent above?

However, it could be because of a version checking issue that the rollback is automatically performed. Could you check what is printed on the serial debug interface when you activate the new version?

I am afraid I cannot get this. The IPMC serial is connected to a Zynq device on the same blade, and I would only have access to the serial messages after the Zynq itself finishes booting. Can you explain a bit more about this version checking issue?

Also, these IPMCs are an old revision. Could this be the problem? Can you try with a similar one as well?

Thanks,

Thiago

Hi Thiago,

The XML looks good and you should not have any problem using your binary file. When you move from 1.2 to 1.3, the firmware tries to recover the “non-volatile parameters”. However, the parameters changed between 1.2 and 1.3, and at the first boot, since the activate function cannot convert all of the parameters, it automatically performs a rollback. The upgrade can nevertheless be forced by adding the “norollback” instruction when you activate the new firmware.

Connecting to the debug interface would give additional details to confirm that the rollback is triggered by the non-volatile parameter issue. But if you flash with norollback and there really is a problem with the firmware, you could end up needing to flash your IPMC again via JTAG.

Do you have a JTAG cable that you could use, or even a Raspberry Pi, which we now support for resetting the system? If you have a way to flash back the module, you can try adding the “norollback” instruction to the activate command.
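As a rough illustration of that last step, the sketch below assembles an activation command line, assuming ipmitool is the tool used for the activation and that its hpm activate norollback form is what “norollback” refers to here. The helper name and the have_recovery_path guard are invented for this example; the guard simply encodes the advice above not to force the activation without a way to reflash the module. The host address is the one used for direct IPMC access earlier in this thread.

```python
def hpm_activate_command(host, norollback=False, have_recovery_path=False, password=""):
    """Build (but do not run) an ipmitool command that activates uploaded firmware.

    Requesting norollback is refused unless the caller confirms that a
    recovery path (JTAG cable or Raspberry Pi programmer) is available.
    """
    if norollback and not have_recovery_path:
        raise ValueError("norollback requested without a way to reflash the IPMC")
    argv = ["ipmitool", "-H", host, "-P", password, "hpm", "activate"]
    if norollback:
        argv.append("norollback")
    return argv


# Normal activation: the firmware may roll back on its own.
print(" ".join(hpm_activate_command("192.168.0.200")))

# Forced activation, only once a recovery path has been confirmed.
print(" ".join(hpm_activate_command("192.168.0.200", norollback=True,
                                    have_recovery_path=True)))
```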

Cheers,
Julian

Hi, Julian.

I bought the JTAG cable some time ago because I thought we could end up in this kind of situation, but I never actually used it. Do you have instructions on how to use it in the CERN IPMC context? (A Linux solution, please; I don’t have any computer with another operating system around.)

Thanks,

Thiago

Hi Thiago,

Unfortunately, Pigeon Point only provides a solution for Windows, and I am not aware of an equivalent for Linux. But maybe there is a way to run it on Linux using the STAPL file.

However, we have a solution to flash the IPMC with a raspberry pi that is described here: IPMC v3 Image Upgrade Issue - #6 by jumendez

Cheers,
Julian

Thanks, Julian. Can you add this somewhere in the documentation? This way people are able to get the correct items from the start. I will try to get my hands on an RPi then, but it will stall the debug process for some more days.

Cheers,

Thiago

Hi again, Julian.

I just visited the other issue and the picture is not accessible anymore. Can you please provide it again?

Thanks,
Thiago

Hi Thiago,

The issue is reported in the release note: v.1.3.1 · Tags · ep-ese-be-xtca / ipmc-project · GitLab

That you can also find from the documentation section of the cern-ipmc website: CERN-IPMC > Release note v.1.3

Cheers,
Julian

I’ve just tested the link and it works well on my side, but the file takes some time to download. What kind of error are you getting?

Cheers,
Julian

The picture in that link is not available.

Hi, Julian.

The issue is reported in the release note: v.1.3.1 · Tags · ep-ese-be-xtca / ipmc-project · GitLab

That is not what I meant. I think it would be helpful for developers to know what they need to have in order to recover an IPMC. In my case, I only just learned that I need an RPi for that. Had I known this earlier, I would have got one beforehand in case of unexpected results.

Thiago