A tale of two Ethernets

Wed, May 14, 2025 - 12 min read

This is a story from my workplace, so obviously I’m not going to disclose much in terms of specifics; most of what you read here can be pieced together anyway from the official documentation of our products and from the LKML, with my narrative added.


When it comes to Ethernet, I’ve had my fair share of struggles. From trying to do PCB design to getting lost in the letter slurry of the IEEE 802.3-2022 specification, this little interface makes our lives both much easier (by displacing a variety of older interfaces, thereby unifying the ecosystem) and much harder at the same time.

Ethernet is an old protocol. Designed in the early ’70s, its primary design principle was to be cheap to implement. Of course, over the decades a lot of stuff was tacked on, the maximum data rate increased by a factor of roughly 400,000, and now the whole thing is anything but cheap and easy to implement. And as is customary, there are now multiple ways of doing things. For instance, for configuring an Ethernet PHY (physical layer transceiver, the analog/mixed-signal circuit that implements OSI Layer 1 by turning data words into analog signals), there’s:

  • Management Data Input/Output (MDIO)
  • Serial Management Interface (SMI)
  • Media Independent Interface Management (MIIM)
  • and I²C

The first three interfaces are supposedly the same, but some ICs’ datasheets will differentiate e.g. ā€œSMI modeā€ and ā€œMIIM modeā€, so no one really knows. At this point I have read through the IEEE spec twice already, and I still have no clue as to whether there should be a difference, and if so, what it is. And that is even before we get to how to actually transfer the data to/from the PHY, which will involve one of these interfaces (not all of them are defined in 802.3, some are external standards): AUI, TBI, MII, RMII, SMII, RTBI, GMII, RGMII, XGMII, XAUI, RXAUI, SGMII, HSGMII, QSGMII, PSGMII, XFI, 25GAUI, LAUI, CAUI… And the list doesn’t end there, but you get the point (and also there are different versions of these out there).
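
As an aside, if you ever want to see what ā€œtalking MDIOā€ looks like from the comfort of Linux userspace, many MAC drivers expose the management bus through the old MII ioctls (the same path mii-tool uses). Below is a minimal sketch that reads a PHY’s ID registers; ā€œeth0ā€ is a placeholder, and not every driver implements these ioctls, so take it as an illustration rather than a universal tool.

```c
/* Read the PHY ID registers (Clause 22 registers 2 and 3) over MDIO,
 * via the SIOCGMIIPHY/SIOCGMIIREG ioctls. Build: gcc -o miiread miiread.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/mii.h>
#include <linux/sockios.h>

int main(void)
{
    struct ifreq ifr;
    struct mii_ioctl_data *mii = (struct mii_ioctl_data *)&ifr.ifr_data;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) { perror("socket"); return 1; }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

    /* Ask the MAC driver which PHY address it talks to on its MDIO bus. */
    if (ioctl(fd, SIOCGMIIPHY, &ifr) < 0) { perror("SIOCGMIIPHY"); return 1; }

    unsigned int id = 0;
    for (int reg = MII_PHYSID1; reg <= MII_PHYSID2; reg++) {
        mii->reg_num = reg;
        if (ioctl(fd, SIOCGMIIREG, &ifr) < 0) { perror("SIOCGMIIREG"); return 1; }
        id = (id << 16) | mii->val_out;
    }
    printf("PHY at MDIO address %u, ID 0x%08x\n", mii->phy_id, id);

    close(fd);
    return 0;
}
```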

Anyhow, rambling aside (I will make a more comprehensive post on Ethernet anyway), let’s get into the meat of the story. The year is 2022, and one of the company’s earliest product lines is in need of a refresh.

A new hero is born

I’m talking about the Profield RTU family, designed for use by the electric grid control sector. The first generation was based on an embedded i86 and ran a completely custom software stack. Legend has it that one of the engineers eventually got so tired of having to compile, flash an EEPROM, test, fix bugs, and repeat, that they instead ported one of the DOSes to the box to develop, compile and test their code locally. The whole system then consisted of two levels of rack-mount units, each having multiple cards, most of which had their own onboard processors running part of the code.

Third-generation Profield-C head unit. Source: official product page

This architecture worked wonders in the ’90s and into the 2000s, but eventually it was time to move on. There had been two more generations, Profield-B and -C, each equipped with ever faster external interfaces. By this time the communication controller card had a whole other, bigger CPU running Linux, at which point they realized you could cram some of the functionality into it. Thus was born the next generation: the Intelligent Electronic Device (IED), a single rack-mount box that could not only act as the top-level unit in the former two-level system, but also work on its own, performing the duties of the lower level as well. Eventually though, this system became dated as well, running on an old NXP CPU (still alongside the embedded i86, of course, which at this point was relegated to a ā€œrealtime co-processorā€, so to speak; analogous to how one might have a Cortex-R core among the A cores in a modern SoC), and having good ol’ copper-based Fast Ethernet for a blazing speed of 10/100 Mbps!

And so the development of the fifth-generation Profield-D RTU began. The idea was to follow the logic of our other, fairly successful RTU’s architecture: a central Linux processor connected to Ethernet, with a bunch of I/O cards, external interface communication controllers, measuring instruments etc. hooked up via a CAN backplane. But, of course, the old i86 coprocessor was retained for backwards compatibility with older-generation cards.

Old man yells at Cloud (at Gigabit speed)

Early on, it was decided that since the future is Fiber Optic Ethernet, the new RTU must support it. Gone are the days of pulling twisted pair everywhere; especially in electrical distribution, where noise is plentiful and Remote Terminal Units are, well, remote, most networks are glass fiber nowadays. Some previous RTUs only had copper Ethernet, and thus required adding an external media converter, then having to power said converter, then having to ensure said power would absolutely not go out unless the main power source of the entire RTU cut out, in which case you’re toast anyway (figuratively of course; the system is designed to enter a failsafe mode on blackout. But if the RTU stays on while communication cuts out, you’re left with a unit that works but can’t be reached, which can be inconvenient, although thankfully still not life-threatening).

The new SoC had Gigabit Ethernet, which was good, since we had to one-up the previous generation in interface speed (the wonders of ā€œnumbers go upā€ marketing…). However, the interface it exposed was RGMII, which is not really suitable for Ethernet over Fiber.

Side-note: dipping our toes into the letter soup

If you’re not intimately familiar with the Ethernet architecture, good for you, and I’m envious already. Unfortunately for you though, you clicked on this page and thus have to come with me on a journey of suffering and data paths and signal baseband processing.

Ethernet PMD (Physical Media Dependent) architecture overview, showing the OSI layers and the relevant Ethernet interfaces. Source: Semiconductor Engineering

Okay, not really. At least not today. All you need to know for now is the OSI layer model, and only the bottom two layers of it. The whole Ethernet stack lives within these two layers.

We have the Physical Layer, which is the medium in which some analog signals carry our Ethernet packets. The Ethernet spec calls this the Media Dependent Interface (MDI), and it can be coax cable, twisted pair, glass fiber and a few more exotic types of media. These analog signals are sent and received by an analog IC called the PHY transceiver. In the case of Fiber Ethernet, it is common for the PHY to sit in an external, hot-pluggable, metal-cased module, called the Small Form-factor Pluggable (SFP) module. This is because there are way more flavors of Fiber MDI, using different wavelengths, modulation, and different-sized fiber cores for the cabling (which affects the propagation of electromagnetic radiation — I won’t go into the physics, but this is why there is ā€˜SM’ and ā€˜MM’ fiber, and if you mix them up, as we at times did, it won’t work, or at least not reliably), and even transceivers using multiple of these simultaneously for increased bandwidth. Therefore, it’s easier to make the transceiver user-replaceable than to have just one option built into your network device. It is also, coincidentally, easier for a user to accidentally mix them up and end up with two incompatible modules on each end of the fiber, wondering why the link won’t come up (if someone hands you an optic cable with an SFP module hanging off it, never blindly trust it. Always double-check both ends, and the cable type as well. Especially if they aren’t well-versed in Fiber Ethernet — a fact they will most likely only admit after you have questioned them on why they handed you a mismatched set).

The second layer, the Data Link Layer, is essentially what Linux userspace sees: an API to send and receive frames. This is mostly managed by a communications controller peripheral in our SoC (or in your network card, in a PC) called the Media Access Controller (MAC). It has a MAC address that other stations can address frames to.
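
To make ā€œan API to send and receive framesā€ a bit more tangible, here is a minimal sketch of handing a raw Ethernet frame straight to the MAC through an AF_PACKET socket, which is about as close as userspace normally gets to Layer 2. It needs CAP_NET_RAW; ā€œeth0ā€ and the experimental EtherType are placeholders.

```c
/* Hand a raw Ethernet frame (dst MAC, src MAC, EtherType, payload) to the
 * MAC via an AF_PACKET socket. Needs CAP_NET_RAW / root. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <net/ethernet.h>
#include <netpacket/packet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Build the frame: 14-byte Ethernet header, then payload, padded to
     * the 60-byte minimum (the MAC appends the FCS for us). */
    unsigned char frame[ETH_ZLEN] = { 0 };
    struct ether_header *eh = (struct ether_header *)frame;
    memset(eh->ether_dhost, 0xff, ETH_ALEN);      /* broadcast destination */
    /* ether_shost left zeroed; real code would put our own MAC here. */
    eh->ether_type = htons(0x88B5);               /* "local experimental" */
    memcpy(frame + sizeof(*eh), "hello", 5);

    /* Tell the kernel which interface (i.e. which MAC) to send it out of. */
    struct sockaddr_ll addr = {
        .sll_family  = AF_PACKET,
        .sll_ifindex = if_nametoindex("eth0"),
        .sll_halen   = ETH_ALEN,
    };
    memset(addr.sll_addr, 0xff, ETH_ALEN);

    if (sendto(fd, frame, sizeof(frame), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}
```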

The PHY and MAC are normally connected by the Media Independent Interface (MII), which carries the data, and MDIO, which, as discussed earlier, is for configuration (link speed and duplex, auto-negotiation, flow control etc.).

Now, as mentioned in the beginning, there are many flavors of MII. In our case, the SoC had Reduced Gigabit MII (RGMII), which is a reduced pin-count interface carrying Gigabit Ethernet traffic. It is essentially a 2Ɨ4-bit parallel, full-duplex, synchronous DDR bus with separate TX and RX clocks and a control line in each direction. SFP modules, on the other hand, use a different flavor: Serial Gigabit MII (SGMII), which is a differential serial TX/RX pair, 4 lines in total. They also have some single-ended control and status signals, plus an I²C connection instead of MDIO, mostly for identification (modules have an EEPROM storing the module’s type, basic capabilities, serial number etc.), though some also provide diagnostics (thermal sensors, for instance).
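
And since we’re here, this is roughly what that I²C identification traffic looks like if you poke an SFP module by hand through i2c-dev. Per SFF-8472 the module answers at address 0x50 (the ā€œA0hā€ page: identification) and, if it has diagnostics, at 0x51 (the ā€œA2hā€ page: live sensor readings). The bus path below is a placeholder, and on a live system the kernel’s own SFP driver may already be sitting on these addresses, so treat it as an illustration.

```c
/* Poke an SFP module's EEPROM from userspace over i2c-dev (SFF-8472). */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/i2c-dev.h>

static int read_at(int fd, uint8_t addr, uint8_t off, uint8_t *buf, int len)
{
    if (ioctl(fd, I2C_SLAVE, addr) < 0)   /* select 0x50 (A0h) or 0x51 (A2h) */
        return -1;
    if (write(fd, &off, 1) != 1)          /* set the byte offset pointer */
        return -1;
    return read(fd, buf, len) == len ? 0 : -1;
}

int main(void)
{
    int fd = open("/dev/i2c-2", O_RDWR);  /* placeholder bus */
    if (fd < 0) { perror("open"); return 1; }

    uint8_t id, vendor[17] = { 0 }, temp[2];

    read_at(fd, 0x50, 0, &id, 1);         /* byte 0: 0x03 means "SFP" */
    read_at(fd, 0x50, 20, vendor, 16);    /* bytes 20-35: vendor name, ASCII */
    printf("identifier 0x%02x, vendor '%.16s'\n", id, vendor);

    /* Bytes 96-97 of the A2h page: module temperature, signed 1/256 degC. */
    if (read_at(fd, 0x51, 96, temp, 2) == 0)
        printf("temperature %.2f degC\n",
               (int16_t)((temp[0] << 8) | temp[1]) / 256.0);

    close(fd);
    return 0;
}
```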

The road to Hell is paved with good intentions

Internal Ethernet circuitry of the star of our show, the MAX24287. Source: Microsemi, Jotrin

To solve this disparity, despite our best efforts to dissuade them, the hardware department added a MAX24287 Serial/Parallel MII SERDES IC. As someone who has dabbled a bit in Ethernet before, I was horrified when I heard this. It’s as if someone said they had found an IC that would finally convert their UART to SPI. A more sensible option would have been to a) find an SoC that has SGMII (they exist), or b) attach an SGMII-equipped network card via some other interface (USB 3.0, PCIe, SDIO), or, if all else fails, c) use a proper managed switch. But alas, the first run of boards was manufactured, and we got to work on the Linux system to breathe life into them.

Luckily, there was already a driver on GitHub… for Linux 2.6. One of the hardware devs started to port it to Linux 5.4, and managed to get impressively far with it (i.e. it didn’t immediately crash and burn). My initial task was to try and get interrupts working, which he couldn’t for the life of him manage to do. I spent a good deal of time untangling the code, only to find out that the IRQ handler code was utterly dysfunctional, as if the original writer had given up writing it mid-statement. Later I would find out why: this IC was almost impossible to do proper IRQ handling with. The IRQ status bits were ā€œstickyā€, meaning they would latch and stay latched until both the condition ceased AND the flag was cleared via a register operation. But since you had no way of knowing when the condition ended, the only way to do that was to keep trying to clear the interrupt flags in a polling manner until they actually cleared, sort of ā€œhalf-pollingā€ the device.
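
For illustration, the workaround ends up looking roughly like the sketch below. The register name, its value and the event handling are hypothetical placeholders (this is not the actual driver code); the point is the clear-and-re-check loop that the sticky bits force on you.

```c
#include <linux/interrupt.h>
#include <linux/phy.h>

/* Hypothetical names and values, for illustration only. */
#define MAX24287_IR			0x00
#define MAX24287_IR_CLEAR_RETRIES	10

static irqreturn_t max24287_irq_thread(int irq, void *dev_id)
{
	struct phy_device *phydev = dev_id;
	int status, tries = 0;

	do {
		status = phy_read(phydev, MAX24287_IR);
		if (status <= 0)
			break;	/* nothing latched any more (or a read error) */

		/* ...act on whatever was reported (link change etc.)... */

		/* Try to clear the latched bits. They only actually clear
		 * once the underlying condition has gone away, so re-read
		 * and retry until the register finally reads back 0. */
		phy_write(phydev, MAX24287_IR, status);
	} while (++tries < MAX24287_IR_CLEAR_RETRIES);

	return IRQ_HANDLED;
}
```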

Another issue arose when we tried to add I²C management support. So far, the only thing we had was the SoC’s MDIO wired up to the MAX24287, which was trying its darnedest to give us the illusion of a copper PHY. Luckily, the SFP module was connected to one of the SoC’s I²C buses, but as soon as we added its address to the Device Tree description, the whole interface came crumbling down. This was, apparently, because now that the Ethernet interface was assigned to an SFP port, Linux decided to use the more modern and flexible phylink API to manage it. Needless to say, the driver absolutely wasn’t ready for this, and thus was completely left out of the whole ordeal. In the end, phylink tried to connect the SoC’s MAC, which reported that it had an interface meant for twisted pair Ethernet, to the SFP port, and got very confused (i.e. errored out and completely disabled the interface).
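
For context, phylink drives the MAC through a set of callbacks the driver registers; the sketch below shows roughly what the 5.4-era validate hook looks like (the ops and their signatures have shifted in later kernels, and ā€œmymacā€ is a made-up driver name). This is the spot where an RGMII-only MAC, asked to serve an SFP port, ends up reporting that nothing it supports will work, after which the interface gets shut down.

```c
#include <linux/ethtool.h>
#include <linux/linkmode.h>
#include <linux/phy.h>
#include <linux/phylink.h>

static void mymac_validate(struct phylink_config *config,
			   unsigned long *supported,
			   struct phylink_link_state *state)
{
	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };

	/* This MAC only speaks RGMII; anything else gets an empty mask,
	 * which phylink treats as "this combination cannot work". */
	if (state->interface != PHY_INTERFACE_MODE_NA &&
	    !phy_interface_mode_is_rgmii(state->interface)) {
		linkmode_zero(supported);
		return;
	}

	phylink_set(mask, Autoneg);
	phylink_set(mask, 10baseT_Full);
	phylink_set(mask, 100baseT_Full);
	phylink_set(mask, 1000baseT_Full);

	linkmode_and(supported, supported, mask);
	linkmode_and(state->advertising, state->advertising, mask);
}
```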

A funny thing then happened: each time I removed and re-inserted the SFP module (which I was doing a lot while trying to debug the problem), a new hwmon device would appear. You see, this was one of the smarter modules, which had an I²C thermal sensor built in. Linux did get so far as to read this fact out of the module’s EEPROM, and it did successfully probe the sensor and add it as a new device. However, when phylink errored out, the sensor was not removed. On its own, that wouldn’t be such a big deal, as one extra device wouldn’t hurt anyone, and the sensor was operating correctly independently of the networking stack anyway. But the code that ran when the module was detected as unplugged only destroyed the associated sensor devices if the SFP port had been fully set up beforehand. Since it was in an error state instead, the cleanup code was not run when the unplug event was handled, leaving behind a stale temperature sensor device ā€œstuckā€ on the last reading; when the module was plugged back in, a new device would be created (interestingly, both would continue reading normally, displaying the same values, until the module was yanked again). Then, at one point during my testing, I ended up with 7–8 hwmon devices and started to suspect something was amiss. This was all eventually fixed in e96b2933152f (ā€œnet: sfp: Always call sfp_sm_mod_remove() on removeā€).
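
In pseudo-driver terms, the bug boiled down to the pattern sketched below (hypothetical names, not the actual sfp.c code): the teardown path was gated on the port having been fully set up, so a port stuck in an error state leaked one hwmon device per unplug.

```c
/* Simplified illustration of the leak; all names are made up. */
struct port_state {
	int fully_set_up;   /* never becomes true if probing errored out */
	void *hwmon_dev;    /* created as soon as the EEPROM advertises
	                     * diagnostic support */
};

static void handle_module_unplug(struct port_state *port)
{
	if (!port->fully_set_up)
		return;     /* the bug: error-state ports skip the cleanup */

	remove_hwmon_device(port->hwmon_dev);
	port->hwmon_dev = NULL;
	/* The fix, in spirit: always tear down whatever the insert path
	 * created, no matter how far the port setup got. */
}
```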

In the end, sadly, we had to let go of SFP management. There simply was no time to iron it all out, so we told the technicians to live with that, and shipped it in 2024. After all, the rest of the system worked beautifully (even interfacing with the ol’ i86 turned out to be not that hard — just a UART with an industry-standard protocol on top). Then we reprioritized and I got assigned to solving yet another impossible task. But in my dreams (nightmares?) I pour petrol on this sorry excuse of a driver and write it again from scratch, using modern APIs that would allow me to express that this abomination of an IC is, in fact, capable of turning RGMII into (headaches and) SGMII. And indeed, the data path was the only part where we had no problems at all.