It takes time and effort to debug hardware and software and get a product right, but some bugs are hard to reproduce or only appear over time. It appears that some Intel Atom C2000 series processors for microservers may stop working after about 18 months, with the likelihood of failure increasing over time, because clock signals stop functioning.
This is documented in the Intel Atom Processor C2000 Product Family Specification Update, where erratum AVR54 explains the issue:
AVR54. System May Experience Inability to Boot or May Cease Operation
Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.
Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot.
Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.
Status: For the steppings affected, see Table 1, “Errata Summary Table” on page 9.
The table on page 9 shows that stepping “B0” suffers from this problem. The issue affects existing motherboards and servers based on the Atom C2000, and companies like Cisco will provide replacements:
Recently, Cisco became aware of an issue related to a component manufactured by one supplier that affects some Cisco products. In some units, we have seen the clock signal component degrade over time. Although the Cisco products with this component are currently performing normally, we expect product failures to increase over the years, beginning after the unit has been in operation for approximately 18 months. Once the component has failed, the system will stop functioning, will not boot, and is not recoverable. This component is also used by other companies.
We have identified all Cisco products that have this component and worked with the supplier to quickly put a fix in place. All products shipping currently do not have this issue. To support our customers and partners, Cisco will proactively provide replacement products under warranty or covered by any valid services contract dated as of November 16, 2016, which have this component. Due to the age-based nature of the failure and the volume of replacements, we will be prioritizing orders based on the products’ time in operation.
The good news is that a new revision of the chip fixes the issue for new processors, but there’s no fix for older ones. So if you own any such systems and they have suddenly stopped working or become unstable, this may be the reason. Even if your system still works, you may want to check whether you can get a replacement while it is still under warranty.
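To check whether a Linux machine is built around one of these processors, a minimal sketch along the following lines may help. It parses `/proc/cpuinfo` and reports the model name and stepping. The function names and the model-name pattern (something like "Intel(R) Atom(TM) CPU C2750") are assumptions made for illustration, and the script does not try to map the numeric stepping value to Intel's "B0" label: that mapping should be checked against the specification update.

```python
import re

# Assumed pattern: C2000 parts report a model name containing "Atom" followed
# by a part number such as C2750. Adjust it if your system reports otherwise.
C2000_PATTERN = re.compile(r"Atom.*\bC2\d{3}\b", re.IGNORECASE)


def parse_cpuinfo(text):
    """Return the key/value fields of the first processor block."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_atom_c2000(fields):
    """Guess from the model name whether this is an Atom C2000 part."""
    return bool(C2000_PATTERN.search(fields.get("model name", "")))


def main():
    try:
        with open("/proc/cpuinfo") as f:
            fields = parse_cpuinfo(f.read())
    except OSError:
        print("Could not read /proc/cpuinfo (not a Linux system?)")
        return
    print("Model name:", fields.get("model name", "unknown"))
    print("Stepping:  ", fields.get("stepping", "unknown"))
    if is_atom_c2000(fields):
        print("Looks like an Atom C2000; compare the stepping with Intel's errata table.")
    else:
        print("Does not look like an Atom C2000.")


if __name__ == "__main__":
    main()
```

The script only identifies the processor family; whether a given board is affected, and whether the vendor offers a replacement, still has to be confirmed with the manufacturer.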
Thanks to Mike for the tip.
Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.
“Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.”
What does that mean? Is that a software/firmware workaround for existing hardware, or does the workaround mean replacing the Atom for a new Atom?
@Sander
I understood it as a new revision of the chip will have a fix. But maybe “Platform level change” means something else…
The table on page 9 of the PDF document shows there’s no fix for AVR54 for “stepping B0” processors.
Most probably means exchanging some of the hardware, but “platform level change” is a good PR workaround… hehe
… and, being a SoC, this will mean “replace the motherboard/cpuboard” :-/
At my work, I’ve forwarded this information to my IP/Transport manager; several Cisco products are affected, for instance some of the Nexus 9000 family, ASA5500, ISR4300 and so on.
@CNX: should “Intel Celeron C2000” be “Intel Atom C2000”?
Curious what the cost is in parts and labour to replace one, and also what the expected in-use life of such a server is.
@Jean-Luc Aufranc (CNXSoft) That generally means that you’re getting a new board in most cases. The phrase is PR-speak to cover for the fact that the vendors, if this stuff is under warranty, are going to be EATING this. @theguyuk Cost is probably the BoM cost of the board itself: since the device is soldered ONTO the board, as most SoCs are, it means replacement of the whole board. In most cases, this equates to pitching the affected device into the recycle bin and shipping you a new board/machine. It’d be like if someone fubared something on the…
And on that note…I’ll observe that you want Intel for WHAT reasons? >;-D
@theguyuk
Cisco indicated in their advisory that the systems generally start failing around 18 months. I have an ASRock C2750D4I that has the affected stepping and it’s been chugging along as a light duty home NAS for about a year now. I’m interested to see how this plays out for the little guys without Cisco support contracts because ASRock only has a 1 year warranty. Hopefully Intel will make good on it like they did with the Pentium FP bug. If not, I’ll soon be able to justify that D1541 to the wife 😉
@sandbender Opening a system and changing a CPU or motherboard affects the failure rate, because disturbing connections means more chance of errors. Then you have data security issues. For a home NAS, the vendor could replace it or offer you money off an upgrade. (Not everyone can open it, take the bits out and replace them, as not all are hardware trained or interested.) Cisco being involved means, I guess, business contracts. If you have it replaced, is it warranted, and for how long? Are you getting new for old, or repaired/refurbished replacements? You can go on and on. Motherboards heating and cooling become acceptable to more…
“platform level change” means you must switch to the new CPU family. AMD for example 😉
@theguyuk I think you misunderstood, I have a custom NAS built around a C2750D4I motherboard, I was just wondering about possible RMAs for the board itself. Not a whole NAS. Apparently SuperMicro is already offering a fix for their boards if requested, even if the board hasn’t failed yet. The fault is very specific and not something you could easily fake or ascribe to a different fault like cracking traces on the board (only two CLK pins die on the actual chip, you can’t just overvolt it and claim it failed because of this issue). @Jean-Luc Aufranc (CNXSoft) After…