[OmniOS-discuss] Overheating faults with ST4000NM0023

Saso Kiselkov skiselkov.ml at gmail.com
Tue Apr 15 22:30:20 UTC 2014


Hi,

I've hit this exact same issue on my recent SEAGATE ST2000NM0023 drives.
Can you please direct me to where I can get the firmware package?
Perhaps we could also post the link publicly, so that people can find it
through Google or some such method.

Thanks!

Best wishes,
-- 
Saso

On 2/13/14, 11:18 AM, Thibault VINCENT wrote:
> On 02/12/2014 09:59 PM, Steamer wrote:
>> Did you ever find a solution to the overheating faults with the
>> ST4000NM0023?
>>  
>> I'm currently having the exact same issue with ST1000NM0023 drives,
>> seems like Seagate has the user temp probe set at 40°C. The manual
>> states that the temperature settings are programmable via SMART, but I
>> haven't found a way to do that.
> 
> Hello Emile,
> 
> I've found a workaround, but the definitive fix should be handled by
> Illumos I guess. There is no open ticket yet; I was first waiting for
> something to happen with #4051 before going back to using that distro
> and kernel.
> 
> Here's the story:
> The SCSI specification defines two registers to store the temperature
> thresholds in SMART data. One contains the recommended maximum operation
> temperature for best MTBF, and the other register is for the absolute
> maximum rating. Until now the industry has usually put the same value in
> both, namely the absolute maximum. That's why we always see
> something like 60/65°C from SMART. But recently Seagate has changed that
> because it was asked by a large OS company to comply with the
> specification for better hardware monitoring integration. The change did
> not only occur in newer products but also in a firmware update for
> existing disks, which was applied on the production line. That explains
> why some disks may or may not expose this problem although they are the
> same model. Our disks are of the Megalodon series and all share the same
> firmware codebase.
> 
> So any Seagate disk will now trigger faults in FMA if it has a
> firmware with the newer policy. Also I think other brands will follow
> the same path.
> 
> Like other members suggested in that thread, maybe nothing should change
> in FMA but let's face it, you can't maintain a temperature steadily
> under 40°C in a JBOD of hundreds of busy disks. Especially in
> eco-friendly datacenters. IMHO we should not trigger a fault on the
> lower threshold, and certainly not a drive retirement. It breaks storage
> servers on reboot or before a pool import, also spare disks could
> disappear with the retirement triggered.
> 
> The workaround is to downgrade firmware to the last version before the
> change, and to reset the register with an SCSI command. It is not
> possible to set the register to a user-specified value as the
> documentation suggests; they confirmed it.
> 
> I'm sending a working firmware to you in a private mail. I'm not aware
> of any issue working with that older version and hopefully it should
> upload to 1TB drives as well.
> I'm applying it like this, but from Linux, not OmniOS:
> # ./dl_sea_fw-0.2.3_32 -f Megalodon_StdOEM_SAS_0002+C84C.lod -m ST4000NM0023
> # ./dl_sea_fw-0.2.3_32 -i
> 
> Then you should reset the drives so they reload the firmware.
> Here's our example for 4TB drives:
> -------------
> for i in $(lsscsi | grep 'ST4000NM0023' | awk '{print $6}') ; do
>   sg_reset -d $i
> done
> -------------
> 
> Then reset the register, which still contains the value from the
> previous firmware. The reset doesn't always take, so we've got this
> script to run a few times until all disks have it. Again it matches
> 4TB Megalodon drives.
> -------------
> for i in $(lsscsi | grep 'ST4000NM0023' | awk '{print $6}') ; do
>   echo -n "$i "
>   if sg_logs $i --page=0x0d | grep 'Reference temperature = 68 C' > /dev/null ; then
>     echo 'ok'
>   else
>     sg_logs $i --page=0x0d --reset
>     echo 'reset'
>   fi
> done
> -------------
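
A minimal sketch of how the "run it a few times" step above could be
automated, assuming the same lsscsi and sg_logs output as in the quoted
script. The function names and the MODEL, WANT and MAX_PASSES variables
are hypothetical, chosen only for illustration, and the default limit of
5 passes is arbitrary. It has not been tried on real drives.

```shell
#!/bin/sh
# Sketch only: retry the temperature-page reset until every matching disk
# reports the expected reference temperature, or a pass limit is reached.
# MODEL, WANT and MAX_PASSES are illustrative names, not from the original mail.

MODEL=${MODEL:-ST4000NM0023}
WANT=${WANT:-Reference temperature = 68 C}
MAX_PASSES=${MAX_PASSES:-5}

# Print the device node of every disk whose lsscsi line mentions $MODEL.
# Column 6 is assumed to be the device node, as in the quoted script.
list_disks() {
  lsscsi | grep "$MODEL" | awk '{print $6}'
}

# Succeed if the temperature log page (0x0d) of disk $1 already shows $WANT.
temp_ok() {
  sg_logs "$1" --page=0x0d | grep -q "$WANT"
}

# Make up to $MAX_PASSES passes over the disks, resetting the page where
# needed. Returns 0 as soon as a pass finds every disk ok, 1 otherwise.
reset_all() {
  pass=1
  while [ "$pass" -le "$MAX_PASSES" ]; do
    pending=0
    for dev in $(list_disks); do
      if temp_ok "$dev"; then
        echo "$dev ok"
      else
        sg_logs "$dev" --page=0x0d --reset
        echo "$dev reset"
        pending=$((pending + 1))
      fi
    done
    [ "$pending" -eq 0 ] && return 0
    pass=$((pass + 1))
  done
  return 1
}
```

Calling reset_all as root after the firmware download and sg_reset steps
would then replace the manual re-runs. A non-zero exit status means some
disk still showed a different reference temperature after the last pass.
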
> 
> 
> Cheers
> 


