DV Info Net - SSD HD Failure White Paper

Re: SSD HD Failure White Paper

Abstract of the linked paper:

ABSTRACT
Servers use flash memory based solid state drives (SSDs) as a
high-performance alternative to hard disk drives to store persistent
data. Unfortunately, recent increases in flash density
have also brought about decreases in chip-level reliability. In
a data center environment, flash-based SSD failures can lead
to downtime and, in the worst case, data loss. As a result,
it is important to understand flash memory reliability characteristics
over flash lifetime in a realistic production data
center environment running modern applications and system
software.

This paper presents the first large-scale study of flash-based
SSD reliability in the field. We analyze data collected across
a majority of flash-based solid state drives at Facebook data
centers over nearly four years and many millions of operational
hours in order to understand failure properties and trends of
flash-based SSDs. Our study considers a variety of SSD characteristics,
including: the amount of data written to and read
from flash chips; how data is mapped within the SSD address
space; the amount of data copied, erased, and discarded by the
flash controller; and flash board temperature and bus power.
Based on our field analysis of how flash memory errors manifest
when running modern workloads on modern SSDs, this
paper is the first to make several major observations:

(1) SSD failure rates do not increase monotonically with flash
chip wear; instead they go through several distinct periods
corresponding to how failures emerge and are subsequently
detected, (2) the effects of read disturbance errors are not
prevalent in the field, (3) sparse logical data layout across an
SSD’s physical address space (e.g., non-contiguous data), as
measured by the amount of metadata required to track logical
address translations stored in an SSD-internal DRAM buffer,
can greatly affect SSD failure rate, (4) higher temperatures
lead to higher failure rates, but techniques that throttle SSD
operation appear to greatly reduce the negative reliability impact
of higher temperatures, and (5) data written by the operating
system to flash-based SSDs does not always accurately
indicate the amount of wear induced on flash cells due to optimizations
in the SSD controller and buffering employed in
the system software. We hope that the findings of this first
large-scale flash memory reliability study can inspire others
to develop other publicly-available analyses and novel flash
reliability solutions.

Re: SSD HD Failure White Paper

Also ....

SUMMARY AND CONCLUSIONS

We performed an extensive analysis of the effects of various
factors on flash-based SSD reliability across a majority of the
SSDs employed at Facebook, running production data center
workloads. We analyze a variety of internal and external
characteristics of SSDs and examine how these characteristics
affect the trends for uncorrectable errors. To conclude, we
briefly summarize the key observations from our study and
discuss their implications for SSD and system design.

Observation 1: We observe that SSDs go through several
distinct failure periods – early detection, early failure, usable
life, and wearout – during their lifecycle, corresponding to the
amount of data written to flash chips.

Due to pools of flash blocks with different reliability characteristics,
failure rate in a population does not monotonically
increase with respect to amount of data written to flash chips.
This is unlike the failure rate trends seen in raw flash chips.
We suggest that techniques should be designed to help reduce
or tolerate errors throughout the SSD lifecycle. For example, additional
error correction at the beginning of an SSD’s life could
help reduce the failure rates we see during the early detection
period.
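
To make the "not monotonic" point concrete, here is a minimal sketch (not from the paper) of how one might check it on a fleet log: group drives by how much data has been written to them and compute the failed fraction in each group. The input format, the function names, and the 10 TB bucket width are all assumptions made for illustration, and the sample numbers in the usage note are made up.

```python
from collections import defaultdict


def failure_rate_by_writes(drives, bucket_tb=10.0):
    """Return {bucket_start_tb: failed_fraction} for a population of drives.

    `drives` is an iterable of (tb_written, failed) pairs. Both fields are
    hypothetical stand-ins for whatever a real fleet log records.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tb_written, failed in drives:
        bucket = int(tb_written // bucket_tb)  # index of the write-volume bucket
        totals[bucket] += 1
        if failed:
            failures[bucket] += 1
    return {
        bucket * bucket_tb: failures[bucket] / totals[bucket]
        for bucket in sorted(totals)
    }


def is_monotonic_increasing(rates):
    """True if the failure rate never drops as the write volume grows."""
    values = list(rates.values())
    return all(a <= b for a, b in zip(values, values[1:]))
```

With made-up input such as `[(1, True), (2, False), (12, False), (15, False), (25, True), (28, True)]`, the first bucket has a higher failure rate than the second, so `is_monotonic_increasing` returns `False`. That is the shape the paper describes: failures early on, a quieter usable life, then wearout.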

Observation 2: We find that the effect of read disturbance
errors is not a predominant source of errors in the SSDs we
examine.

While prior work has shown that such errors can occur under
certain access patterns in controlled environments [5, 32,
6, 8], we do not observe this effect across the SSDs we examine.
This corroborates prior work which showed that the effect
of retention errors in flash cells dominates the error rate compared
to read disturbance [32, 6]. It may be beneficial to perform
a more detailed study of the effect of these types of errors in
flash-based SSDs used in servers.

Observation 3: Sparse data layout across an SSD’s physical
address space (e.g., non-contiguously allocated data) leads
to high SSD failure rates; dense data layout (e.g., contiguous
data) can also negatively impact reliability under certain conditions,
likely due to adversarial access patterns.

Further research into flash write coalescing policies with information
from the system level may help improve SSD reliability.
For example, information about write access patterns
from the operating system could potentially inform SSD controllers
of non-contiguous data that is accessed very frequently,
which may be one type of access pattern that adversely affects
SSD reliability and is a candidate for storing in a separate
write buffer.
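
A rough way to picture "sparse" versus "dense" layout, as a sketch only: count how many contiguous runs (extents) a set of logical block addresses breaks into. The assumption here, which is not taken from the paper, is that a controller needs roughly one translation entry per contiguous run, so more runs means more mapping metadata in the SSD's DRAM buffer. The function name and the example addresses are hypothetical.

```python
def count_extents(block_addresses):
    """Count contiguous runs in a collection of logical block addresses.

    Assumption for illustration: each run needs one translation entry, so
    the count is a crude stand-in for the amount of mapping metadata.
    """
    blocks = sorted(set(block_addresses))
    if not blocks:
        return 0
    extents = 1
    for previous, current in zip(blocks, blocks[1:]):
        if current != previous + 1:  # a gap starts a new run
            extents += 1
    return extents
```

Eight blocks written back to back, `count_extents(range(100, 108))`, form a single run, while eight scattered blocks such as `[3, 10, 11, 42, 90, 91, 92, 200]` form five. Same amount of data, several times the bookkeeping.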

Observation 4: Higher temperatures lead to increased failure
rates, but do so most noticeably for SSDs that do not employ
throttling techniques.

In general, we find techniques like throttling, which may
be employed to reduce SSD temperature, to be effective at
reducing the failure rate of SSDs. We also find that SSD
temperature is correlated with the power used to transmit
data across the PCIe bus, which can potentially be used as
a proxy for temperature in the absence of SSD temperature
sensors.
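
The proxy idea can be sketched as a simple least-squares line fitted on drives that do have a temperature sensor, then applied to drives that only report bus power. The paper reports a correlation only; the linear model, the function names, and the units (watts, degrees Celsius) are assumptions for illustration, and the numbers in the usage note are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching (x, y) samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_temperature(power_w, slope, intercept):
    """Estimate temperature (deg C) from bus power (W) using the fitted line."""
    return slope * power_w + intercept
```

For example, fitting made-up samples of 8, 10 and 12 W against 40, 45 and 50 degrees gives a slope of 2.5 and an intercept of 20, so a drive drawing 11 W would be estimated at 47.5 degrees. A real fleet would need its own calibration per drive model.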

Observation 5: The amount of data reported to be written
by the system software can overstate the amount of data actually
written to flash chips, due to system-level buffering and
wear reduction techniques.

Techniques that simply reduce the rate of software-level
writes may not reduce the failure rate of SSDs. Studies seeking
to model the effects of reducing software-level writes on flash
reliability should also consider how other aspects of SSD operation,
such as system-level buffering and SSD controller wear
leveling, affect the actual amount of data written to SSDs.
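
A small sketch of the comparison Observation 5 implies: put the write volume reported by system software next to the volume that actually reached the flash chips. The two counters and the function names are hypothetical; how (or whether) a given drive exposes such counters is not something the quoted text specifies.

```python
def flash_to_host_write_ratio(host_bytes_written, flash_bytes_written):
    """Ratio of bytes that reached flash to bytes the system software reported.

    Both arguments are hypothetical counters. A ratio below 1.0 means the
    software-level figure overstates the wear actually put on the flash.
    """
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


def host_counter_overstates(host_bytes_written, flash_bytes_written):
    """True when software-reported writes exceed what was written to flash."""
    return flash_to_host_write_ratio(host_bytes_written, flash_bytes_written) < 1.0
```

If software reports 100 units written and only 60 reached the chips, the ratio is 0.6, and cutting software-level writes further may buy less endurance than the software-side number suggests.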

Conclusions. We hope that our new observations, with
real workloads and real systems from the field, can aid in (1)
understanding the effects of different factors, including system
software, applications, and SSD controllers on flash memory
reliability, (2) the design of more reliable flash architectures
and systems, and (3) improving the evaluation methodologies
for future flash memory reliability studies.