1. Most Technical Debt Is Just Bullshit

    September 25, 2020

    Introduction

    I made an offhand remark about technical debt to a friend and he interrupted me, saying: "technical debt is just bullshit". In his experience, people talking about technical debt were mostly trying to:

    • cover up bad code
    • cover up unfinished work

    [Image: mess; source1]

    Calling these issues 'technical debt' seems to be a tactic of distancing oneself from these problems. A nice way of avoiding responsibility. To sweep things under the rug.

    Intrigued, I decided to take a better look at the metaphor of technical debt, to better understand what is actually meant.

    Tip: this article on Medium by David Vandegrift also tackles this topic.

    A definition of technical debt

    Right off the bat, I realised that my own understanding of technical debt was wrong. Most people seem to understand technical debt as:

    "cut a corner now, to capture short-term business value (taking on debt), and clean up later (repaying the debt)".

    I think that's wrong.

    Ward Cunningham, who coined the metaphor of technical debt, wrote:

    You know, if you want to be able to go into debt that way by developing software that you don't completely understand, you are wise to make that software reflect your understanding as best as you can, so that when it does come time to refactor, it's clear what you were thinking when you wrote it, making it easier to refactor it into what your current thinking is now.

    In some sense, this reads to me as a form of prototyping: trying out a design or architecture to see if it fits the problem space at hand. But it also incorporates the willingness to spend extra time in the future to change the code to better reflect the current understanding of the problem at hand.

    ... if we failed to make our program align with what we then understood to be the proper way to think about our financial objects, then we were gonna continually stumble over that disagreement and that would slow us down which was like paying interest on a loan.

    The misalignment of the design/architecture and the problem domain creates a bottleneck, slowing down future development.

    So I think it's clearly not about taking shortcuts for a short-term business gain.

    It is more a constant reinvestment in the future. It may temporarily halt feature work, but it should result in more functionality and features in the long run. It doesn't seem short-term focussed at all to me. And you need to write 'clean' code and do your best because it is likely that you will have to rewrite parts of it.

    These two articles by Ron Jeffries already discuss this in great detail.

    A logical error

    Reading up on the topic, I noticed something very peculiar. Somehow along the way, everything that hinders software development has become 'technical debt'.

    Anything that creates a bottleneck is suddenly put into the basket of technical debt. I started to get a strong sense that a lot of people are somehow committing a logical fallacy.

    If you have technical debt, you'll experience friction when trying to ignore it and just plow ahead. The technical debt creates a bottleneck.

    But then people reason the wrong way around: I notice a bottleneck in my software development process, so we have 'technical debt'.

    However, just because technical debt creates a bottleneck, it doesn't follow that every bottleneck is technical debt. That is the fallacy of affirming the consequent.

    I think it's this flawed reasoning that turns every perceived obstacle into technical debt2.

    Maybe I'm creating a straw man argument, but I think I have some examples that show that people are thinking the wrong way around.

    If we look at the Wikipedia page about technical debt, there is a long list of possible causes of technical debt.

    To cite some examples:

    • Insufficient up-front definition
    • Lack of clear requirements before the start of development
    • Lack of documentation
    • Lack of a test suite
    • Lack of collaboration / knowledge sharing
    • Lack of knowledge/skills resulting in bad or suboptimal code
    • Poor technical leadership
    • Last minute specification changes

    Notice that these issues are called 'technical debt' because they can have a similar outcome as technical debt. They can create a bottleneck.

    But why the hell would we call these issues technical debt?

    These issues are self-explanatory. Calling them technical debt not only seems inappropriate, it just obfuscates the cause of these problems and it doesn't provide any new insight. Even in conversations with laypeople.

    A mess is not a Technical Debt

    A blogpost by Uncle Bob with the same title3 also makes the point that a lot of issues are incorrectly labeled as 'technical debt'.

    Unfortunately there is another situation that is sometimes called “technical debt” but that is neither reasoned nor wise. A mess.

    ...

    A mess is not a technical debt. A mess is just a mess. Technical debt decisions are made based on real project constraints. They are risky, but they can be beneficial. The decision to make a mess is never rational, is always based on laziness and unprofessionalism, and has no chance of paying of in the future. A mess is always a loss.

    Cunningham's definition of technical debt shows that it's a very conscious and deliberate process. Creating a mess isn't. It's totally inappropriate to call that technical debt. It's just a mess.

    I think that nicely relates back to that earlier list from Wikipedia. Just call things out for what they actually are.

    Is quibbling over 'technical debt' as a metaphor missing the point?

    In this blogpost, Martin Fowler addresses the blogpost by Uncle Bob and argues that technical debt as a metaphor is (still) very valuable when communicating with non-technical people.

    He even introduces a quadrant:

                   Reckless                           Prudent
    Deliberate     "We don't have time for design"    "We must ship now and deal with consequences (later)"
    Inadvertent    "What's Layering?"                 "Now we know how we should have done it"

    This quadrant makes me extremely suspicious. Because in this quadrant, everything is technical debt. He just invents different flavours of technical debt. It's never not technical debt. It's technical debt all the way down.

    It seems to me that Martin Fowler twists the metaphor of technical debt into something that can never be falsified, like psychoanalysis.

    It's not 'bad code', a 'design flaw' or 'a mess', it's 'inadvertent & reckless technical debt'. What is really more descriptive of the problem?

    Maybe it's just my lack of understanding, but I fail to see why it is in any way helpful to call every kind of bottleneck 'technical debt'. I again fail to see how this conveys any meaning.

    In the end, all Fowler does is point out that bottlenecks in software development can be due to the four stages of competence.

                   Incompetence                       Competence
    Conscious      "We don't have time for design"    "We must ship now and deal with consequences (later)"
    Unconscious    "What's Layering?"                 "Now we know how we should have done it"

    I don't think we need new metaphors for things we (even laypeople) already understand.

    Does technical debt (even) exist?

    The HFT Guy goes as far as to argue that technical debt doesn't really exist; it isn't a 'real' concept.

    After decades of software engineering, I came to the professional conclusion that technical debt doesn’t exist.

    His argument boils down to the idea that what people call technical debt is actually mostly maintenance.

    So reincorporating a better understanding of the problem at hand into the code (design) is seen as an integral and natural part of software development, illustrated by the substitute metaphor of mining (alternating between digging and reinforcing). At least that's how I understand it.

    Substituting one metaphor with another, how useful is that really? But in this case it's at least less generic and more precise.

    Closing words

    Although Cunningham meant well, I think the metaphor of technical debt started to take on a life of its own. To a point where code that doesn't conform to some Platonic ideal is called technical debt4.

    Every mistake, every changing requirement, every tradeoff that becomes a bottleneck within the development process is labeled 'technical debt'. I don't think that this is constructive.

    I think my friend was right: the concept of technical debt has become bullshit. It doesn't convey any better insight or meaning. On the contrary, it seems to obfuscate the true cause of a bottleneck.

    At this point, when people talk about technical debt, I would be very sceptical and would want more details. Technical debt doesn't actually explain why we are where we are. It has become a hollow, hand-wavy 'explanation'.

    With all due respect to Cunningham, because the concept is so widely misunderstood and abused, it may be better to retire it.


    1. I discovered this image in this blogpost

    2. if you are not working on a new feature, you are working on technical debt. 

    3. I think that Uncle Bob's definition of technical debt in this article is not correct. He also defines it basically as cutting corners for short-term gain. 

    4. See again Martin Fowler's article about technical debt. 

    Tagged as : None
  2. This Blog Is Now Running on Solar Power

    July 06, 2020

    Introduction

    This blog is now running on solar power.

    I've put a solar panel on my balcony, which is connected to a solar charge controller. This device charges an old worn-out car battery and provides power to a Raspberry Pi 3b+, which in turn powers this (static) website.

    For updates: scroll to the bottom of this article.

    [Image: solar]

    Some statistics about the current status of the solar setup are shown in the sidebar to the right. The historical graph below is updated every few minutes (European time).

    [Image: solarstatus]

    Low-tech Magazine as inspiration

    If you think you've seen a concept like this before, you are right.

    The website Low-tech Magazine is the inspiration for my effort. I would really recommend visiting this site because it goes to incredible lengths to make the site energy-efficient. For example, images are dithered to save on bandwidth!

    Low-tech Magazine goes off-line when there isn't enough sunlight and the battery runs out, which can happen after a few days of bad weather.

    In January 2020, the site shared some numbers about the sustainability of the solar-powered website.

    The build

    My build is almost identical to that of Low-tech Magazine in concept, but not nearly as efficient. I've just performed a lift-and-shift of my blog from the cloud to a Raspberry Pi 3b+.

    I've built my setup based on some parts I already owned, such as the old car battery and the Pi. The solar panel and solar charge controller were purchased new. The LCD display and current/voltage sensor have been recycled from an earlier hobby project.

    [Image: controller]

    I've used these parts:

    Solar Panel                Monocrystalline 150 Watt 12V
    Battery                    12 Volt Lead Acid Battery (Exide 63Ah)
    Solar Charge Controller    Victron BlueSolar MPPT 75|10
    Voltage/Current sensor     INA260
    LCD Display                HD44780 20x4
    Computer                   Raspberry Pi 3b+
    Communications cable       VE.Direct to USB interface

    The Solar Panel

    The panel is extremely over-dimensioned because my balcony is directed towards the west, so it has only a few hours a day of direct sunlight. Furthermore, the angle of the solar panel is sub-optimal.

    My main concern will be the winter. It is not unlikely that during the winter, the panel will not be able to generate enough energy to power the Pi and charge the battery for the night.

    I have also noticed that under great sunlight conditions, the panel can easily produce 60+ Watt1 but the battery cannot ingest power that fast.

    I'm not sure about the actual brand of the panel; it was the cheapest panel I could find on Amazon for the rated wattage.

    The Solar Charger

    It's a standard solar charger made by Victron, for small solar setups (to power a shed or mobile home). I've bought the special data cable2 so I can get information such as voltage, current and power usage.

    [Image: chargecontroller]

    The controller uses a documented protocol called ve.direct. I'm using a Python module to obtain the data.
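    To give an idea of what that data looks like: in text mode, VE.Direct sends blocks of tab-separated label/value lines. The sketch below is not the module mentioned above, just a minimal illustration. It assumes the labels V, I, VPV and IL (reported in millivolts and milliamps) and PPV (watts); the function names and the sample block are made up for illustration, and checksum verification is left out.

```python
# Sketch only: parse one VE.Direct text-mode block (tab-separated label/value lines).
# The labels and units below are assumptions based on the public protocol description.

# Divisors that convert the raw integer values to volts, amps and watts.
DIVISORS = {"V": 1000, "I": 1000, "VPV": 1000, "IL": 1000, "PPV": 1}


def parse_block(block: str) -> dict:
    """Return {label: raw string value} for every field line in the block."""
    fields = {}
    for line in block.splitlines():
        if "\t" not in line:
            continue  # skip the empty lines between fields
        label, value = line.split("\t", 1)
        if label != "Checksum":  # the checksum byte is not verified in this sketch
            fields[label] = value
    return fields


def to_readings(fields: dict) -> dict:
    """Convert the known numeric fields to floats in V, A and W."""
    return {k: int(fields[k]) / d for k, d in DIVISORS.items() if k in fields}


# Made-up sample block, not captured from a real controller.
sample = "\r\nV\t12850\r\nI\t1500\r\nVPV\t18400\r\nPPV\t21\r\nIL\t300\r\nChecksum\t?"
readings = to_readings(parse_block(sample))
print(readings)
```

    In a real setup the block would be read from the serial device that the VE.Direct to USB cable exposes, and the checksum should be verified before trusting the values.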

    According to the manual, this solar charger ensures that the battery is sufficiently charged and protects against deep discharge or other conditions that could damage the battery.

    I feel that this is a very high-quality product. It seems sturdy, and the communications port that gives you access to the data (and even supports a Bluetooth dongle) is really nice.

    The controller is ever so slightly under-dimensioned for the solar panel, but since I will never get the theoretical full power of the panel due to the sub-optimal configuration, this should not be an issue.

    The battery

    In the day and age of Lithium-ion batteries it may be strange to use a Lead Acid battery. The fact is that this battery3 was free and - although too worn down for a car - can still power light loads for a very long time (days). And I could just hook up a few extra batteries to expand capacity (and increase solar energy absorption rates).

    To protect against short-circuits, the battery is protected by a fuse. This is critical because car batteries can produce so much current that they can be used for welding. They are dangerous.

    If you ever work with lead acid batteries, know this: don't discharge them below 50% of capacity, and ideally not below 70% of capacity. The deeper the discharge, the lower the life expectancy. A 100% discharge of a lead acid battery will kill it very quickly.

    You may understand why Lead Acid batteries aren't that great for solar usage, because you need to buy enough of them to ensure you never have to deep discharge them.
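    As a rough illustration of what the 50% rule means for sizing, here is a back-of-the-envelope calculation. The 63 Ah and 12 V figures come from the parts list; the 3 W load is purely an assumed number for illustration, and a worn battery will deliver far less than its rated capacity.

```python
# Back-of-the-envelope sizing sketch. The load figure is an assumption, not a measurement.

def usable_wh(capacity_ah: float, nominal_v: float, max_discharge_fraction: float) -> float:
    """Energy (Wh) you can draw before hitting the chosen depth-of-discharge limit."""
    return capacity_ah * nominal_v * max_discharge_fraction


def runtime_hours(usable_energy_wh: float, load_w: float) -> float:
    """Hours a constant load can run on the usable energy, ignoring conversion losses."""
    return usable_energy_wh / load_w


energy = usable_wh(63, 12, 0.5)      # rated capacity, discharged to 50%
hours = runtime_hours(energy, 3.0)   # 3 W is a hypothetical load
print(f"{energy:.0f} Wh usable, about {hours / 24:.1f} days at 3 W")
```

    Halving the allowed depth of discharge halves the usable energy, which is why you end up buying more lead acid capacity than the raw numbers suggest.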

    Voltage, Current and Power Sensor

    I noticed that the load current sensor of the solar charge controller was not very precise, so I added an INA260 based sensor. This sensor uses I2C for communication, just like the LCD display. It measures voltage, current and power with reasonably precise resolution.

    Using the sensor is quite simple (pip3 install adafruit-circuitpython-ina260):

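    #!/usr/bin/env python3
    import board
    import adafruit_ina260

    # Open the default I2C bus of the Raspberry Pi
    i2c = board.I2C()
    # The INA260 sits at I2C address 0x40 (decimal 64)
    ina260_L = adafruit_ina260.INA260(i2c, address=0x40)
    print(ina260_L.current)  # current in mA
    print(ina260_L.voltage)  # bus voltage in V
    print(ina260_L.power)    # power in mW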
    

    Please note that this sensor is purely optional; the precision it provides is not really required. I've used this sensor to observe that the voltage and current sensors of the solar charge controller are fairly accurate, except for that of the load, which only measures the current in increments of 100 mA.

    The LCD Display

    The display has four lines of twenty characters and uses an HD44780 controller. It's dirt-cheap and uses the I2C bus for communications. By default, the screen is very bright, but I've used a resistor on a header for the backlight to lower the brightness.

    [Image: lcddisplay]

    I'm using the Python RPLCD library (pip3 install RPLCD) for interfacing with the LCD display.

    Using an LCD display in any kind of project is very simple.

    #!/usr/bin/env python3
    from RPLCD.i2c import CharLCD

    # PCF8574 I2C backpack at address 0x27, driving a 20x4 display
    lcd = CharLCD('PCF8574', 0x27, cols=20, rows=4)
    lcd.clear()
    lcd.cursor_pos = (0, 0)  # (line, column)
    lcd.write_string("Hello")
    

    12 volt to 5 Volt conversion

    I'm just using a simple car cigarette lighter USB adapter to power the Raspberry Pi 3b+. I'm looking at a more power-efficient converter, although I'm not sure how much efficiency I'll be able to gain, if any.

    Update: I've replaced the cigarette lighter USB adapter with a buck converter, which resulted in a very slight reduction in power consumption.

    Script to collect data

    I've written a small Python script to collect all the data. The data is sent to two places:

    • It is sent to Graphite/Grafana for nice charts (serves no real purpose)
    • It is used to generate the infographic in the sidebar to the right
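    For the Graphite part, a minimal sketch could look like this. It relies on Graphite's plaintext protocol (one "path value timestamp" line per metric, by default on TCP port 2003). The metric name and the host are hypothetical, and this is not the actual collection script.

```python
# Sketch only: format and send metrics using Graphite's plaintext protocol.
import socket
import time


def graphite_line(path: str, value: float, timestamp=None) -> str:
    """Build one plaintext-protocol line: '<path> <value> <unix timestamp>\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metrics(lines, host="graphite.example", port=2003):
    """Send prepared lines over TCP. Host and port here are placeholders."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric name; nothing is sent here, the line is only formatted.
line = graphite_line("solar.battery.voltage", 12.85, 1600000000)
print(line, end="")
```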

    Because I don't want to wear out the SD card of the Raspberry Pi, the stats shown in the sidebar to the right are written to a folder that is mounted on tmpfs.
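    One way to write such a stats file, as a sketch: write to a temporary file in the same folder and rename it into place, so the web server never serves a half-written file. The file name and the directory are assumptions for illustration; on the Pi the directory would be the tmpfs mount.

```python
# Sketch only: atomically write a small JSON stats file into a (tmpfs) directory.
import json
import os
import tempfile


def write_stats(stats: dict, directory: str, filename: str = "solar.json") -> str:
    """Write stats as JSON via a temporary file, then rename it into place."""
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(stats, handle)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)  # atomic when both paths are on the same filesystem
    return final_path


# Demo with a throwaway directory standing in for the tmpfs mount.
demo_dir = tempfile.mkdtemp()
stats_path = write_stats({"battery_v": 12.85, "panel_w": 21}, demo_dir)
print(stats_path)
```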

    The cloud as backup

    When you connect to this site, you connect to a VPS running HAProxy. HAProxy determines if my blog is up and, if so, will proxy between you and the Raspberry Pi. If the battery runs out, HAProxy will redirect you to an instance of my blog on the same VPS (where it was running for years).
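    HAProxy takes care of this with its own health checks. Purely to illustrate the decision it makes, here is the same idea in a few lines of Python: try the primary backend and fall back to the backup if it cannot be reached. The host names and ports are hypothetical.

```python
# Illustration of the failover decision only; HAProxy does this itself in the real setup.
import socket


def is_reachable(backend, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    host, port = backend
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_backend(primary, fallback, probe=is_reachable):
    """Prefer the primary backend; use the fallback when the probe fails."""
    return primary if probe(primary) else fallback


PI = ("raspberrypi.example", 80)  # hypothetical address of the solar-powered Pi
VPS = ("127.0.0.1", 8080)         # hypothetical local copy of the blog on the VPS

# Demo with a fake probe that pretends the Pi is down.
print(choose_backend(PI, VPS, probe=lambda backend: False))
```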

    As you may understand, I still have to pay for the cloud VPS and that VPS also uses power. From an economic standpoint and from an ecological standpoint, this project may make little sense.

    Possible improvements

    VPS on-demand

    The obvious flaw in my whole setup is the need for a cloud VPS that is hosting HAProxy and a backup instance of my blog.

    A better solution would be to only spawn a cloud VPS on demand, when power is getting low. To move visitors to the VPS, the DNS records should be changed to point to the right IP address, which could be done with a few API calls.

    I could also follow the example of Low-tech Magazine and just accept that my blog would be offline for some time, but I don't like that.

    Switching to Lithium-ion

    As long as the car battery is still fine, I have no reason to switch to Lithium-ion. I've also purchased a few smaller Lead Acid batteries just to test their real-life capacity, to support projects like these. Once the car battery dies, I can use those to power this project.

    The rest of the network is not solar-powered

    The switches, router and modem that supply internet access are not solar-powered. Together, these devices use significantly more power, which I cannot support with my solar setup.

    I would have to move to a different house to be able to install sufficient solar capacity.

    Other applications

    During good weather conditions, the solar panel provides way more power than is required to keep the battery charged and run the Raspberry Pi.

    I've used the excess energy to charge my mobile devices. Although I think that's fun, if I just forget to turn off my lights or amplifier for a few hours, I would already waste most of my solar gains.

    I guess it's the thought that counts.

    Conclusion

    In the end, it was a fun hobby project for me to realise. I want to thank Low-tech Magazine for the idea; I had a lot of fun creating my (significantly worse) copy of it.

    If you have any ideas on how to improve this project, feel free to comment below or email me.

    This blog post was featured on Hacker News and the Pi 3b+ had no problems handling the load.

    Updates

    Car battery died

    After about two weeks the old and worn-down car battery finally died. Even after a whole day of charging, the voltage of the battery dropped to 11.5 Volts in about a minute. It would no longer hold a charge.

    I have quite a lot of spare 12 volt 7Ah batteries that I can use as a replacement. I'm now using four of those batteries (older ones) in parallel.

    Added wall charger as backup power (October 2020)

    As we approached fall, the sun started to set earlier and earlier. The problem with my balcony is that I only have direct sunlight from 16:00 until sunset. My solar panel was therefore unable to keep the batteries charged.

    I even added a smaller 60 watt solar panel I used for earlier tests in parallel to gain a few extra watts, but that didn't help much.

    It is now at a point where I think it's reasonable to say that the project failed in my particular case. However, I do believe it would still be fine if I could capture the sun during the whole day (if my balcony wasn't in such a bad spot, the solar panel would be able to keep up).

    As the batteries were draining I decided to implement a backup power solution, to protect the batteries. It's bad for lead acid batteries to be in a discharged state for a long time.

    Therefore, I'm now using a battery charger that is connected to a relay that my software is controlling. If the voltage drops below 12.00 volts, it will start charging the batteries for 24 hours.
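    The control logic can be sketched as a tiny state machine. This is an illustration of the rule described above (below 12.00 V, charge for 24 hours), not the actual script; the class name is made up, and switching the real relay (for example through a GPIO pin) is left out.

```python
# Sketch of the backup-charger rule: below the threshold, charge for a fixed period.

CHARGE_THRESHOLD_V = 12.00
CHARGE_DURATION_S = 24 * 3600


class BackupCharger:
    """Decides whether the charger relay should be switched on."""

    def __init__(self):
        self.charging_until = None  # timestamp at which the current charge run ends

    def update(self, voltage: float, now: float) -> bool:
        """Return True if the relay should be on at time `now` (seconds)."""
        if self.charging_until is not None and now < self.charging_until:
            return True  # a 24-hour charge run is still in progress
        self.charging_until = None
        if voltage < CHARGE_THRESHOLD_V:
            self.charging_until = now + CHARGE_DURATION_S
            return True
        return False


charger = BackupCharger()
print(charger.update(12.4, now=0))      # healthy voltage, relay stays off
print(charger.update(11.9, now=60))     # below threshold, charging starts
print(charger.update(13.1, now=3660))   # still within the 24-hour run
```

    Charging for a fixed period rather than until a voltage target avoids the relay flapping on and off as the voltage hovers around the threshold.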


    1. The position of the panel is not optimal, so I will never get the panel's full potential. 

    2. You don't have to buy the cable supplied by Victron, it's possible to create your own. The cable is not proprietary. 

    3. It failed. Please read the update at the bottom of this article. 

    Tagged as : solar
  3. Don't Be Afraid of RAID

    May 22, 2020

    Introduction

    I sense this sentiment on the internet that RAID is dangerous, that your RAID array is almost certain to fail during a rebuild, because hard drives have become so large.

    I think nothing is further from the truth and I would like to dispel this myth.

    Especially for home users and small businesses, RAID arrays are still a reliable and efficient way of storing a lot of data in a single place.

    Perception of RAID reliability

    There are many horror stories to be found on the internet about people at home losing their RAID array. These stories may have contributed to a negative attitude towards RAID in general.

    You may accuse me of victim blaming, but in many cases, I do wonder if those incidents were due to user error1, due to bad luck or due to RAID itself actually causing problems. And there is a bias in reporting: you won't hear from the countless people who have no issues.

    In any case, the damage is done, but I still think (software) RAID is perfectly fine.

    The myth about the Unrecoverable Read Error (URE)

    I think the trouble started with this terrible article on ZDNET from 2007.

    In this article, it's argued that as drives become bigger, but not more reliable, you will see more unrecoverable read errors (UREs). More capacity means more sectors, so more risk of one of them going bad.

    A URE is an incident where the hard drive can't read a sector5. For old people like me, that sounds like the definition of a 'bad sector'. The article argues that on average you would encounter a URE for every 12.5 TB of data read.

    By the logic of the ZDNET article, just copying all data from a 14 TB drive would probably be impossible, because you would probably hit a URE / bad sector before you finish your copy.

    This would be a very big issue for RAID arrays. A RAID array rebuild consists of reading the contents of all remaining drives in their entirety2. So, by that logic, you are all but guaranteed to hit a URE during a RAID rebuild.
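    To make that argument concrete, here is the arithmetic if you take the spec figure at face value. One URE per 12.5 TB read corresponds to one error per 10^14 bits; the sketch additionally assumes that errors strike each bit independently, which is an assumption of this illustration.

```python
# Worked example of the claimed risk, taking the spec rate of 1 error per 1e14 bits literally.
import math

URE_RATE_PER_BIT = 1e-14  # equivalent to one URE per 12.5 TB read


def p_at_least_one_ure(terabytes_read: float, rate: float = URE_RATE_PER_BIT) -> float:
    """Probability of hitting at least one URE while reading the given amount of data."""
    bits = terabytes_read * 1e12 * 8
    # 1 - (1 - rate)^bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-rate))


for tb in (12.5, 14, 28):
    print(f"{tb} TB read: {p_at_least_one_ure(tb):.0%} chance of at least one URE")
```

    On paper that works out to roughly 63% for 12.5 TB and roughly 67% for 14 TB, which is why the spec figure sounds so alarming.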

    The good news is that you don't have to worry about any of this. Because it is not true.

    Hard drives are not that unreliable in practice. On the contrary. They are remarkably reliable, I would say. Just look at the Backblaze drive statistics6.

    The prediction of the infamous ZDNET article has not come true. The URE specification for hard drives describes a worst-case scenario and seems to be more about marketing (a way to differentiate enterprise drives from consumer drives) than about reality.

    If the ZDNET article were true, I myself should have encountered many UREs during the many RAID array scrubs/patrol reads that I have completed across various RAID arrays.

    RAID has never stopped working and is still going strong.

    [Image: Card]

    Scrubbing protects against the impact of bad sectors

    When a drive fails in a RAID array that can only tolerate one drive failure, it's very important that all remaining drives won't encounter any read errors. Because redundancy is lost, any read errors due to bad sectors could mean that the entire array is lost or at least some files are corrupted7.

    Every RAID array supports 'scrubbing'. It's a process where every sector of the RAID array is read, which in effect causes all sectors of all hard drives to be read.

    A scrub is a process to check for bad sectors in advance. If bad sectors are found on a hard drive, the drive can be replaced so it will not cause problems during a potential future rebuild. Replacing the drive itself will cause a rebuild, but assuming the scrub didn't find any other drives with bad sectors, that rebuild will be fine.

    A RAID array that doesn't undergo a regular scrub is a disaster waiting to happen. Bad sectors may be building up on one of the other drives and when a drive actually fails, the entire array may be lost because of the undetected bad sectors on (one of) the remaining drives.

    If you want to store data in a reliable way on a RAID array, you need to ensure the array is scrubbed periodically. And even if you don't use RAID, I would recommend running a long SMART test once a month against every hard drive you own.

    By default, a Linux software RAID array is scrubbed once a week on Ubuntu. For details, look at the contents of /etc/cron.d/mdadm.
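    If you want to start a scrub by hand, the Linux md driver exposes a sync_action file in sysfs; writing the word check to it starts a read-only consistency check. A minimal sketch, assuming a hypothetical array name and root privileges:

```python
# Sketch only: trigger a scrub ('check') of a Linux md array through sysfs. Needs root.
from pathlib import Path


def start_md_check(md_device: str, sysfs_root: str = "/sys/block") -> Path:
    """Write 'check' to <sysfs_root>/<md_device>/md/sync_action and return that path."""
    action_file = Path(sysfs_root) / md_device / "md" / "sync_action"
    action_file.write_text("check\n")
    return action_file


# Usage on a real system (the array name is an example):
#   start_md_check("md0")
# Progress then shows up in /proc/mdstat.
```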

    If you use ZFS on Linux, your array is automatically scrubbed on the second Sunday of every month if you run Ubuntu.

    NAS vendors like Synology or QNAP have data scrubs enabled by default. Consult the manual of your particular NAS to adjust the frequency. I would recommend scrubbing at least once a month and at night.

    Why is RAID 5 considered harmful?

    Frankly, I wonder that too.

    I notice a lot of people on the internet claiming that you should never use RAID 5 but I disagree. It all depends on the circumstances. Finding a balance between cost and risk is important.

    This page dating back to 2003 advocated not to use RAID 5 but that's focused on the enterprise environment and even there I see its uses.

    For small RAID arrays with five or fewer drives I think RAID 5 is still a great fit. Especially if you run a small 4-bay NAS it would make total sense to use RAID 5. You get a nice balance between capacity and the cost of availability.

    It's not really recommended to create larger RAID 5 arrays. Compared to a single drive, a RAID array with 8 drives is roughly 8 times as likely to experience a drive failure: you multiply the risk of a single drive failing by eight. With larger arrays, double drive failure becomes a serious risk.
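    The "times eight" rule of thumb is an approximation that holds well as long as the failure rate per drive is small. A quick check, using a made-up annual failure rate of 2% per drive and assuming drives fail independently:

```python
# Sketch: chance that at least one of n drives fails, for an assumed per-drive failure rate.

def p_any_drive_fails(n_drives: int, p_single: float) -> float:
    """1 - (1 - p)^n, assuming drive failures are independent."""
    return 1.0 - (1.0 - p_single) ** n_drives


P_SINGLE = 0.02  # hypothetical annual failure rate per drive, for illustration only

exact = p_any_drive_fails(8, P_SINGLE)
rule_of_thumb = 8 * P_SINGLE
print(f"exact: {exact:.1%}, rule of thumb: {rule_of_thumb:.1%}")
```

    With these numbers the exact figure comes out slightly below the simple multiplication, so multiplying by the number of drives is a reasonable, slightly pessimistic estimate.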

    This is why it's really recommended to use RAID 6 for larger RAID arrays, because RAID 6 can tolerate two simultaneous drive failures. I've used RAID 6 in the past and I use RAIDZ2 (ZFS) as the basis for my current NAS.

    I also run an 8-drive RAID 5 in one of my servers that hosts not so important data that I still want to keep around and would rather not lose, but not at every cost. It's all about a balance between risk and cost. Please also read the postscript of this post, you will like it.

    It is true that during a rebuild, hard drives are strained more, but unless the RAID array is also in heavy use, the load on the drive isn't that big: the data is read sequentially, which is quite easy on the drives.

    RAID rebuild performance is mostly determined by the size of the drives and not by the number of drives in the RAID array3.

    Years ago I ran a 20-drive RAID 6 based on 1 TB drives and it did a rebuild in 5 hours. Recently I tested a rebuild of 8 drives in RAID 5 (using the same drives) and it also took almost 5 hours (4H45M).
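    These two rebuild times translate into nearly the same sustained throughput per drive, which is what you would expect if the replacement drive is the limiting factor. A small calculation, assuming 1 TB means 10^12 bytes and that a rebuild writes the whole replacement drive:

```python
# Sketch: average per-drive throughput implied by a rebuild time.

ONE_TB = 1e12  # bytes, decimal terabyte (an assumption about how the drives are sized)


def rebuild_throughput_mb_s(drive_bytes: float, hours: int, minutes: int = 0) -> float:
    """Average MB/s needed to fill one drive in the given time."""
    seconds = hours * 3600 + minutes * 60
    return drive_bytes / seconds / 1e6


raid6_20_drives = rebuild_throughput_mb_s(ONE_TB, 5)
raid5_8_drives = rebuild_throughput_mb_s(ONE_TB, 4, 45)
print(f"20-drive RAID 6: {raid6_20_drives:.1f} MB/s")
print(f"8-drive RAID 5:  {raid5_8_drives:.1f} MB/s")
```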

    The RAID write hole

    The RAID 5/6 'write hole' is often mentioned as something you should be afraid of.

    Parity-based RAID like RAID 5 and RAID 6 may be affected by an issue called the 'write hole'. To (over)simplify: if a computer experiences a sudden power failure, a write to the RAID array may be interrupted. This could cause a partial write to the RAID array, leaving it in an inconsistent state.
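    A toy example with XOR parity shows why an interrupted write is a problem. This is a deliberately simplified model (two data blocks and one parity block, as in a three-drive RAID 5 stripe), not how any particular RAID implementation lays out data.

```python
# Toy model of the write hole: a stripe with two data blocks and one XOR parity block.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


d0, d1 = b"AAAA", b"BBBB"
parity = xor_bytes(d0, d1)

# Consistent stripe: if the drive holding d1 dies, d1 can be rebuilt from d0 and parity.
assert xor_bytes(d0, parity) == d1

# Power fails after the new d0 hit the disk but before parity was updated.
d0_new = b"CCCC"
stale_parity = parity

# If the drive holding d1 now fails, the rebuild silently produces garbage.
rebuilt_d1 = xor_bytes(d0_new, stale_parity)
print(rebuilt_d1 == d1)
```

    The stripe looks fine until a drive fails; only then does the stale parity turn into corrupted data, which is what makes the write hole nasty.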

    As a side note, I would always recommend protecting your NAS with a UPS (battery backup) so your server can shut down in a clean way, before power is lost as the battery runs out.

    ZFS RAIDZ is not affected by the 'write hole' issue, because it is copy-on-write: it never overwrites live data in place, so an interrupted write cannot leave existing data and parity out of sync4.

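    Linux MDADM software RAID limits the impact of the 'write hole' phenomenon by using a write-intent bitmap (which is enabled by default4): after an unclean shutdown, the regions that were being written are resynced. To close the hole completely, MDADM also supports a write journal or a partial parity log.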

    Hardware RAID is also protected against this by using a battery backup for the cache memory. The data in the cache memory is written to disk as soon as the computer is powered back on.

    Set up alerting if you care about your data

    I think that a lot of RAID horror stories are due to the fact that people may never notice any problems until it is too late because they never set up any kind of alerting (by email or other).

    Ideally, you would also make sure your system monitors the SMART data of your hard drives and alerts you when critical numbers start to rise (Reallocated Sector count and Current Pending Sector count).
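    As a sketch of what such a check could look like: the function below scans the attribute table that smartctl -A prints and reports the two counters mentioned above when their raw value is above zero. The sample text is made up for illustration, and actually sending the alert (email or otherwise) is left out.

```python
# Sketch only: flag worrying SMART counters in the text output of `smartctl -A`.

WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector"}


def smart_warnings(smartctl_output: str) -> dict:
    """Return {attribute name: raw value} for watched attributes with a raw value > 0."""
    warnings = {}
    for line in smartctl_output.splitlines():
        cols = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(cols) >= 10 and cols[1] in WATCHED:
            raw_value = int(cols[9])
            if raw_value > 0:
                warnings[cols[1]] = raw_value
    return warnings


# Made-up sample in the layout smartctl uses for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       312
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21034
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
"""

print(smart_warnings(SAMPLE))
```

    In practice the smartd daemon that ships with smartmontools can do this monitoring for you; the point of the sketch is only to show how little is needed to notice a drive going bad.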

    This is also a moment of personal reflection. Do you run a RAID array? Did you set up alerting? Or could your RAID array be failing this very moment and you wouldn't know?

    Anyway: I think a lack of proper alerting is a nice way of getting into trouble with RAID, but that's not on RAID. Any storage solution that is not monitored is just a disaster waiting to happen.

    Why people choose not to use RAID

    If a RAID array fails, all data is lost. Some people are not comfortable with this risk. They would rather lose the contents of some drives, but not all of them.

    Solutions like Unraid and SnapRAID use one or more dedicated hard drives to store redundant (parity) data. The other hard drives are formatted with your filesystem of choice and can be accessed as normal hard drives. Although I have no experience with this product, StableBit DrivePool seems to work in a similar manner.

    If you have six hard drives, thus five data drives and one parity disk, the loss of two drives would result in data loss, as with RAID 5. However, the data on the remaining four drives would still be intact. The data loss is limited to the failed drives: one drive's worth of data if the parity disk is one of the two, two drives' worth otherwise.

    The 'all-or-nothing' risk associated with regular software RAID is thus mitigated. I myself don't think those risks are that large, but Unraid and SnapRAID are popular products and I think they are reasonable alternatives.

    Mergerfs could also be an interesting option, although it only supports mirroring.

    Backups are still important

    Storing your data on any kind of RAID array is never a substitute for a backup.

    You should still copy your data to some other storage if you want to protect your data. You may choose to only make a backup of a subset of all of the data, but at least you take an informed risk.

    Evaluation

    I hope I have demonstrated why RAID is still a valid and reliable option for data storage.

    Feel free to share your own views in the comments.

    P.S.

    I ran a scrub on my 8-disk RAID 5 array (based on 2 TB drives) as I was writing this article. My servers are only powered on when I need them and while powered off, it's easy for them to miss their periodic scrub window.

    So as to practice what I preach I ran a scrub. Lo and behold, one of the drives was kicked out of my Linux software RAID array. Don't you love the irony?

    sd 0:0:4:0: [sde] tag#29 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    sd 0:0:4:0: [sde] tag#29 Sense Key : Medium Error [current] 
    sd 0:0:4:0: [sde] tag#29 Add. Sense: Unrecovered read error
    sd 0:0:4:0: [sde] tag#29 CDB: Read(10) 28 00 9f 42 9e 30 00 04 00 00
    print_req_error: critical medium error, dev sde, sector 2671943216
    

    Followed by:

    md/raid:md6: Disk failure on sde, disabling device.
    md/raid:md6: Operation continuing on 7 devices.
    

    The drive was clearly kicked out because the drive encountered bad sectors. A quick check of the SMART data revealed more than 300 sectors were already remapped, but the data stored in them could not be recovered, causing read errors.

    This drive is clearly done, although it was still operational.

    After swapping this defective drive with a spare replacement, I started the rebuild process, which took four hours and twenty minutes. My RAID 5 has been rebuilt and is now perfectly fine.

    If an event like this doesn't drive the point home that scrubs are important, I don't know what will.


    1. Sometimes I read what hardware people use for storage and I think about this quote by John Glenn: ‘I felt exactly how you would feel if you were getting ready to launch and knew you were sitting on top of 2 million parts — all built by the lowest bidder on a government contract.’ 

    2. ZFS works differently, it only reads the sectors containing actual data. 

    3. ZFS rebuilds or 'resilvers' become slower as you add more drives to a RAIDZ(2/3) VDEV, it seems. I'm not sure this is still the case with more recent ZFS versions. 

    4. Both ZFS and MDADM will take a performance hit by using a log/bitmap. Both solutions support using an SSD to accelerate the log/bitmap to remove this performance hit. Most home users probably won't need this. 

    5. The smallest unit of storage a drive can store, often 4K or 512 bytes for older, smaller drives. 

    6. Those hard drives live in a datacenter with a conditioned environment, which you probably don't have at home. But as long as you keep the temperature of your hard drives within limits, I don't think it matters that much. 

    7. ZFS is both a RAID solution and a filesystem in one and can tell you exactly which file is affected. A nice feature. 

    Tagged as : storage RAID
