Using the watchdog timer to fix an unstable Raspberry Pi

I’m using a Raspberry Pi to take time-lapse photos using the motion daemon. The camera I use, a generic “GEMBIRD” (VID:PID 1908:2311), works out of the box but causes the Pi to lock up from time to time. After replacing all the polyfuses with either a normal quick-blow 1A fuse (for F3) or wire (for the USB fuses), the freezing still happened, even with different power adapters with more than enough current capacity.

I’ve surmised the problem is the driver causing a kernel panic. This means the Pi won’t respond on the network and needs a hard reset to get it working again. I’ve not had time to diagnose the fault, but the BCM2708 has a watchdog timer that allows the freezing problems to be worked around.

After following the above tutorial, the watchdog daemon wasn’t using the hardware watchdog, so it was unable to reboot the Pi in the event of a kernel panic or other mystery hardware freeze. The cause of the problem was the default heartbeat interval (60s); setting it to 15s fixed it. To test that it works, kill the watchdog daemon: if the hardware timer is enabled, the system will reboot.
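Before running the kill test, it can help to read the interval back from the daemon's config. This is only a sketch: it assumes the interval is the watchdog-timeout option in /etc/watchdog.conf (the post does not say where the value was set) and treats 15s as the upper limit. The helper names wd_timeout and wd_timeout_ok are made up for illustration.

```shell
#!/bin/sh
# Print the watchdog-timeout value from a watchdog.conf-style file.
# Prints nothing if the option is absent or commented out.
# ASSUMPTION: the interval is set via "watchdog-timeout = N".
wd_timeout() {
    conf=${1:-/etc/watchdog.conf}
    awk -F= '/^[ \t]*watchdog-timeout[ \t]*=/ {
        gsub(/[ \t]/, "", $2)   # strip spaces around the value
        print $2
        exit                    # first active setting wins
    }' "$conf"
}

# Succeed only if a timeout is set and does not exceed the limit
# ($2, defaulting to the 15s used in this post).
wd_timeout_ok() {
    t=$(wd_timeout "$1")
    [ -n "$t" ] && [ "$t" -le "${2:-15}" ]
}
```

Something like wd_timeout_ok /etc/watchdog.conf && echo "timeout OK" then gives a quick yes/no before deliberately killing the daemon.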

Rebooting the system when it hangs is all well and good, but the freezing (or rebooting) could cause corruption on the SD card. To guard against this, the root partition can be kept read-only, so in the event of a crash the system should still remain bootable. A few daemons and other things need to be able to write to parts of the filesystem (/var, /home, parts of /etc). I followed the instructions here and here; the steps in full are below. They assume the new partition has already been created and mounted on /persistent, and that a backup image of the SD card has been taken. I ran the commands as root (sudo -i) rather than with sudo to avoid writes to /home while moving it.

1. Move/copy the /home, /var and /media directories over to /persistent:

mv /home /persistent
cp -a /var /persistent
mv /media /persistent

2. Recreate mount points for each directory:

mkdir -p /home /var /media

3. Add bind entries for each mount point in /etc/fstab

nano /etc/fstab

Add lines:

/dev/mmcblk0p3  /persistent     ext4    defaults,noatime    0       0
/persistent/media /media        none    bind       0       0
/persistent/home  /home         none    bind       0       0
/persistent/var   /var          none    bind       0       0

4. Link /etc/mtab to /proc/self/mounts:

rm /etc/mtab
ln -s /proc/self/mounts /etc/mtab

5. Move /etc/network/run to /dev/shm

rm -rf /etc/network/run
dpkg-reconfigure ifupdown

6. Delete the contents of /var

rm -r /var/*

7. Mount the replacement partitions:

mount -a
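The bind entries in step 3 all follow the same pattern, so they can be generated rather than typed by hand. A minimal sketch, assuming the same /persistent layout as above; bind_entries is a hypothetical helper name, and it only prints lines without touching /etc/fstab itself.

```shell
#!/bin/sh
# Print one fstab bind-mount line per directory name given.
# $1 is the persistent mount point; the remaining arguments are
# top-level directory names without the leading slash.
bind_entries() {
    base=$1
    shift
    for dir in "$@"; do
        printf '%s/%s /%s none bind 0 0\n' "$base" "$dir" "$dir"
    done
}
```

Running bind_entries /persistent media home var prints the three bind lines, which can be checked by eye before being appended to /etc/fstab.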

After these steps, everything should be OK. The current root filesystem is still writeable, so packages can still be installed and config files edited. The new /var partition worked, so I rebooted the Pi to see if it still came up.

The next test was to remount the / partition as read-only and see if everything still worked. Running mount -r -o remount / worked without any errors, suggesting nothing was still trying to write to the partition. After waiting a little while to see if anything popped up in /var/log/messages, I edited /etc/fstab to add “,ro” to the entry for / and rebooted to make / read-only by default.
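To confirm that the remount (or the fstab change after a reboot) actually took effect, the mount options can be read back from the kernel's mount table. This is a sketch rather than anything from the original setup: is_readonly is a hypothetical helper, and it assumes the usual /proc/mounts layout of device, mount point, type, options.

```shell
#!/bin/sh
# Succeed if the given mount point is mounted with the "ro" option.
# $1: mount point, $2: mounts table (defaults to /proc/mounts).
is_readonly() {
    awk -v mp="$1" '
        # / can be listed more than once (e.g. rootfs, then the real
        # device), so remember the options of the last matching entry.
        $2 == mp { opts = $4 }
        END {
            if (("," opts ",") ~ /,ro,/) exit 0
            exit 1
        }
    ' "${2:-/proc/mounts}"
}
```

For example, is_readonly / && echo "root is read-only" should print the message once / has been remounted, while is_readonly /persistent should fail because that partition stays writeable.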

These changes made the system more likely to survive random reboots, but it would still periodically lock up. I found that lockups only happened when motion was reading from the camera. These lockups occurred just after a reboot, when the motion daemon started. The cause was watchdog starting after motion, which left a small window in which a lockup could happen without being caught by the watchdog timer.

To fix this, I set motion’s init script to depend on all services, and changed watchdog to depend only upon wd_keepalive. I changed /etc/init.d/motion to add $all to the # Required-Start: directive, and /etc/init.d/watchdog to replace $all with wd_keepalive. After editing the init scripts, they have to be refreshed by deleting and re-adding them with chkconfig: sudo chkconfig --del motion && sudo chkconfig --add motion and sudo chkconfig --del watchdog && sudo chkconfig --add watchdog. This shortens the window in which motion can start (and freeze the system) before the watchdog has had a chance to start.
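A quick way to double-check the edited headers is to read the Required-Start line back out of each script. The sketch below assumes the standard LSB comment header ("# Required-Start: ...") in the init scripts; required_start and starts_after are hypothetical helper names.

```shell
#!/bin/sh
# Print the dependencies named on the Required-Start line of an
# init script ($1), with the "# Required-Start:" prefix removed.
required_start() {
    awk '/^#[ \t]*Required-Start:/ {
        sub(/^#[ \t]*Required-Start:[ \t]*/, "")
        print
        exit
    }' "$1"
}

# Succeed if script $1 lists dependency $2 in Required-Start.
starts_after() {
    for dep in $(required_start "$1"); do
        [ "$dep" = "$2" ] && return 0
    done
    return 1
}
```

After the edits described above, starts_after /etc/init.d/motion '$all' and starts_after /etc/init.d/watchdog wd_keepalive should both succeed, and starts_after /etc/init.d/watchdog '$all' should fail.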

It’s more of a cheap kludge than a fix, but it works.