LDAP Disaster

On Monday afternoon I updated various packages on our Fedora Core 5 server using yum. This has in the past caused one or two little tragedies. Really I should know better and do such updates over the weekend but of course I went ahead all gung-ho.

The vital, mission-critical thing that died this time was the OpenLDAP server, which runs authentication across all the CETIS sites. No-one could get in to edit the wikis or blogs or a whole bunch of other services, which is pretty disastrous really.

I scratched my brain for all of Tuesday and even a few hours on Monday night – trying to figure out what had happened. Basically it seemed that all the data in the OpenLDAP database had disappeared. I could connect to the server but it was unable to list the nodes of the directory. I tried a few command-line diagnostic tools: slapcat produced absolutely no output, and slapd_db_recover happily recovered something but made no difference whatsoever. Doing an ldapsearch (which should dump the whole dataset) did the following:

[root@arwen ldap]# ldapsearch -x
# extended LDIF
#
# LDAPv3
# base with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#

# search result
search: 2
result: 32 No such object

I started off thinking that my config files were knackered – so I pored over ldap.conf and slapd.conf for hours – and nothing changed. I did notice that there was an /etc/ldap.conf as well as an /etc/openldap/ldap.conf. I compared the two and removed the one loose in /etc as it seemed wrong. Didn’t help.

Next I got drawn down a big red-herring as I noticed messages in the logs when starting slapd:

Jun 13 12:01:32 arwen slapd[18004]: sql_select option missing
Jun 13 12:01:32 arwen slapd[18004]: auxpropfunc error no mechanism available
Jun 13 12:01:32 arwen slapd[18004]: auxpropfunc error invalid parameter supplied

Several sources claimed that this was to do with permission problems and SASL – but it turned out that it was completely unrelated to my actual problem and could be safely ignored. Again I wasted loads of time reading about SASL and chmodding files everywhere. I suppose it might become important were I ever to decide to actually use SASL with the directory.

So where had my data gone? This morning while on a conference call I was idly noodling through the database files in /var/lib/ldap and noticed a directory called rpmorig which I hadn’t really been through. I looked and I saw and I suddenly realised that there were a lot more .bdb files in there than there were in the parent directory and that they were full of data. The penny dropped. yum had kindly backed up all my data into this directory and replaced the working files with fresh empty ones. I moved the rpmorig directory into the place of /var/lib/ldap, restarted slapd and behold EVERYTHING WORKS AGAIN.
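For anyone in the same hole, the swap could be sketched roughly as below. This is a sketch under assumptions rather than exactly what I typed: the function name and the name of the set-aside directory are made up for illustration, and it takes the directory as a parameter so it can be tried on a copy first. It keeps the fresh empty files instead of deleting them, just in case.

```shell
#!/bin/bash
# Sketch: put the data that the package update saved in "rpmorig" back in place.
# restore_rpmorig is a hypothetical helper name, not part of OpenLDAP or yum.
restore_rpmorig() {
    local ldap_dir="$1"                        # e.g. /var/lib/ldap
    local saved="$ldap_dir/rpmorig"
    local stamp
    stamp="$(date +%Y%m%d%H%M%S)"
    local aside="${ldap_dir}.empty-${stamp}"   # where the fresh empty files end up

    [ -d "$saved" ] || { echo "no rpmorig directory in $ldap_dir" >&2; return 1; }

    # Lift the saved data out first, because it lives inside the directory
    # that is about to be moved aside.
    mv "$saved" "${ldap_dir}.restore" || return 1
    mv "$ldap_dir" "$aside" || return 1
    mv "${ldap_dir}.restore" "$ldap_dir" || return 1

    # Print where the set-aside directory went.
    echo "$aside"
}
```

On the real server this would be run as restore_rpmorig /var/lib/ldap with slapd stopped beforehand and restarted afterwards. It is also worth checking that the file ownership still matches whatever user slapd runs as, and that slapcat now produces output.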

I curse whoever put together that yum package.

Another dead server moment

Our Fedora Core 5 server (arwen) got rather stuck today: after a hard restart it failed to reboot, hanging each time on “Starting System Message Bus”. A bit of googling later, it seemed that the problem originated from the server’s authentication settings.

Several days ago I had been fiddling with the server’s authentication settings (via the GUI), hoping that it might be able to authenticate against its own LDAP directory, and had just left the settings sitting there when nothing seemed to be happening. Clearly it was a very bad idea and I should have left it alone.

As per the various posts on FedoraForum I used the following procedure.

Persuade the server to boot into runlevel 1:
Press e when the GRUB menu comes up, select the 2nd line and press e again to edit it, add a 1 to the end of the line, then press Enter and b to boot.

Edit /etc/nsswitch.conf, removing the references to ldap so that:
passwd: files ldap
becomes
passwd: files

Edit /etc/sysconfig/authconfig
Changing USELDAP=yes to USELDAP=no.
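The two edits above can also be done with sed rather than by hand, which is handy in single-user mode. What follows is only a sketch: the function names are invented for illustration, each call leaves a .bak copy next to the file it changes, and comment lines in nsswitch.conf are left alone.

```shell
#!/bin/bash
# Sketch: remove "ldap" as a lookup source from an nsswitch.conf-style file.
# strip_ldap_sources is a hypothetical helper name.
strip_ldap_sources() {
    local file="$1"    # normally /etc/nsswitch.conf
    # Skip comment lines; elsewhere drop the word "ldap" when it stands alone.
    sed -i.bak -E '/^[[:space:]]*#/!s/[[:space:]]+ldap([[:space:]]|$)/\1/g' "$file"
}

# Sketch: flip USELDAP=yes to USELDAP=no in an authconfig-style file.
# disable_ldap_auth is a hypothetical helper name.
disable_ldap_auth() {
    local file="$1"    # normally /etc/sysconfig/authconfig
    sed -i.bak 's/^USELDAP=yes$/USELDAP=no/' "$file"
}
```

Usage would be strip_ldap_sources /etc/nsswitch.conf followed by disable_ldap_auth /etc/sysconfig/authconfig, then a look at both files before rebooting.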

With this done the system booted properly – however the story was not over as the instance of OpenLDAP itself complained about the database being corrupted.

This was fixed very swiftly using a recipe from Harold’s technotes:

# /usr/sbin/slapd_db_recover -v -h /var/lib/ldap
# service ldap start

With that done the whole thing was behaving properly again. Phew.

Bolton server switcharound

Yesterday’s work on the Bolton firewalls and routers resulted in the “arwen” server (tencompetence and designforlearning) disappearing from the internet for much of the day.

There were several colliding issues:

1) The Bolton VPN system had been taken out by the changes so I couldn’t get in remotely to figure out what was going on (I was at the CAA conference in Loughborough). Once informed, the Bolton networks team made it a priority to get it working again.

2) The subnet configuration of the server needed re-applying – Mark Power performed valiantly and got the necessary settings changed from the office in Bolton. This was fixed around lunch-time. The server was now working but only within the Bolton network.

3) The final issue was that Mark Williamson had overlooked the arwen server when configuring the new firewall rules – the other cetis servers were fine though. Once this was discovered, the firewall rules were corrected and the server became publicly visible again at around 5pm.

It was a shame that the tencompetence site went down for the second time in as many weeks, but all in all I think this was pretty much unavoidable given the complexity of the changes and the new infrastructure. I think that both Marks deserve a big thank-you for tackling the matter swiftly and seriously.

From A to B

Moving house has been a reasonably smooth operation; however, the move of the Tencompetence site threw up a nasty failure at the same time as I was out of action.

It turned out that ELROND had performed an automatic software update and rebooted itself. ARWEN (which is a virtual server sitting on top of Elrond) had then refused to come back up as VMware had declared itself out-of-date. As such the Tencompetence site and several bits of stuff aimed at the Design for Learning support site (the phpBB forum mainly) had gone off.

Then there was the issue of exactly what I’m supposed to be doing with the Design for Learning stuff – the request was for “something to show JISC” and it had been verbally agreed that I’d set up a forum and a wiki. It was supposed to be done by Friday 16th… I’d got the forum up (but the server was down) and it took several phone calls and me being dragged in to the office on the Thursday to get as far as a new holding page. On Monday this week I got the MediaWiki installation sorted and have since re-cast the entry page as a redirect into the wiki.

The D4L stuff is now all here: http://web.cetis.org.uk/designforlearning

Notes for internal purposes are on the intrawiki: Design for Learning Support

Moving Tencompetence

The last few days have involved quite a lot of work moving the Drupal instance intended for the new Tencompetence site from the “Elrond” server on to a new and shiny virtual server called “arwen” with a view to it going live for production use. This is the same server which we are intending to use for the new cetis site, so much of the software installed and buttons tweaked will be of long-term benefit to the whole organisation. There is a page in the internal cetis wiki detailing exactly how the machine has been set up.

Mostly this has gone smoothly but I’ve had a few gotchas along the way. The most catastrophic was that after the move, when it looked like everything was working, it started throwing errors when adding new content. It turned out that it had lost all the “Auto Increment” settings in the database and I had to go through table-by-table and fix them. Also (and this will hopefully not matter too much) the collation has got set to Latin-1 Swedish when it should be UTF-8 General. I’m going to try and fix this in a global change later in the week. Clearly I must have missed ticking a couple of boxes when exporting the database from Elrond. I really hate file-encoding issues…
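A global collation change could be sketched along these lines. It assumes the database is MySQL, and the function name is made up for illustration. The function only prints SQL, one ALTER TABLE statement per table name read from standard input, and executes nothing, so the result can be read through before being fed to the database.

```shell
#!/bin/bash
# Sketch: read table names on stdin, print one ALTER TABLE statement per table.
# Assumes MySQL. collation_fix_sql is a hypothetical helper name.
collation_fix_sql() {
    local table
    while IFS= read -r table; do
        [ -n "$table" ] || continue    # skip blank lines
        printf 'ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;\n' "$table"
    done
}
```

The table list could come from something like mysql -N -e 'SHOW TABLES' followed by the database name, piped into collation_fix_sql and saved to a file. Converting character sets rewrites the stored text, so this is one to try on a copy of the database first.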

UPDATE: There was an issue with attachments – some funkiness with mimetypes was causing Word documents to get an extra .txt appended to the filename. This was fixed by adding the following lines to the bottom of php.ini:
[mime_magic]
mime_magic.debug = Off
mime_magic.magicfile = "/usr/share/file/magic.mime"

Also, cron jobs were not happening, causing the aggregator to fail. Two things needed doing. One was adding a file to /etc/cron.hourly/ containing:
#!/bin/bash
curl -s http://localhost/tenc/cron.php > /dev/null
exit 0
…which calls the script on the hour. However there was also a problem causing PHP’s fsockopen function to fail with a “permission denied” message. After a lot of hunting around it turned out that the SELinux security policy of the host machine (Fedora Core 5) was blocking the creation of sockets. I turned it off and all is now running sweetly.
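For the record, one common way to make an SELinux mode change stick across reboots is the SELINUX= line in /etc/selinux/config (setenforce 0 only switches the running system to permissive). The sketch below shows that edit; the function name is invented for illustration and the file is passed in as a parameter. Permissive is a gentler option than disabled, since denials are still logged but nothing is blocked.

```shell
#!/bin/bash
# Sketch: set the SELINUX= line of an SELinux config file to the given mode.
# set_selinux_mode is a hypothetical helper name.
set_selinux_mode() {
    local mode="$1"    # enforcing, permissive or disabled
    local file="$2"    # normally /etc/selinux/config
    case "$mode" in
        enforcing|permissive|disabled) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # Only the SELINUX= line is touched; SELINUXTYPE= does not match the pattern.
    sed -i.bak -E "s/^SELINUX=.*/SELINUX=${mode}/" "$file"
}
```

Usage would be set_selinux_mode permissive /etc/selinux/config, which leaves the previous version of the file alongside as config.bak.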

This morning (16 June) the URL was switched over at the OUNL end. This has gone more-or-less fine; however, there are some funky redirects going on for people who are still pointing at the old server. A line needs editing in the Drupal configuration to make the site refer to itself by name rather than IP address… I’ll do this towards the end of the day I think, once everyone is pointing in the right direction.