Categories
computer

Recovering A Corrupt OpenLDAP Database On OSX Server

Recovering A Corrupt OpenLDAP Database On OSX Server
Last night we noticed some services provided by an OSX Leopard Server instance were not working correctly. The iChat, AFP and Web services were not authenticating. In Server Admin.app, the “Overview” tab of the Open Directory service reported…
LDAP Server is: Not Running
Password Server is: Running
Kerberos is: Not Running
Looking at the server error logs through Console.app, the following was occuring every 10 seconds..
com.apple.launchd[1] (org.openldap.slapd[27382]) Exited with exit code: 1
com.apple.launchd[1] (org.openldap.slapd) Throttling respawn: Will start in 10 seconds
The slapd daemon appeared not to be starting. Jumping to the command line, I tested the configuration using the `slapd -Tt` command.
core:openldap admin$ sudo /usr/libexec/slapd -Tt
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
bdb(dc=openrain,dc=com): PANIC: fatal region error detected; run recovery
bdb_db_open: Database cannot be opened, err -30978. Restore from backup!
bdb(dc=openrain,dc=com): DB_ENV->lock_id_free interface requires an environment configured for the locking subsystem
backend_startup_one: bi_db_open failed! (-30978)
slap_startup failed (test would succeed using the -u switch)
http://discussions.apple.com/message.jspa?messageID=9548971
With a little research, I concluded that..
The OpenLDAP database had been corrupted, and..
The `slapd_db_recover` tool (as present on some Linux installations) is instead named `db_recover`. Ah!
After carefully backing up the /var/db/openldap folder, I ran the recovery tool and re-tested the configuration..
core:openldap admin$ sudo db_recover -h /var/db/openldap/openldap-data/
core:openldap admin$ sudo /usr/libexec/slapd -Tt
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
config file testing succeeded
The errors in Console.app stopped, and the Server Admin.app panel started reporting..
LDAP Server is: Not Running
Password Server is: Running
Kerberos is: Not Running
I had to restart the AFP, iChat and Web services on the machine to get everything working again, but all seems well now.
Last night we noticed some services provided by an OSX Leopard Server instance were not working correctly. The iChat, AFP and Web services were not authenticating. In Server Admin.app, the “Overview” tab of the Open Directory service reported…
LDAP Server is: Not Running
Password Server is: Running
Kerberos is: Not Running
Looking at the server error logs through Console.app, the following was occuring every 10 seconds..
com.apple.launchd[1] (org.openldap.slapd[27382]) Exited with exit code: 1
com.apple.launchd[1] (org.openldap.slapd) Throttling respawn: Will start in 10 seconds
The slapd daemon appeared not to be starting. Jumping to the command line, I tested the configuration using the `slapd -Tt` command.
core:openldap admin$ sudo /usr/libexec/slapd -Tt
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
bdb(dc=openrain,dc=com): PANIC: fatal region error detected; run recovery
bdb_db_open: Database cannot be opened, err -30978. Restore from backup!
bdb(dc=openrain,dc=com): DB_ENV->lock_id_free interface requires an environment configured for the locking subsystem
backend_startup_one: bi_db_open failed! (-30978)
slap_startup failed (test would succeed using the -u switch)
With a little research, I concluded that..
  1. The OpenLDAP database had been corrupted, and..
  2. The `slapd_db_recover` tool (as present on some Linux installations) is instead named `db_recover`. Ah!
After carefully backing up the /var/db/openldap folder, I ran the recovery tool and re-tested the configuration..
core:openldap admin$ sudo db_recover -h /var/db/openldap/openldap-data/
core:openldap admin$ sudo /usr/libexec/slapd -Tt
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
overlay_config(): warning, overlay “dynid” already in list
config file testing succeeded
The errors in Console.app stopped, and the Server Admin.app panel started reporting..
LDAP Server is: Running
Password Server is: Running
Kerberos is: Running
I had to restart the AFP, iChat and Web services on the machine to get everything working again, but all seems well now.

44 replies on “Recovering A Corrupt OpenLDAP Database On OSX Server”

Add me to the grateful throngs. The LDAP db on my Leopard Server was corrupted after a restart that didn’t. After a panicked morning, I found your post. Worked like a champ!

(Does 4 count as a throng?)

This worked great but it was corrupted again this morning after a restart, any reason why this would keep happening? Thanks!

@Craig
I haven’t seen that happen before. The only time this has happened to me is when I’ve had to hard reset the machine and/or manually kill the daemons: in other words not shutting stuff down properly. My best guess is that there may be something external to LDAP that is corrupting the files on disk, or perhaps a disk issue itself.

Hi,
Thanks for your quick response, and actually we are having to force shutdown the machine as it is getting stuck on shutdown. Thanks so much for the insight.

@preston.lee

@Craig
No problem. If you’re consistently needing to force shutdown, try running `sync` at the command line before hitting the power button. That should force flush the disks write buffers. Hopefully that’ll at least keep LDAP from getting corrupted.

Preston, another great thanks from me. Anyway, my situation was identical to yours. Same identical messages, and corruption generated by hitting the power button.

I did the db_recover and it worked. Only thing, Kerberos is still listed as “Stopped”. Any ideas why? Everything seems to work.

Great post & instructions; like the others, it got me out of a jam after an update hung the restart and left me with a corrupt db.

Thanks a lot this assisted me with a similar problem I was having.

I had to run a few more commands in terminal to get it working though

1) sudo to root

sudo -i
2) shutdown the open directory server

service org.openldap.slapd stop

3) dump a copy of the Open Directory database to an LDIF format text file

mkdir /var/root/opendirectory
cd /var/root/opendirectory
slapcat -l dir.ldif
4) move the old (corrupt) database files out of the way (or remove them).

cd /var/db/openldap/openldap-data
mkdir SAVE
mv *.bdb SAVE/

be sure you don’t move, rename or delete the file named DB_CONFIG. It’s needed.

5) recreate the database from the LDIF format file

cd /var/root/opendirectory
slapadd -l dir.ldif
slapindex
You will see some harmless warnings during slapadd. Ignore them.

6) restart open directory

service org.openldap.slapd start

Thank you! Thank you! Thank you!!!

I thought I was being good, doing a backup through ServerAdmin before any updates…but then I wasn’t able to restore through ServerAdmin when this problem occurred.

Your solution worked perfectly, and was back up in minutes.

Thank you.

This has happened to to me three times.
Two times I rebuilt from scratch.

This time – thank you!

This works! Won’t know db_recovery is the right tool without reading your post.

On my Xserve, I have to use this to recover the database.

sudo db_recover -cev -h /var/db/openldap/openldap-data/

I may try this seeing all the positive responses and the fact that I am currently in the same boat, but did you lose all of your LDAP setting or did they stay intact, because I see you said you carefully backed up the openldap directory but does this require you to replace it? Or does it fix the existing LDAP

That would be wonderful. One last question if I may. You also mentioned you had to restart services such as AFP after. What was the reason? Did it prevent the LDAP from fully starting or running that recovery caused those services to stop?

KARMA is a boomerang and you have a lot of good KARMA coming your way my friend. This fix is fantastic! and a life saver for all who lose theri ldap data!!

TY TY TY !!

You are the man!!! You saved me from a LOT of work, time, frustration, profanity, ulcers, high blood pressure, etc.

Just about to poo my pants until I came across this article. Many thanks Guys!

LDAP Server up and running again.
Thanks a lot, this did it, great !!!!!

Preston, thanks for the article. I’m experiencing a similar problem, though OD reports that LDAP, Password Server and Kerberos are all running. But in my LDAP log I get the similar messages as you:

Oct 4 11:07:36 s1 slapd[26448]: bdb(cn=accesslog): DB_ENV->lock_id interface requires an environment configured for the locking subsystem
Oct 4 11:07:36: — last message repeated 3 times —
Oct 4 11:07:36 s1 slapd[26448]: findbase failed! 80

When I tested the config, I get this:

s1:~ sadmin$ sudo /usr/libexec/slapd -Tt
bdb_monitor_db_open: monitoring disabled; configure monitor database to enable
bdb_db_open: database “cn=accesslog”: unclean shutdown detected; attempting recovery.
bdb_db_open: database “cn=accesslog”: recovery skipped in read-only mode. Run manual recovery if errors are encountered.
bdb_db_open: database “cn=accesslog”: alock_recover failed
bdb_db_open: could not restore bdb backend -1config file testing succeeded
bdb_db_close: database “cn=accesslog”: alock_close failed

Do you think I should follow your steps above in using the db_recovery tool, or is that not applicable to my situation? Authentication and all services are working on the server, but I’m afraid this problem could manifest itself in ugly ways if I let it continue.

After some other troubleshooting I tried the db_recover tool, and rather than recovering the database, I received these messages:

Oct 4 11:46:18 s1 slapd[1116]: bdb(dc=*****,dc=*****): PANIC: fatal region error detected; run recovery
Oct 4 11:46:18 s1 slapd[1116]: SASL [conn=29] Failure: no user in database _ldap_replicator

I’ve rebooted the machine, manually unloaded & loaded slapd, and ran the recovery tool multiple times, and nothing has worked. Any other ideas? Getting pretty nervous now.

Not sure if anyone is checking this thread anymore, but just in case someone with the same problem reads it, here was my solution. None of the commands above worked in my situation – I suppose the corruption was too bad for the db_recover tool (or the manual .ldif steps), so I took one of my healthy OD Replicas and promoted it to Master. I then destroyed OD on the the previous Master, rebooted and made it a Replica of the new Master. This resulted in a fresh, clean database on the old Master, as well as not losing any records or passwords on the new one. Took me hours of troubleshooting and trial and error tests, but I should have done this as my very first step. There was minimal downtime and to most users the transition was transparent.

has exactly the same problem on leopard 10.5.8 server. thanks. it works again for me!!!!!!!

Seriously, put up a Paypal link since you saved me a few hours of research. Or a mailing address.

Thank you for the help. It is 2015 but I am still running 10.6.8 on my server and this solved the problem.

Leave a Reply

Your email address will not be published. Required fields are marked *