mv mutex cleanup
- merged to master
- code complete May 4, 2014
- development started April 30, 2014
Each database / vnode in leveldb contains a mutex that protects the table file (.sst) management structures. All user activity must briefly acquire the mutex before a Read, Write, or Iterator operation. Background compactions must also briefly acquire the mutex before and after a compaction. This branch addresses two scenarios where a background compaction might hold the mutex for an extended time during heavy disk activity. Holding the mutex during compaction blocks user operations, which hurts both throughput and latency.
The first scenario deals with common logging statements in the compaction process. Here is a sample of the target messages:
2014/04/25-18:39:08.496084 7f8d767a2700 Level-0 table #15: started
2014/04/25-18:39:08.707713 7f8d767a2700 Level-0 table #15: 35220571 bytes, 29779 keys OK
2014/04/25-18:39:08.725394 7f8d767a2700 Delete type=0 #12
2014/04/25-18:39:08.727119 7f8d71798700 Compacting 6@0 + 0@1 files
2014/04/25-18:39:10.886134 7f8d71798700 Generated table #16: 178675 keys, 210224702 bytes
2014/04/25-18:39:10.887666 7f8d71798700 Compacted 6@0 + 0@1 files => 210224702 bytes
2014/04/25-18:39:10.889495 7f8d71798700 compacted to: files[ 0 1 0 0 0 0 0 ]
These messages were being generated while the compaction thread held the database mutex. The problem is that each message generated an fwrite() and fflush() call. Both calls, particularly the fflush(), could block for extended periods during heavy disk activity. Blocking during logging therefore became blocking of all user operations. The targeted log messages were either moved to a point where the mutex is not held, or the mutex is now unlocked around them, to prevent blocking.
The changes only addressed common, regularly used log messages. Error logging, debug logging, and such were not changed.
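The first scenario can be illustrated with a minimal sketch. It is not the branch's code: `MiniDb`, its members, and both method names are hypothetical, and `std::mutex` stands in for the database mutex. The point is only the ordering: capture what the message needs while locked, then write to the log after the lock is released.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a database whose file list is mutex-protected.
struct MiniDb {
    std::mutex mutex_;                // plays the role of the database mutex
    std::vector<std::string> files_;  // state the mutex protects
    std::FILE* log_ = stderr;         // info log destination (assumption)

    // Pattern to avoid: fprintf/fflush may stall on a busy disk while
    // every reader and writer is waiting on mutex_.
    std::size_t AddFileLoggingUnderLock(const std::string& name) {
        assert(log_ != nullptr);
        std::lock_guard<std::mutex> guard(mutex_);
        files_.push_back(name);
        std::fprintf(log_, "Generated table %s\n", name.c_str());
        std::fflush(log_);
        return files_.size();
    }

    // Pattern after the change: copy the values the message needs while
    // locked, release the mutex, and only then touch the log file.
    std::size_t AddFileLoggingAfterUnlock(const std::string& name) {
        assert(log_ != nullptr);
        std::size_t count = 0;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            files_.push_back(name);
            count = files_.size();
        }  // mutex_ is released here
        std::fprintf(log_, "Generated table %s (%zu files)\n",
                     name.c_str(), count);
        std::fflush(log_);
        return count;
    }
};
```

Both methods leave the protected state identical; only the position of the log call relative to the lock differs.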
The second scenario addressed by this branch deals with the deletion of old .sst files. There is a routine in db/db_impl.cc called DeleteObsoleteFiles(). This routine was regularly called with the database mutex held. The routine reads the current disk directories for all levels and potentially deletes any .sst files that are no longer needed. Large databases can easily contain tens of thousands of directory entries (.sst files). The reading, processing, and deleting of the files can hold the mutex for extended periods. This branch no longer holds the mutex during these disk operations.
[Note: there is an old, unimplemented branch that reduced the number of times per minute that DeleteObsoleteFiles() would be called. That branch will likely be revived since its incremental benefits were likely hidden by the mutex problem.]
DeleteObsoleteFiles() is called by the KeepOrDelete() routine. The original code assumed the database mutex, this->mutex_, was held. The new assumption is that the mutex is NOT held. The mutex is only locked if the version structures / routines are used. The changes move the "Deleted" log messages and all directory processing for deletes outside of the mutex lock.
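A rough sketch of the "mutex NOT held on entry" approach follows. All names here (`MiniVersions`, `DeleteObsoleteSketch`, `list_dir`, `remove_file`) are hypothetical, and the two callbacks stand in for the slow directory read and file delete. The sketch also ignores files that a concurrent compaction may create after the snapshot is taken, which real code has to account for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <set>
#include <vector>

// Hypothetical miniature of the version bookkeeping the mutex protects.
struct MiniVersions {
    std::mutex mutex_;
    std::set<uint64_t> live_files_;  // file numbers still in use
};

// Assumes the caller does NOT hold v.mutex_. The mutex is taken only to
// copy the live set; directory listing and deletion run unlocked.
// Returns the number of files handed to remove_file.
std::size_t DeleteObsoleteSketch(
        MiniVersions& v,
        const std::function<std::vector<uint64_t>()>& list_dir,
        const std::function<void(uint64_t)>& remove_file) {
    assert(list_dir && remove_file);

    std::set<uint64_t> live;
    {
        std::lock_guard<std::mutex> guard(v.mutex_);
        live = v.live_files_;  // short critical section: copy only
    }  // mutex released before any disk work

    std::size_t deleted = 0;
    for (uint64_t number : list_dir()) {  // potentially slow directory read
        if (live.count(number) == 0) {
            remove_file(number);          // potentially slow delete
            ++deleted;
        }
    }
    return deleted;
}
```

The cost of this arrangement is one copy of the live set per call; in exchange, user operations are no longer blocked for the duration of the directory scan.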
WriteLevel0Table() resolves the mutex/logging conflict by a simple copy/paste of the logging statements to a different position in the routine, one where the mutex is not held. BackgroundCall2(), BackgroundImmCompactCall(), and DoCompactionWork() do the same thing.
There is an additional s.ok() check and two changes of the constant 75000 to 300000 that have nothing to do with this branch. They are part of work from incomplete branches that must be brought to this branch for comparison and stability.
The UpdatePenalty() routine had a Matthew V. logging statement. It is not essential. It is currently commented out and will likely be deleted before PR merge. There is also a change from another branch relating to the penalty calculation. That change is necessary for this branch to function properly in the 4 Terabyte range used to evaluate this modification.
SetupOtherInputs() has a non-essential logging statement. It is #if/#def'd and will likely be removed before PR merge. The logging is interesting, but not essential to debugging customer issues. Releasing the mutex at this point might cause problems, because the release would allow the "current version" to change.
Again, a 4 Terabyte change for testing: the default of 4 days before file flush from cache is raised to 10 days. This affects the 4 Terabyte test's interaction with AAE.
The logger was never thread safe. The changes in this branch increase the number of log calls made without the database mutex locked. The result is an increased chance of thread overlap. A spin lock now protects the crucial fwrite() call and intentionally does NOT cover the fflush() call. The spin lock will NOT ensure the temporal order of the logging statements, but it will protect against potential overwriting within fwrite().
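The spin lock arrangement might look roughly like the sketch below. This is an illustration, not the branch's logger: the class name `SpinLockedLogger` and its members are hypothetical, and `std::atomic_flag` is used as the spin lock. The lock covers only the fwrite() call; the fflush() happens after the lock is released, so a slow flush never keeps another thread spinning.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical logger: a spin lock guards fwrite() only.
class SpinLockedLogger {
 public:
    explicit SpinLockedLogger(std::FILE* file) : file_(file) {
        assert(file_ != nullptr);
    }

    void Log(const char* line) {
        const std::size_t len = std::strlen(line);

        // Spin until the flag is acquired; the guarded region is short.
        while (busy_.test_and_set(std::memory_order_acquire)) {
            // busy wait
        }
        std::fwrite(line, 1, len, file_);
        busy_.clear(std::memory_order_release);

        // Deliberately outside the spin lock: fflush() may block on disk.
        std::fflush(file_);
    }

 private:
    std::FILE* file_;
    std::atomic_flag busy_ = ATOMIC_FLAG_INIT;
};
```

As in the text above, nothing here orders messages from different threads; the lock only keeps one thread's write from interleaving with another's.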