Tuesday, October 27. 2009
Drizzle Replication - The Transaction Log
In this installment of my Drizzle Replication blog series, I'll be talking about the Transaction Log. Before reading this entry, you may want to first read up on the Transaction Message, which is a concept central to this blog entry.
The transaction log is just one component of Drizzle's default replication services, but it also serves as a generalized log of atomic data changes to a particular server. In this way, it is only partially related to replication. The transaction log is used by components of the replication services to store changes made to a server's data. However, nothing mandates that this particular transaction log be a required feature for Drizzle replication systems. For instance, Eric Lambert is currently working on a Gearman-based replication service which, while following the same APIs, does not require the transaction log to function. Furthermore, other modules unrelated to replication may use the transaction log themselves. For instance, a future Recovery and/or Backup module may just as easily use the transaction log for its own purposes.
Before we get into the details, it's worth noting the general goals we've had for the transaction log, as these goals may help explain some of the design choices made. In short, the goals for the transaction log are:
Overview of the Transaction Log Structure
Each entry in the transaction log is preceded by 4 bytes containing an integer code identifying the type of entry to follow. The bytes which follow this type header are interpreted based on the type of entry. For entries of type Transaction message, the entry is laid out in the log as follows. First, a 4-byte length header is written, then the serialized Transaction message, then a 4-byte checksum of the serialized Transaction message.
Details of the TransactionLog::apply() Method
For those interested in how the transaction log is written to, I'm going to detail the apply() method of the TransactionLog class in /plugin/transaction_log/transaction_log.cc. The TransactionLog class is simply a subclass of plugin::TransactionApplier and therefore must implement the single pure virtual apply method of that class interface. The TransactionLog class has a private drizzled::atomic<off_t> called log_offset, which is an offset into the transaction log file that is incremented with each atomic write to the log file. You will notice in the code below that this atomic off_t is stored locally, then incremented by the total length of the log entry to be written. A buffer is then written to the log file using pwrite() at the original offset. In this way, we completely avoid calling pthread_mutex_lock() or similar when writing to the log file, which should increase scalability of the transaction log.
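To make the reserve-then-write idea concrete, here is a minimal sketch of the pattern. It is not the actual TransactionLog::apply() code. It uses std::atomic in place of drizzled::atomic, and the names SketchLog, ENTRY_TYPE_TRANSACTION, toy_checksum and put_u32 are hypothetical. The little-endian byte order and the byte-sum checksum are also assumptions made purely for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical entry type code; the real value is defined in the Drizzle source.
static const uint32_t ENTRY_TYPE_TRANSACTION = 1;

// Stand-in checksum (a plain byte sum). This is an assumption for the sketch,
// not the checksum algorithm Drizzle uses.
static uint32_t toy_checksum(const std::string &data)
{
  uint32_t sum = 0;
  for (unsigned char c : data)
    sum += c;
  return sum;
}

// Append a 32-bit value in little-endian order (the byte order is an assumption).
static void put_u32(std::vector<char> &buf, uint32_t v)
{
  for (int i = 0; i < 4; ++i)
    buf.push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
}

class SketchLog
{
public:
  explicit SketchLog(int fd) : fd_(fd), log_offset_(0) {}

  // Writes one entry: type, length, payload, checksum.
  // Returns the offset the entry was written at, or -1 on a failed or short write.
  off_t apply(const std::string &serialized)
  {
    std::vector<char> buf;
    put_u32(buf, ENTRY_TYPE_TRANSACTION);
    put_u32(buf, static_cast<uint32_t>(serialized.size()));
    buf.insert(buf.end(), serialized.begin(), serialized.end());
    put_u32(buf, toy_checksum(serialized));

    // Reserve [cur_offset, cur_offset + buf.size()) without taking a mutex.
    // std::atomic::fetch_add returns the value held before the addition.
    off_t cur_offset = log_offset_.fetch_add(static_cast<off_t>(buf.size()));

    // pwrite() writes at an explicit offset, so concurrent callers never
    // share or move a common file position.
    ssize_t written = pwrite(fd_, buf.data(), buf.size(), cur_offset);
    return written == static_cast<ssize_t>(buf.size()) ? cur_offset : -1;
  }

  off_t offset() const { return log_offset_.load(); }

private:
  int fd_;
  std::atomic<off_t> log_offset_;
};
```

Because std::atomic::fetch_add hands back the previous value, this sketch can use the result directly as the write offset. An atomic add that returns the updated value instead would need the entry length subtracted to get back to the start of the reserved region.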
Reading the Transaction Log
OK, so the above code shows how the transaction log is written. What about reading the log file? Well, it's pretty simple. There is an example program in /drizzle/message/transaction_reader.cc which has code showing how to do this. Here's a snippet from that program:
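For readers who want a feel for the read path, the following is a rough sketch of walking the entry layout described earlier (type, length, payload, checksum). It is not taken from transaction_reader.cc. LogEntry, byte_sum, get_u32, read_exact and read_log are hypothetical names, and the little-endian byte order and byte-sum checksum are assumptions that simply need to match whatever wrote the file. A real reader would go on to parse each payload as a Transaction message, which is left out here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

struct LogEntry
{
  uint32_t type;       // entry type code from the 4-byte type header
  std::string payload; // raw serialized message bytes
  bool checksum_ok;    // true if the stored checksum matches the payload
};

// Stand-in checksum (plain byte sum); an assumption, not Drizzle's algorithm.
static uint32_t byte_sum(const std::string &data)
{
  uint32_t sum = 0;
  for (unsigned char c : data)
    sum += c;
  return sum;
}

// Decode a little-endian 32-bit value (the byte order is an assumption).
static uint32_t get_u32(const unsigned char *p)
{
  return static_cast<uint32_t>(p[0]) |
         (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Read exactly len bytes starting at offset; false on EOF or error.
static bool read_exact(int fd, void *dst, size_t len, off_t offset)
{
  char *out = static_cast<char *>(dst);
  while (len > 0)
  {
    ssize_t n = pread(fd, out, len, offset);
    if (n <= 0)
      return false;
    out += n;
    offset += n;
    len -= static_cast<size_t>(n);
  }
  return true;
}

// Walk the file from offset 0 and collect every complete entry.
static std::vector<LogEntry> read_log(int fd)
{
  std::vector<LogEntry> entries;
  off_t offset = 0;
  unsigned char header[8]; // 4-byte type followed by 4-byte length
  while (read_exact(fd, header, sizeof(header), offset))
  {
    uint32_t length = get_u32(header + 4);
    std::string payload(length, '\0');
    unsigned char trailer[4];
    if (!read_exact(fd, &payload[0], length, offset + 8) ||
        !read_exact(fd, trailer, sizeof(trailer), offset + 8 + length))
      break; // truncated entry: stop reading
    bool ok = get_u32(trailer) == byte_sum(payload);
    entries.push_back({get_u32(header), payload, ok});
    offset += 8 + static_cast<off_t>(length) + 4;
  }
  return entries;
}
```

The sketch trusts the length header, so a corrupt length could trigger a very large allocation. Production code would want to bound it against the file size first.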
Shortcomings of the Transaction Log
So far, we've generally focused on a scalable design for the transaction log and have not spent too much time on performance tuning the code (and yes, performance != scalability). There are a number of problems with the current code which we will address in future versions of the transaction log. Namely:
Summary and Request for Comments
That's it for the discussion about the transaction log. I'll post some more code examples from the replication plugins which utilize the transaction log in a later blog entry.
What do you think of the design of the transaction log? What would you change? Comments are always welcome! Cheers.
Comments
Very much detailed and explanation article. Thank you.
sorry if i am coming with stupid points. in TransactionLog::apply line no 41 it looks like offset is moving forward and then calculating and moving backward. is it something which can be avoided? as you said the dynamic allocation is an issue. anyway writing need to be serial. so what is the possibility of singleton for write operation? Thank you, Jobin.
Hi Jobin!
First of all, there aren't any stupid questions. So, on line 41:
cur_offset-= static_cast<off_t>((total_envelope_length));
You are correct that we are recalculating the original offset there. The reason we do this is to "cut out" a chunk of the log file that we will use for writing the current transaction message. Because TransactionLog::apply() is code that is run in a threaded server, the variables of the TransactionLog such as the log_offset variable must be protected against changes from another thread's execution of the TransactionLog::apply() method. The log_offset variable is of type drizzled::atomic<off_t>. This means that its contents follow the atomic API templates implemented in /drizzled/atomic/. This template provides methods which ensure synchronized writes of a variable's contents in a threaded environment. The atomic fetch_and_add() method atomically increments a variable's contents and returns the updated contents of the variable. On line 35:
cur_offset= log_offset.fetch_and_add(static_cast<off_t>(total_envelope_length));
what we are doing is carving out a chunk of space in the log file long enough to write our complete log entry. We "carve" this chunk out by simply atomically incrementing the log_offset. Once incremented, the new log_offset value is returned from fetch_and_add(), and we must therefore subtract the length of the log entry we will write in order to return to the exact offset into the transaction log that we will write into. Hope the above makes sense! Cheers! Jay
Beautiful idea. carve out enough space for the current thread. leave the other threads to carve out what they want. ..beautiful..beautiful. threads need not wait for other threads to complete the writing. parallel writes can happen to log. am i right? replication is going to Rock!
You got it! The key is the call to pwrite() (see line 83 above). pwrite() is a standard POSIX system call that is similar to write() but allows the caller to write a supplied buffer of a supplied size to an exact offset in a file. Here is the doc on pwrite() and write(): http://www.opengroup.org/onlinepubs/000095399/functions/write.html Cheers! Jay
