New issue
Advanced search Search tips

Issue 866816 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , iOS , Chrome , Mac , Fuchsia
Pri: 3
Type: Bug



Sign in to add a comment

sql::Connection auto recovery support

Project Member Reported by se...@yandex-team.ru, Jul 24

Issue description

Currently there is no a well-established practice of using sql::Recovery in Chromium for every DB. This is sad, because there is currently a lot of DBs that have SQLITE_CORRUPT errors in production. Also, the recoverability of every new DB that is created in Chromium is solely depends on the knowing of a developer about the sql::Recovery helpers.

There were some attempts to solve the problem locally for LoginDatabase but they didn't make it to the master:
https://chromium-review.googlesource.com/c/chromium/src/+/528139
https://chromium-review.googlesource.com/c/chromium/src/+/565112

I would like to implement a generic auto recovery logic that would be accesible via sql::Connection interface, could work automatically without a custom error handler and could be integrated with existing error handlers that do the recovery manually.

The idea looks like this:
1. Introduce an enum with a recovery policy:
enum class RecoveryPolicy {
  kNone,     // Does nothing.
  kRaze,     // Does Raze.
  kRecover,  // Executes sql::RecoverDatabase, fallbacks to Raze if failed.
  kRecoverWithMeta,  // Executes sql::RecoverDatabaseWithMetaVersion, fallbacks
                     // to Raze if failed.
};

2. Add RecoveryPolicy as an additional parameter into sql::Connection::Open:
bool Open(const base::FilePath& path, RecoveryPolicy recovery_policy);

3. Add a call to the recovery logic in sql::Connection::OnSqliteError with some restrictions:
  a) If error_callback_ is set, try to do the recovery after a callback run.
  b) Do the recovery only if a DB is not poisoned, is opened and we're at the first-level error in the call stack.

4. The recovery logic will not poison the DB if it was recovered on Open (i.e. retry_flag == RETRY_ON_POISON), otherwise it will always poison the DB. This will prevent a sql::Connection user from DB bugs if it's not ready to handle an unexpected data change. Optionally we can implement a on_poisoned/recovered_event callback so user can reinitialize the connection and continue to work in the current session.

pwnall@ what do you think? Can we do this? Maybe you have your own thoughts on this topic?
If the idea looks good in general, I can make a CL and we can discuss details there.
 
pwnall@ kindly ping. 
Sorry I wasn't able to respond earlier. I was at offsites for the past few of days.

Immediate logistic troubles aside, I really like the overall idea of standardizing recovery. The timing is pretty fortunate, as well -- I've recently had time to finally go through some of the code in //sql, and I have a better idea of what's going on. Last year, I had just taken over from shess@.

At a lower level, I don't think the proposed API meshes well with the current design of sql::Connection. The general approach for configuring a sql::Connection is setters that must be called before Open, not making Open take a lot of arguments. I don't particularly like the approach I've just described either -- it seems to bundle a builder together with the built class. I hope you're fine with changes at this level of detail.

Before unifying recovery for all //sql users, I would like to understand the current situation. Would you happen to have the time and willingness to put together an inventory of all //sql users and their current strategies for recovery?
Hey, @pwnall. I've looked at all classes that do set_error_callback() on sql::Database (previously sql::Connection). Here is the result:

chrome/browser/extensions/activity_log/activity_database.cc
Checks for sql::IsErrorCatastrophic() and does RazeAndClose().

components/blacklist/opt_out_blacklist/sql/opt_out_store_sql.cc
Checks sql::Recovery::ShouldRecover() and follows a standard recovery procedure without meta table:
1. Resets error callback.
2. Runs sql::Recovery::RecoverDatabase.
3. Calls sql::Database::IsExpectedSqliteError().

components/history/core/browser/history_backend.cc
Here we have something a little more complicated. First it checks for sql::IsErrorCatastrophic() and then it does PostTask which Razes and then Closes the DB with some intermediate logic to notify its clients about the DB is being closed.

components/history/core/browser/thumbnail_database.cc
Generates some diagnostics for every error. Checks sql::Recovery::ShouldRecover() and follows a standard recovery procedure with meta table.

components/history/core/browser/top_sites_database.cc
Checks sql::Recovery::ShouldRecover() and does a MANUAL recovery procedure which includes:
1. Call to sql::Recovery::BeginRecoverDatabase.
2. Meta table version check.
3. Database content check/update/remove if some logical constraints are broken.

components/omnibox/browser/shortcuts_database.cc
Checks sql::Recovery::ShouldRecover() and follows a standard recovery procedure without meta table, but *meta table exists*.

components/password_manager/core/browser/android_affiliation/affiliation_database.cc
Checks for sql::IsErrorCatastrophic() and does RazeAndClose().

components/password_manager/core/browser/login_database.cc
Doesn't have any recovery logic, just prints a log message on SQLITE_CONSTRAINT error.

components/sync/syncable/directory_backing_store.cc
Doesn't have any recovery logic, generates only a log message (see LoggingUnrecoverableErrorHandler in sync_client.cc)

components/webdata/common/web_database_backend.cc
Checks for sql::IsErrorCatastrophic() and does RazeAndClose().

content/browser/appcache/appcache_database.cc
Detects corruption using sql::IsErrorCatastrophic(), generates metric and closes DB (no recovery logic).

net/extras/sqlite/sqlite_channel_id_store.cc
Checks for sql::IsErrorCatastrophic() and does RazeAndClose() via PostTask.

net/extras/sqlite/sqlite_persistent_cookie_store.cc
Checks for sql::IsErrorCatastrophic() and does RazeAndClose() via PostTask (same as previous).


Summary:
Only one class does a manual recovery via sql::Recovery::BeginRecoverDatabase with a content check.
Some users do RazeAndClose, some do sql::Recovery::RecoverDatabase[WithMetaVersion].
Some users do an error handling via PostTask.

Nit: sql::IsErrorCatastrophic() is basically mirrors sql::Recovery::ShouldRecover(). I think we can remove one of these.

Sign in to add a comment