Tuesday, 10 November 2009

Errors happen

I have noticed a few DatastoreTimeoutException errors in the logs of my application hosted on Google App Engine. They look like this:

com.google.appengine.api.datastore.DatastoreTimeoutException: Unknown

I found a hint about this from Jason Cooper (Google):
The majority of exceptions are thrown because two or more requests come in at the same time that attempt to write to a single entity or entity group. This results in write contention, and an exception may be thrown if a given request can't complete its write within the deadline. So, during your design stage, be sure to identify areas where this may be a problem and mitigate as much as possible. You can do this by keeping your entity groups small and sharding single entities that may be updated a lot, such as a global counter: http://code.google.com/intl/pl/appengine/articles/sharding_counters.html
Due to the distributed nature of App Engine's datastore, these exceptions will happen from time to time; in practice, this affects between 0.1 and 0.2 percent of all datastore operations, and we are always working to make this percentage even lower. For critical datastore operations, you should continue to catch the exception so you can provide a custom error if necessary or perform your own retries.
I would recommend two or three [retries] at most before showing the user an error message.
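
The sharding idea mentioned in the quote can be illustrated with a small in-memory sketch. This is not datastore code and not the implementation from the linked article: the class name ShardedCounter is hypothetical, and each array slot only stands in for one shard entity. The point is that every increment touches one randomly chosen shard, so concurrent writers rarely contend for the same one, and the total is read by summing all shards.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical in-memory illustration of a sharded counter.
// Each array slot stands in for one shard entity in the datastore.
final class ShardedCounter {
    private final AtomicLongArray shards;

    ShardedCounter(int numShards) {
        if (numShards < 1) {
            throw new IllegalArgumentException("numShards must be at least 1");
        }
        this.shards = new AtomicLongArray(numShards);
    }

    // Increment one randomly chosen shard instead of a single global value.
    void increment() {
        int index = ThreadLocalRandom.current().nextInt(shards.length());
        shards.incrementAndGet(index);
    }

    // The counter value is the sum over all shards.
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }
}
```

The trade-off is that writes become cheap and spread out, while a read has to visit every shard.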

I have implemented a retry for when DatastoreTimeoutException occurs, and it seems to work great now.
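
A minimal sketch of what such a retry wrapper could look like, under a few assumptions. The helper DatastoreRetry and its method withRetries are hypothetical names, and the exception class below is only a stand-in for com.google.appengine.api.datastore.DatastoreTimeoutException so that the example compiles without the App Engine SDK. The doubling pause between attempts is also an assumption; the advice quoted above only speaks of limiting the number of attempts to two or three.

```java
import java.util.function.Supplier;

// Stand-in for com.google.appengine.api.datastore.DatastoreTimeoutException,
// declared here only so the sketch compiles without the App Engine SDK.
class DatastoreTimeoutException extends RuntimeException {
    DatastoreTimeoutException(String message) {
        super(message);
    }
}

// Hypothetical helper: runs an operation and retries it on a datastore timeout.
final class DatastoreRetry {
    private DatastoreRetry() {}

    // maxAttempts counts the first try too, so 3 means one try plus two retries.
    static <T> T withRetries(int maxAttempts, long initialBackoffMillis, Supplier<T> operation) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (DatastoreTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up; the caller can show an error message
                }
                try {
                    Thread.sleep(backoff); // assumed pause before the next attempt
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
                backoff *= 2;
            }
        }
    }
}
```

In an application the lambda passed to withRetries would wrap the actual datastore call, for example a DatastoreService.put, and the code handling the final exception would show the user an error page.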

There is an interesting talk about the quality of applications hosted on App Engine: Best Practices - Building a Production Quality Application on Google App Engine.
Ken Ashcraft divided the problems that can happen into a few classes:

• Out-of-memory
• DeadlineExceeded
• OverQuotaError
• Server crash
• Datastore crash
• Identical entity already exists

I've recently read Coders at Work (great book), where Ken Thompson says he guesses that about 50 percent of the code in Google's infrastructure deals with handling exceptions, errors and failures that occur at runtime.
I have some experience with real-time distributed telecommunication systems. I had never thought before about what share of such a system's code is devoted to handling special situations. But, surprisingly, it really could be as much as half. It looks like we should be more aware of the underlying distributed infrastructure when developing applications on top of GAE.
