Saturday, March 24, 2007

Assumptions On The Way

When I read this post about the idea of when to cache, I thought: it's not just a caching problem; the cache is just one component. So I think the real question is "What to do, or not to do?".
I believe that the key to answering this question lies in the assumptions.
Caching is very relevant to distributed systems in general, and it comes in many modes and many techniques.
As mentioned in the post, I agree that a cache implementation cannot increase the mail server's performance. The relation between a client and his mailbox is 1:1, so the data is required by only one client.
To put the guesswork aside and have a simple rule to follow during implementation, each project should start by stating its ASSUMPTIONS.
So, let's say that from the beginning I assume that every client on my system is hyperactive: my clients never stop pressing the refresh button in their browsers. That pushes the designer to find a way to decrease the response time for each client, for example by implementing a cache layer, which can be considered "selfish caching".
Someone else might assume that the clients are normal users who access their mail smoothly, without hammering the server; that system doesn't need a cache.
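
To make the first scenario concrete, here is a minimal sketch in Python. The names (MailboxCache, fetch_mailbox, ttl) are made up for illustration, not taken from any real mail server; the point is only to show what a "selfish" per-client cache layer could look like under the hyperactive-client assumption.

import time

class MailboxCache:
    """Caches each client's mailbox for a short TTL, so repeated
    refreshes within that window never hit the real mail store."""

    def __init__(self, fetch_mailbox, ttl=5.0):
        self._fetch = fetch_mailbox       # slow call to the real mail store
        self._ttl = ttl                   # seconds a cached mailbox stays fresh
        self._entries = {}                # client_id -> (timestamp, mailbox)

    def get(self, client_id):
        now = time.monotonic()
        entry = self._entries.get(client_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]               # fresh enough: serve from the cache
        mailbox = self._fetch(client_id)  # otherwise go to the mail store
        self._entries[client_id] = (now, mailbox)
        return mailbox

# A hyperactive client refreshing ten times causes only one real fetch;
# under the "normal user" assumption this layer would be pointless overhead.
calls = []
cache = MailboxCache(lambda cid: calls.append(cid) or ["mail for " + cid])
for _ in range(10):
    cache.get("alice")
print("real fetches:", len(calls))        # -> 1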

Let me give a practical example of assumptions at work. When I read about GFS (the Google File System), something felt wrong: I had predicted some technical problems and expected the GFS design to have a plan to kill them off, but I figured out that I was completely wrong.
When I went back to the assumptions, I found that, given those assumptions, the system fairly provides a complete solution. The moral is that a system's design and implementation always answer to its assumptions, not to a vague problem statement.
To be more specific, most distributed-systems research rests on one very important assumption: concurrent reading is always more frequent than concurrent writing.
(Once written, files are seldom modified again.)
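
That assumption shapes designs directly. As a rough illustration (my own sketch, not from any particular paper), a read-mostly store can let readers skip locking entirely and make the rare writers pay the cost by publishing a fresh copy:

import threading

class ReadMostlyStore:
    def __init__(self):
        self._data = {}                   # the currently published snapshot
        self._write_lock = threading.Lock()

    def read(self, key, default=None):
        # No lock: readers only dereference the current snapshot,
        # which is never mutated after being published.
        return self._data.get(key, default)

    def write(self, key, value):
        # The rare, expensive path: serialize writers and swap in a new copy.
        with self._write_lock:
            new_data = dict(self._data)
            new_data[key] = value
            self._data = new_data         # atomic reference swap

store = ReadMostlyStore()
store.write("chunk-0001", "server-A")
print(store.read("chunk-0001"))           # -> server-A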

The GFS Assumptions:
  1. The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
  2. The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
  3. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
  4. The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
  5. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
  6. High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
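
To illustrate assumption 5, here is a toy sketch, not GFS itself: many producer threads appending records to one shared local file, with a single lock standing in for the atomicity that GFS's record append provides on the chunkserver side. The class and file name are hypothetical.

import threading

class AppendOnlyLog:
    def __init__(self, path):
        self._file = open(path, "ab")     # append-only, binary
        self._lock = threading.Lock()

    def record_append(self, payload: bytes):
        # Each length-prefixed record is written in one locked step, so
        # records from concurrent producers never interleave byte-by-byte.
        with self._lock:
            self._file.write(len(payload).to_bytes(4, "big") + payload)
            self._file.flush()

log = AppendOnlyLog("queue.log")          # hypothetical file name
producers = [
    threading.Thread(target=lambda i=i: log.record_append(b"record-%d" % i))
    for i in range(100)
]
for t in producers:
    t.start()
for t in producers:
    t.join()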
