Big Data:
- It is data that needs new techniques and tools in order to be processed.
- Large data is needed because it helps in answering factoid questions and in learning about relations in the data.
- A synchronization mechanism is required to handle parallelization problems (communication between workers, access to shared resources).
- Parallelization challenges include: How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die?
- Managing multiple workers is difficult because we don't know the order in which workers run, when they interrupt each other, or the order in which they access shared data. Solutions are semaphores (lock/unlock), condition variables (wait, notify, broadcast) and barriers. Problems still remain: deadlock, livelock, race conditions, and the classic puzzles (dining philosophers, sleeping barbers, cigarette smokers...).
- Current tools used are:
Programming Models:
- Shared Memory
- Message Passing (MPI)
Design Patterns:
- Master-slaves
- Producer-consumer
- Shared work queues
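A minimal sketch of the producer-consumer pattern with a shared work queue, using Python's standard `queue` and `threading` modules. The function name `run_work_queue`, the squaring "work", and the `STOP` sentinel are assumptions made for illustration, not part of the notes. It touches several of the challenges above: more work units than workers, aggregating partial results, and knowing when all workers have finished.

```python
import queue
import threading


def run_work_queue(work_units, num_workers=3):
    """Square every work unit on a pool of workers and return the sum."""
    tasks = queue.Queue()      # shared work queue (thread-safe)
    results = queue.Queue()    # partial results from the workers
    STOP = object()            # hypothetical sentinel: "no more work"

    def worker():
        # Consumer: pull units until the sentinel arrives.
        while True:
            item = tasks.get()
            if item is STOP:
                break
            results.put(item * item)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Producer: there may be far more work units than workers.
    for unit in work_units:
        tasks.put(unit)
    for _ in threads:
        tasks.put(STOP)        # one sentinel per worker

    # join() is how we know that all the workers have finished.
    for t in threads:
        t.join()

    # Aggregate the partial results (safe: no worker is running now).
    total = 0
    while not results.empty():
        total += results.get()
    return total


if __name__ == "__main__":
    print(run_work_queue(range(10)))  # sum of squares 0..9 = 285
```

Note that this sketch does not handle the "what if workers die?" case: a worker that crashes mid-task silently loses its unit.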
Concurrency is difficult to reason about, and in practice it means lots of one-off solutions, writing our own libraries and explicitly managing everything.
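To show what "explicitly managing everything" looks like, here is a small sketch of the lock/unlock idea from the notes, assuming Python's `threading.Lock`. The function name `count_with_lock` and the counter workload are hypothetical. Without the lock, `counter += 1` is a read-modify-write that threads can interleave (a race condition); with the lock, only one thread updates at a time.

```python
import threading


def count_with_lock(num_threads=4, increments=10000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            # Acquire before touching shared state, release afterwards.
            with lock:
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # wait until every worker is done
    return counter


if __name__ == "__main__":
    print(count_with_lock())  # 4 threads * 10000 increments = 40000
```

Even in this tiny case the programmer has to remember every place the shared state is touched, which is why this style gets hard to reason about as programs grow.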
BIG IDEAS:
Scale out, not up: have a cluster of computers storing the data instead of just one supercomputer with lots of computation power.
Move processing to the data, as the cluster has limited bandwidth.
Process data sequentially and avoid random access: seeks are expensive, but disk throughput is reasonable.
Seamless scalability.
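A rough back-of-the-envelope sketch of why sequential processing can beat random access. All numbers here are assumptions chosen for illustration (about 30 ms per random update for seek, read and write; 100 MB/s sequential throughput; 10^10 records of 100 bytes each, with 1% of them updated); they do not come from the notes and real hardware will differ.

```python
# Assumed hardware figures, for illustration only.
SECONDS_PER_RANDOM_UPDATE = 0.030        # seek + read + write of one record
THROUGHPUT_BYTES_PER_SEC = 100 * 10**6   # sequential read/write rate


def random_update_seconds(num_records, fraction_updated):
    """Time to update a fraction of the records, one seek per update."""
    return num_records * fraction_updated * SECONDS_PER_RANDOM_UPDATE


def sequential_rewrite_seconds(num_records, record_bytes):
    """Time to read the whole dataset once and write it all back."""
    total_bytes = num_records * record_bytes
    return 2 * total_bytes / THROUGHPUT_BYTES_PER_SEC


if __name__ == "__main__":
    rand = random_update_seconds(10**10, 0.01)
    seq = sequential_rewrite_seconds(10**10, 100)
    print(f"random updates:     {rand / 86400:.1f} days")
    print(f"sequential rewrite: {seq / 3600:.1f} hours")
```

Under these assumptions, updating 1% of the records in place takes on the order of a month, while rewriting the entire dataset sequentially takes a few hours.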