1. In-memory computation

Apache Spark is a cluster-computing system designed to be fast for interactive queries, which it achieves through in-memory cluster computation. This also enables Spark to run iterative algorithms efficiently.

The data within an RDD is kept in memory for as long as you choose to keep it. Holding the data in memory can easily improve overall performance by an order of magnitude.

2. Lazy Evaluation

Lazy evaluation means the data inside RDDs is not computed right away. As we apply transformations, Spark builds a DAG, and the computation is performed only after an action is triggered. When an action is triggered, all the transformations on the RDDs are then executed. This limits how much work Spark has to do.
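The idea can be sketched in plain Python (this is a conceptual illustration, not Spark's actual API): "transformations" build a chain of generators that compute nothing, and only the final "action" pulls data through the whole chain.

```python
# Conceptual sketch of lazy evaluation: transformations build a lazy
# pipeline; nothing runs until an action (here, sum) consumes it.

def lazy_map(data, fn):
    return (fn(x) for x in data)          # builds a generator, does no work

def lazy_filter(data, pred):
    return (x for x in data if pred(x))   # also lazy

nums = range(1, 6)                        # source dataset: 1..5
doubled = lazy_map(nums, lambda x: x * 2) # no computation yet
evens = lazy_filter(doubled, lambda x: x > 4)

result = sum(evens)                       # the "action" triggers the chain
```

Only at `sum(evens)` does any element actually get doubled and filtered, mirroring how Spark defers work until an action like `collect()` or `count()`.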

3. Fault Tolerance

In Spark, fault tolerance is achieved using the DAG. If a worker node fails, the DAG lets us find which node had the issue. We can then re-compute the lost partition of the RDD from the original data, so the lost data is easily recovered.
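A minimal plain-Python sketch of the lineage idea (the names here are illustrative, not Spark's API): each partition remembers how it was derived, so a lost partition can be rebuilt by re-running its lineage from the source split instead of restoring a replica.

```python
# Lineage-based recovery sketch: re-run only the lost partition's
# transformation from its source split.

source = [[1, 2], [3, 4], [5, 6]]       # three input partitions

def lineage(part):
    """The recorded transformation that produced each derived partition."""
    return [x * 10 for x in part]

derived = [lineage(p) for p in source]  # computed partitions
derived[1] = None                       # simulate losing partition 1

# Recovery: recompute just the lost partition from its source split.
if derived[1] is None:
    derived[1] = lineage(source[1])
```

Only partition 1 is recomputed; the surviving partitions are untouched, which is why lineage-based recovery is cheaper than full replication.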

4. Fast Processing

Today we generate a huge amount of data, and we need processing speed to be very fast. With Hadoop, the processing speed of MapReduce was not fast enough. That is why we use Spark, since it offers very good speed.

5. Persistence

We can keep RDDs in memory and retrieve them directly from memory. There is no need to go to disk, which speeds up execution. We can perform multiple operations on the same data by explicitly storing it in memory with the persist() or cache() function.
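A plain-Python sketch of what persist()/cache() buys you (conceptual, not Spark code): materialize an expensive result once in memory, then reuse it for several operations instead of recomputing the pipeline each time. A counter tracks how often the expensive step runs.

```python
# Caching sketch: compute once, reuse for multiple operations.

compute_count = 0

def expensive_transform(x):
    global compute_count
    compute_count += 1      # track how many times the expensive step runs
    return x * x

data = range(5)
cached = [expensive_transform(x) for x in data]  # "persist": compute and keep

total = sum(cached)         # first operation reads the cached values
biggest = max(cached)       # second operation: no recomputation needed
```

Without the cached list, each operation would re-run `expensive_transform` over the data, which is exactly the repeated-recomputation cost that persist() and cache() avoid in Spark.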

6. Partitioning

An RDD partitions the records logically and distributes the data across various nodes in the cluster. The logical divisions are only for processing; internally, the data has no physical division. This is what provides parallelism.
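The mechanics can be sketched in plain Python (the `partition` helper here is illustrative, not a Spark function): split one dataset into chunks, process each chunk independently, then combine the per-chunk results.

```python
# Logical partitioning sketch: split a dataset into chunks that can be
# processed independently, then combine the partial results.

def partition(data, n):
    """Split data into at most n roughly equal chunks."""
    data = list(data)
    size = -(-len(data) // n)   # ceiling division: items per chunk
    return [data[i:i + size] for i in range(0, len(data), size)]

parts = partition(range(10), 3)         # e.g. [0..3], [4..7], [8..9]
partial_sums = [sum(p) for p in parts]  # each partition mapped independently
total = sum(partial_sums)               # combine the partial results
```

Because each partition's work depends only on its own chunk, the per-partition steps could run on different nodes at once.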

7. Parallel

In Spark, RDDs process the data in parallel.
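As a rough stand-in for executors working on different partitions at once, here is a plain-Python sketch using a thread pool (again conceptual, not Spark's API):

```python
# Parallel-processing sketch: a worker pool processes partitions
# concurrently, like executors working on different RDD partitions.
from concurrent.futures import ThreadPoolExecutor

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))  # one task per partition

total = sum(partial_sums)   # combine per-partition results
```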

8. Location-Stickiness

To compute partitions, RDDs can define placement preferences. A placement preference is information about the location of an RDD. The DAG scheduler places the tasks in such a way that each task is as close to its data as possible. Because of this, computation speed improves.

9. Coarse-grained Operation

We apply coarse-grained transformations to RDDs. This means an operation applies not to an individual element but to the entire dataset of the RDD.
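The contrast can be shown in a few lines of plain Python: a coarse-grained operation applies one function to the whole dataset (like RDD.map), rather than mutating a single element in place.

```python
# Coarse-grained sketch: one transformation applied to the whole dataset,
# producing a new dataset, instead of fine-grained in-place updates.
dataset = [1, 2, 3, 4]
transformed = list(map(lambda x: x + 1, dataset))  # whole-dataset transform
```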

10. No limitation

We can use any number of RDDs; there is no limit on the number. The practical limit depends only on the size of the disk and memory.