RowMatrix is a class in PySpark's MLlib library that represents a row-oriented distributed matrix. Each row is stored as a local vector, either a SparseVector or a DenseVector, and the rows are distributed across the cluster as an RDD. RowMatrix is useful for performing operations on datasets that are too large to fit into memory on a single machine.
Creating a RowMatrix in PySpark
To create a RowMatrix in PySpark, you first need to create an RDD of vectors. You can create an RDD of DenseVectors by calling the parallelize() method on a list of DenseVector objects and then passing that RDD to the RowMatrix constructor:
!pip install pyspark

from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors, DenseVector, SparseVector
from pyspark.mllib.linalg.distributed import RowMatrix, IndexedRowMatrix, CoordinateMatrix

# create a SparkSession object
spark = SparkSession.builder.appName("RowMatrixExample").getOrCreate()

# create an RDD of vectors
data = [
    DenseVector([1.0, 2.0, 3.0]),
    DenseVector([4.0, 5.0, 6.0]),
    DenseVector([7.0, 8.0, 9.0])
]
rdd = spark.sparkContext.parallelize(data)
row_matrix = RowMatrix(rdd)
You can also create an RDD of SparseVectors, each specified by its size and a list of (index, value) tuples for its non-zero elements.
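# convert the RowMatrix to an IndexedRowMatrix; zipWithIndex() yields (vector, index)
# pairs, so swap them into the (index, vector) order that IndexedRowMatrix expects
indexed_matrix = IndexedRowMatrix(row_matrix.rows.zipWithIndex().map(lambda x: (x[1], x[0])))
coordinate_matrix = indexed_matrix.toCoordinateMatrix()
transposed_coordinate_matrix = coordinate_matrix.transpose()

# compute the cosine similarities between the columns of the transposed matrix
col_sims = transposed_coordinate_matrix.toIndexedRowMatrix().columnSimilarities()

# the result is a CoordinateMatrix, so collect its entries to print them
print(col_sims.entries.collect())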
This example creates a SparkSession object, creates an RDD of vectors, and creates a RowMatrix object from the RDD. It then converts the RowMatrix to an IndexedRowMatrix by pairing each row with an index from zipWithIndex() and passing the result to the IndexedRowMatrix() constructor. Next, it converts the IndexedRowMatrix to a CoordinateMatrix with toCoordinateMatrix(), transposes it with transpose(), and converts the transposed CoordinateMatrix back to an IndexedRowMatrix with toIndexedRowMatrix(). Finally, it calls columnSimilarities(), which returns a CoordinateMatrix of cosine similarities between columns, and prints the result. Because the matrix was transposed first, these values are the cosine similarities between the rows of the original matrix, not column summary statistics.
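To make the result of that pipeline easier to check by hand, here is a small pure-Python sketch that computes the cosine similarity between each pair of rows of the same 3 x 3 data. It is a local illustration only, not how Spark computes the values; the function name `cosine` is introduced here, and the i < j pairing mirrors the upper-triangular layout that columnSimilarities() is documented to return.

```python
import math
from itertools import combinations

# Same values as the DenseVectors used to build row_matrix.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One entry per pair of rows with i < j.
row_similarities = {
    (i, j): cosine(rows[i], rows[j])
    for i, j in combinations(range(len(rows)), 2)
}

for (i, j), sim in sorted(row_similarities.items()):
    print(f"rows {i} and {j}: {sim:.4f}")
```

The entries collected from the Spark result should agree with these values up to floating-point error.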
Compute Singular Value Decomposition (SVD)
To compute the Singular Value Decomposition (SVD) of a RowMatrix, you call the computeSVD() method. The SVD factors a matrix into three matrices: U, S, and V. The columns of U are the left singular vectors of the original matrix, and the columns of V are the right singular vectors. S is a diagonal matrix whose diagonal entries are the singular values, in descending order. computeSVD() returns a truncated decomposition: only the top k singular values and their vectors are kept, so U and V each have k orthonormal columns instead of being full square unitary matrices.
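For reference, the factorization can be written out with its dimensions when only the top k singular values are kept. The symbols m and n (the number of rows and columns of the original matrix A) and the sigma notation are introduced here for illustration; they are not names from the PySpark API.

```latex
A \approx U \Sigma V^{\top}, \qquad
U \in \mathbb{R}^{m \times k}, \quad
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_k), \quad
V \in \mathbb{R}^{n \times k},
\qquad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k \ge 0,
\qquad
U^{\top} U = V^{\top} V = I_k .
```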
svd = row_matrix.computeSVD(k=2, computeU=True)
U = svd.U  # The U factor is a RowMatrix.
s = svd.s  # The singular values are stored in a local dense vector.
V = svd.V  # The V factor is a local dense matrix.
The k parameter specifies the number of leading (largest) singular values to keep. The computeU parameter specifies whether to compute the U matrix; it defaults to False, in which case svd.U is None.
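To see what the three factors mean numerically, the following sketch runs a local SVD with NumPy on the same 3 x 3 data and keeps the top k = 2 components. This is only an illustration with numpy.linalg.svd, not Spark's distributed implementation, and singular vectors from the two libraries may differ in sign. Because the third row equals twice the second minus the first, the matrix has rank 2, so two components are enough to reconstruct it.

```python
import numpy as np

# Same values as the rows of row_matrix.
A = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])
k = 2

# NumPy returns V transposed, with singular values sorted in descending order.
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)

U = U_full[:, :k]     # m x k, left singular vectors as columns
s = s_full[:k]        # the k largest singular values
V = Vt_full[:k, :].T  # n x k, right singular vectors as columns

# Rank-k reconstruction: U * diag(s) * V^T
A_k = U @ np.diag(s) @ V.T

print("singular values:", s)
print("reconstruction matches A:", np.allclose(A_k, A))
```

The singular values printed here should match the entries of svd.s from computeSVD(k=2) up to floating-point error.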
Important Spark URLs for reference