Thursday, 14 July 2016

PySpark: Merging lists which are in a dataframe column

I have a dataframe as shown below. I want to merge the lists if they have at least one value in common. Any of the component numbers may be kept. For example, [1,2] and [1,4,9] share the value 1, so they are merged into [1,2,4,9]. [1,2] has component number 80 and [1,4,9] has component number 30; the merged list [1,2,4,9] can take either one. In the example given below, I have used 30.

Is it possible to solve this with DataFrame or RDD operations, avoiding iteration as much as possible? Thanks.
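To make the intended merge rule concrete, here is a minimal plain-Python sketch of it, not a distributed solution. It treats lists that share a value as belonging to the same group (a connected-components problem) and resolves the groups with a small union-find. The function name `merge_overlapping`, the `(component, values)` row shape, and the third sample row are assumptions made for illustration; the sketch keeps the first component number it sees for each group, which the question says is acceptable.

```python
from collections import defaultdict


def merge_overlapping(rows):
    """Merge lists that share at least one value.

    rows: iterable of (component, values) pairs (hypothetical row shape).
    Returns a list of (component, sorted_merged_values) pairs, keeping the
    first component number encountered for each merged group.
    """
    rows = list(rows)
    parent = {}

    def find(x):
        # Register unseen values as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every value in a list to that list's first value.
    for _, values in rows:
        for v in values:
            find(v)
        for v in values[1:]:
            union(values[0], v)

    # Collect all values under their final root.
    groups = defaultdict(set)
    component_of = {}
    for component, values in rows:
        if not values:
            continue  # an empty list cannot overlap with anything
        root = find(values[0])
        groups[root].update(values)
        component_of.setdefault(root, component)

    return [(component_of[root], sorted(vals)) for root, vals in groups.items()]


# The two lists from the question plus one made-up, non-overlapping row.
sample = [(80, [1, 2]), (30, [1, 4, 9]), (50, [7, 8])]
print(merge_overlapping(sample))
```

Here the merged group [1, 2, 4, 9] ends up with component number 80 rather than 30, simply because that row comes first. On a cluster the same grouping would have to be computed as a connected-components job over the values; this sketch only shows the logic on data that fits on the driver.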
