Sunday, July 17, 2016

How to efficiently add a new key to an RDD in pyspark

I have RDDs in two formats: the first is ((provider, currency), value), where the key is (provider, currency), and the second is (provider, value), where the key is provider.

What I want to do is transform an RDD from the (provider, value) format into the ((provider, currency), value) format. In practice, the value from the (provider, value) RDD will be the same for every currency of that provider in the new ((provider, currency), value) RDD.

How could this be done in an efficient way, without having to collect() the RDDs and loop through them?
