Ad Code

Data transformation with pyspark, pyspark.sql, extract matching string to create new column

Data transformation with pyspark, pyspark.sql, extract matching string to create new column



In Detail steps

Create a sample dataframe

data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),        
('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df extract Employee name from column using regexp_extract(column_name, regex, group_number).

Here regex('(.)(by)(\s+)(\w+)') means

  • (.) - Any character (except newline)
  • (by) - Word by in the text
  • (\s+) - One or many spaces
  • (\w+) - Alphanumeric or underscore chars of length one

and group_number is 4 because group (\w+) is in 4th position in expression

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+

Other examples

data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),        
('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+ | ID| Notes| +----+--------------------+ |2345| Checked by John| |2398| Verified by Stacy| |2328|Verified by Srini...| |3983|Double Checked on...| +----+--------------------+
from pyspark.sql.functions import regexp_extract 
from pyspark.sql.functions import col

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+ | ID| Notes|Employee| +----+--------------------+--------+ |2345| Checked by John| John| |2398| Verified by Stacy| Stacy| |2328|Verified by Srini...|Srinivas| |3983|Double Checked on...| Marsha| +----+--------------------+--------+
Reactions

Post a Comment

0 Comments