Data Science labs blog

Data transformation with PySpark: using pyspark.sql to extract a matching string into a new column


Detailed steps

Create a sample dataframe

data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

# Build a DataFrame from the list of tuples (assumes an active SparkContext `sc`)
df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
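
If a SparkSession is available (assumed here to be named spark, as in newer PySpark shells), the same DataFrame can be built without going through the RDD API. A minimal sketch:

# Sketch: assumes a SparkSession object named `spark`
df = spark.createDataFrame(data, ['ID', 'Notes'])
df.show()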

Import the required functions

from pyspark.sql.functions import regexp_extract, col

Extract the employee name from the Notes column using regexp_extract(column, pattern, group_number).

Here the pattern (.)(by)(\s+)(\w+) breaks down as:

  • (.) - any single character except a newline
  • (by) - the literal text "by"
  • (\s+) - one or more whitespace characters
  • (\w+) - one or more word characters (letters, digits, or underscore)

and group_number is 4 because (\w+) is the fourth capturing group in the expression.
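
To sanity-check the group numbering outside Spark, the same pattern can be tested with Python's built-in re module (a quick sketch, not part of the original example):

import re

# Group 4 is the name that follows 'by'
m = re.search(r'(.)(by)(\s+)(\w+)', 'Checked by John')
print(m.group(4))  # John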

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
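
A simpler variation (a sketch on the same data): with a single capturing group the group number is just 1, and regexp_extract returns an empty string for rows where the pattern does not match at all.

# Single capturing group, so group_number is 1
result2 = df.withColumn('Employee', regexp_extract(col('Notes'), r'by\s+(\w+)', 1))
result2.show()

On this data it produces the same Employee column as above.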
