Data transformation with pyspark, pyspark.sql, extract matching string to create new column

From Zezo August 24, 2020

Data transformation with pyspark, pyspark.sql, extract matching string to create new column

In Detail steps

Create a sample dataframe

data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),        
('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df extract Employee name from column using regexp_extract(column_name, regex, group_number).

Here regex('(.)(by)(\s+)(\w+)') means

(.) - Any character (except newline)
(by) - Word by in the text
(\s+) - One or many spaces
(\w+) - Alphanumeric or underscore chars of length one

and group_number is 4 because group (\w+) is in 4th position in expression

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+

Other examples

data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),        
('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

from pyspark.sql.functions import regexp_extract 
from pyspark.sql.functions import col

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+

Reactions

Data Science and AI Labs

Data transformation with pyspark, pyspark.sql, extract matching string to create new column

Data transformation with pyspark, pyspark.sql, extract matching string to create new column

In Detail steps

Other examples

Post a Comment

0 Comments

Followers-Theo dõi

Mã Giảm Giá

Categories

Featured Post

Deploy LLM in HuggingFace Spaces For Free Using Ollama

Popular Posts

Connect and Read, Write to Azure Blob Storage from Databricks

Tags

Recent Posts

Tags

Recent in Sports

Mã giảm giá

Footer Menu Widget

Data Science and AI Labs

Ad Code

Social Widget

Data transformation with pyspark, pyspark.sql, extract matching string to create new column

Data transformation with pyspark, pyspark.sql, extract matching string to create new column

In Detail steps

Other examples

Related Posts

Post a Comment

0 Comments

Followers-Theo dõi

Mã Giảm Giá

Categories

Featured Post

Deploy LLM in HuggingFace Spaces For Free Using Ollama

Popular Posts

Connect and Read, Write to Azure Blob Storage from Databricks

Tags

Recent Posts

Tags

Recent in Sports

Mã giảm giá

Footer Menu Widget