Popular Kaggle-related file operations in Python
File operations come up constantly when writing code on Kaggle. This post covers the basic, frequently used file operations in Python.
Extract zip archives
The standard-library zipfile module is the usual way to work with zip archives in Python. The following sample function extracts an archive into the current directory.
# Extract zip files
import zipfile

def extract_images(filePath):
    # Unpack every member of the archive into the current directory
    with zipfile.ZipFile(filePath, "r") as z:
        z.extractall(".")

# Test
extract_images('/kaggle/input/facebook-recruiting-iv-human-or-bot/train.csv.zip')
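If the files should land somewhere other than the current directory, the same idea can take a target folder as a parameter. The following is a minimal sketch, not part of the original function: the name extract_archive, its target_dir parameter, and the demo archive are all made up for illustration. It also returns the archive's member names so you can see what was unpacked.

```python
import os
import tempfile
import zipfile


def extract_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the archive's member names."""
    # Hypothetical helper: create the target folder if it does not exist yet.
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    return names


# Self-contained demo: build a tiny archive in a temporary folder, then extract it.
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("train.csv", "id,label\n1,0\n2,1\n")

extracted = extract_archive(demo_zip, os.path.join(work_dir, "out"))
print(extracted)
```

On Kaggle, the target folder would typically be a writable location such as /kaggle/working.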
Read Image file
OpenCV's Python module, cv2, is a popular choice for reading image content. The following sample code reads every image file in a directory and resizes each one to 64x64 pixels.
import os
import cv2

def read_image_content(image_directory):
    # Get list of all the files
    image_list = os.listdir(image_directory)
    data = []
    # Iterate each file to read content
    for img_name in image_list:
        # Get full image file path
        image_full_path = os.path.join(image_directory, img_name)
        # cv2.imread returns None when the file cannot be read as an image
        img_content = cv2.imread(image_full_path, cv2.IMREAD_COLOR)
        try:
            img_resized = cv2.resize(img_content, (64, 64))
            data.append(img_resized)
        except Exception as e:
            print("Image with issue name, path", img_name, image_full_path)
            print(str(e))
    return data
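Because os.listdir returns every entry in the folder, including subfolders and non-image files, it can help to filter the list before handing it to cv2. The following is a rough sketch using only the standard library. The helper name list_image_files and the IMAGE_EXTENSIONS set are assumptions for illustration, so adjust the extensions to match your dataset.

```python
import os
import tempfile

# Assumed set of extensions; adjust to match the dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def list_image_files(image_directory):
    """Return sorted full paths of files whose extension looks like an image."""
    paths = []
    for name in sorted(os.listdir(image_directory)):
        full_path = os.path.join(image_directory, name)
        extension = os.path.splitext(name)[1].lower()
        # Keep regular files only, and only those with a known image extension.
        if os.path.isfile(full_path) and extension in IMAGE_EXTENSIONS:
            paths.append(full_path)
    return paths


# Self-contained demo with empty placeholder files in a temporary folder.
demo_dir = tempfile.mkdtemp()
for name in ["cat.JPG", "dog.png", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()
os.mkdir(os.path.join(demo_dir, "subfolder.png"))  # directories are skipped

image_paths = list_image_files(demo_dir)
print([os.path.basename(p) for p in image_paths])
```

A filter like this only checks names, so the try/except around the resize call is still worth keeping for files that have an image extension but are corrupt.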
Show progress bar while performing a time-consuming task
Some tasks in Python take a long time to run, and it helps to show a progress bar while they work. tqdm is a popular library for displaying the progress of an operation.
The following code is the image-reading function again, modified to display progress as the files are read. Only the for loop changes: it wraps the list in tqdm.
import os
import cv2
from tqdm import tqdm

def read_image_content(image_directory):
    # Get list of all the files
    image_list = os.listdir(image_directory)
    data = []
    # Iterate each file to read content, showing a progress bar
    for img_name in tqdm(image_list):
        # Get full image file path
        image_full_path = os.path.join(image_directory, img_name)
        # cv2.imread returns None when the file cannot be read as an image
        img_content = cv2.imread(image_full_path, cv2.IMREAD_COLOR)
        try:
            img_resized = cv2.resize(img_content, (64, 64))
            data.append(img_resized)
        except Exception as e:
            print("Image with issue name, path", img_name, image_full_path)
            print(str(e))
    return data
Following is a visual sample to show a progress bar.
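To see a bar without any image files at hand, a small stand-alone loop is enough. This is only a sketch: the function process_items and the sleep call standing in for slow work are made up for illustration, and the fallback import is there only so the snippet still runs where tqdm is not installed. The desc and total arguments label the bar and tell tqdm how many steps to expect.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm installed: no bar, same loop.
    def tqdm(iterable, **kwargs):
        return iterable


def process_items(items):
    """Square each item while tqdm reports progress."""
    results = []
    for item in tqdm(items, desc="Processing", total=len(items)):
        time.sleep(0.01)  # stand-in for slow work
        results.append(item * item)
    return results


squares = process_items([1, 2, 3, 4])
print(squares)
```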
Data frames using pandas
pandas is the most popular Python library for reading CSV files into a data frame. It has many built-in APIs that make data cleanup and processing much easier. The following are a few widely used pandas APIs.
Read CSV File
import pandas as pd

train_df = pd.read_csv('/kaggle/working/train.csv')
train_df.head()
Check data types for all columns
train_df.dtypes
Check the count of null values for each column
train_df.isnull().sum()
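To try these calls without a file on disk, the same three steps can be run on a small in-memory CSV. This is a minimal sketch: the column names and values are invented for illustration and do not come from any Kaggle dataset.

```python
import io

import pandas as pd

# Made-up CSV content used only for illustration; two fields are left empty.
csv_text = "bidder_id,age,country\na1,34,us\nb2,,in\nc3,29,\n"

# read_csv accepts a file-like object as well as a path.
demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df.head())
print(demo_df.dtypes)

# One missing value each in "age" and "country", none in "bidder_id".
null_counts = demo_df.isnull().sum()
print(null_counts)
```

Note that the age column is read as a float rather than an integer, because the missing value is stored as NaN.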