Popular Kaggle-related file operations in Python

Dilip Kumar
3 min read · Jan 12, 2022

File operations come up constantly when writing code on Kaggle. In this post, we will cover the most basic and frequently used file operations in Python.

Extract zip archives

The most popular Python module for working with zip archives is zipfile, which is part of the standard library. The following sample function extracts a zip archive into the current directory.

# Extract zip files
import zipfile

def extract_images(filePath):
    # Open the archive in read mode and extract everything into the current directory
    with zipfile.ZipFile(filePath, "r") as z:
        z.extractall(".")

# Test
extract_images('/kaggle/input/facebook-recruiting-iv-human-or-bot/train.csv.zip')
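On Kaggle, /kaggle/input is read-only, so it is often handy to extract into a specific folder and to see which files came out of the archive. The following is a minimal sketch of that variation. The function name extract_to and the throwaway demo archive are assumptions made for illustration, not part of the original code.

```python
import os
import tempfile
import zipfile

def extract_to(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as z:
        names = z.namelist()       # names of the files stored in the archive
        z.extractall(target_dir)   # extract into the chosen folder
    return names

# Demo on a throwaway archive so the sketch runs anywhere
demo_root = tempfile.mkdtemp()
demo_zip = os.path.join(demo_root, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("train.csv", "id,label\n1,0\n")

extracted = extract_to(demo_zip, os.path.join(demo_root, "out"))
print(extracted)
```

Returning the member names makes it easy to pass the extracted file straight to the next step, for example reading it as a CSV.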

Read Image file

cv2, the Python module for OpenCV, is a popular choice for reading image content. The following sample code reads every image file in a directory and resizes each one to 64×64 pixels.

import os
import cv2

def read_image_content(image_directory):
    # Get list of all the files
    image_list = os.listdir(image_directory)
    data = []
    # Iterate each file to read content
    for img_name in image_list:
        # Get full image file path
        image_full_path = os.path.join(image_directory, img_name)
        # cv2.imread returns None if the file cannot be read as an image
        img_content = cv2.imread(image_full_path, cv2.IMREAD_COLOR)
        try:
            # Resize the image to 64x64 pixels
            img_resized = cv2.resize(img_content, (64, 64))
            data.append(img_resized)
        except Exception as e:
            print("Image with issue name, path", img_name, image_full_path)
            print(str(e))
    return data
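Note that os.listdir returns every entry in the directory, including subdirectories and non-image files, and each of those ends up in the except branch. One possible way to avoid that is to filter the names by extension before reading. This is only a sketch: the helper name list_image_files and the set of extensions are assumptions, so adjust them to the dataset at hand.

```python
import os
import tempfile

# Assumed set of extensions for this sketch; extend it as needed
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}

def list_image_files(image_directory):
    """Return the sorted names of regular files that have an image extension."""
    names = []
    for name in sorted(os.listdir(image_directory)):
        full_path = os.path.join(image_directory, name)
        ext = os.path.splitext(name)[1].lower()  # compare extensions case-insensitively
        if os.path.isfile(full_path) and ext in IMAGE_EXTENSIONS:
            names.append(name)
    return names

# Demo on a temporary directory with two images, one text file and one folder
demo_dir = tempfile.mkdtemp()
for name in ["a.png", "b.JPG", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()
os.mkdir(os.path.join(demo_dir, "subdir"))

image_names = list_image_files(demo_dir)
print(image_names)
```

The returned list could then replace the plain os.listdir call in a function like read_image_content.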

Show progress bar while performing a time-consuming task

Some tasks in Python take a long time to run, and it is useful to show a progress bar while they do. tqdm is a popular library for showing the progress of an operation.

The following is the same image-reading code, modified to display progress while the files are read. Apart from importing tqdm, only the for loop changes: the list being iterated is wrapped in tqdm.

import os
import cv2
from tqdm import tqdm

def read_image_content(image_directory):
    # Get list of all the files
    image_list = os.listdir(image_directory)
    data = []
    # Iterate each file to read content, showing a progress bar
    for img_name in tqdm(image_list):
        # Get full image file path
        image_full_path = os.path.join(image_directory, img_name)
        # cv2.imread returns None if the file cannot be read as an image
        img_content = cv2.imread(image_full_path, cv2.IMREAD_COLOR)
        try:
            # Resize the image to 64x64 pixels
            img_resized = cv2.resize(img_content, (64, 64))
            data.append(img_resized)
        except Exception as e:
            print("Image with issue name, path", img_name, image_full_path)
            print(str(e))
    return data
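The same pattern works for any loop, not just image reading. Below is a small self-contained sketch that wraps an arbitrary list in tqdm and labels the bar with the desc argument. The function process_items and the sleep call are stand-ins for real work, and the fallback for a missing tqdm install is an assumption added only so the sketch runs everywhere.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Fallback (assumption for this sketch): iterate without a progress bar
    def tqdm(iterable, **kwargs):
        return iterable

def process_items(items):
    """Double each item, showing a labelled progress bar while looping."""
    results = []
    for item in tqdm(items, desc="Processing"):
        time.sleep(0.01)  # stand-in for a slow operation
        results.append(item * 2)
    return results

doubled = process_items(list(range(5)))
print(doubled)
```

Because tqdm only wraps the iterable, the loop body stays exactly as it was.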

Following is a visual sample to show a progress bar.

Data frames using pandas

pandas is the most popular Python library for reading CSV files into a data frame. It has many built-in APIs that make data cleanup and processing easy. The following are a few widely used pandas APIs.

Read CSV File

import pandas as pd

train_df = pd.read_csv('/kaggle/working/train.csv')
train_df.head()

Check data types for all columns

train_df.dtypes

Check the count of null values for each column

train_df.isnull().sum()
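To try these three calls without a Kaggle dataset, here is a minimal sketch that runs them on a tiny in-memory CSV. The column names bidder_id and outcome and the values are made up for illustration.

```python
import io

import pandas as pd

# Tiny made-up CSV; the second row has a missing "outcome" value
csv_text = "bidder_id,outcome\nb1,1.0\nb2,\nb3,0.0\n"

# read_csv accepts a file-like object as well as a path
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.head())

# Data type of every column
print(demo_df.dtypes)

# Number of null values per column
null_counts = demo_df.isnull().sum()
print(null_counts)
```

isnull() returns a data frame of booleans with the same shape as the input, and sum() then adds up the True values column by column, which gives the per-column null count.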
