@astappiev
Last active April 25, 2025 12:43

Filesystem Size Statistics Tool

Overview

fs_sizes_stats.py is a Python utility that analyzes file size distributions across a directory tree. It scans a specified directory (or the current directory by default), collects file size information, and provides statistical summaries to help understand storage usage patterns.

Features

  • Fast recursive directory scanning with error handling for inaccessible files/directories
  • Categorizes files into configurable size buckets (from <1KB to ≥1GB)
  • Provides comprehensive statistics including:
    • Total file count
    • Total storage usage
    • Average file size
    • Distribution of files across size buckets
    • Percentage of files in each size category
    • Total size of files in each bucket
    • Percentage of total storage used by each size category
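The bucket logic described above can be sketched independently of the full script. This is a minimal illustration (using the same 1024-based thresholds as the script; the `classify` helper name is introduced here for the example):

```python
# 1024-based thresholds, matching the script's bucket boundaries.
THRESHOLDS = [
    (1024, "<1KB"),
    (10 * 1024, "1KB–10KB"),
    (100 * 1024, "10KB–100KB"),
    (1024 * 1024, "100KB–1MB"),
    (10 * 1024 * 1024, "1MB–10MB"),
    (100 * 1024 * 1024, "10MB–100MB"),
    (1024 * 1024 * 1024, "100MB–1GB"),
]

def classify(size):
    """Return the label of the first bucket whose upper bound exceeds size."""
    for limit, label in THRESHOLDS:
        if size < limit:
            return label
    return "≥1GB"

print(classify(512))    # <1KB
print(classify(2048))   # 1KB–10KB
print(classify(2**31))  # ≥1GB
```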

Usage

python fs_sizes_stats.py [directory_path]

If no directory path is provided, the script will analyze the current working directory.

Example Output

Time taken: 23.8087 seconds

Total files: 1417378
Total size: 1.67TB
Average file size: 1.24MB

Size distribution:
Size Range      | Files      | % Files | Size       | % Size
------------------------------------------------------------
<1KB            |     395728 |  27.92% | 173.59MB   |   0.01%
1KB–10KB        |     668171 |  47.14% | 2.31GB     |   0.14%
10KB–100KB      |     294747 |  20.80% | 8.10GB     |   0.47%
100KB–1MB       |      39034 |   2.75% | 10.79GB    |   0.63%
1MB–10MB        |      11070 |   0.78% | 37.73GB    |   2.21%
10MB–100MB      |       4722 |   0.33% | 206.72GB   |  12.08%
100MB–1GB       |       3768 |   0.27% | 855.78GB   |  50.01%
≥1GB            |        138 |   0.01% | 589.54GB   |  34.45%
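The sizes in the table above are formatted with binary (1024-based) units, so 1.67TB means roughly 1.67 × 1024⁴ bytes. A standalone sketch of that conversion, the same approach the script uses:

```python
def human_readable_size(size, decimal_places=2):
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    unit_index = 0
    # Divide by 1024 until the value fits under the next unit boundary.
    while size >= 1024 and unit_index < len(units) - 1:
        size /= 1024.0
        unit_index += 1
    return f"{size:.{decimal_places}f}{units[unit_index]}"

print(human_readable_size(2048))       # 2.00KB
print(human_readable_size(1_792_851))  # 1.71MB
```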
fs_sizes_stats.py

import os
import sys
import time

# Size buckets in bytes: each label is paired with a predicate on the file size.
buckets = [
    ("<1KB", lambda size: size < 1024),
    ("1KB–10KB", lambda size: 1024 <= size < 10 * 1024),
    ("10KB–100KB", lambda size: 10 * 1024 <= size < 100 * 1024),
    ("100KB–1MB", lambda size: 100 * 1024 <= size < 1024 * 1024),
    ("1MB–10MB", lambda size: 1024 * 1024 <= size < 10 * 1024 * 1024),
    ("10MB–100MB", lambda size: 10 * 1024 * 1024 <= size < 100 * 1024 * 1024),
    ("100MB–1GB", lambda size: 100 * 1024 * 1024 <= size < 1024 * 1024 * 1024),
    ("≥1GB", lambda size: size >= 1024 * 1024 * 1024),
]


def get_file_sizes(root_dir):
    """Recursively collect file sizes under root_dir, skipping unreadable entries."""
    sizes = []
    try:
        for entry in os.scandir(root_dir):
            try:
                if entry.is_file(follow_symlinks=False):
                    sizes.append(entry.stat(follow_symlinks=False).st_size)
                elif entry.is_dir(follow_symlinks=False):
                    sizes.extend(get_file_sizes(entry.path))
            except Exception:
                continue  # permission denied, broken entry, etc. — skip it
    except Exception:
        pass  # the directory itself is inaccessible
    return sizes


def human_readable_size(size, decimal_places=2):
    """Format a byte count using binary (1024-based) units."""
    if size == 0:
        return "0B"
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
    unit_index = 0
    # Keep dividing by 1024 until size is manageable or we run out of units.
    while size >= 1024 and unit_index < len(units) - 1:
        size /= 1024.0
        unit_index += 1
    return f"{size:.{decimal_places}f}{units[unit_index]}"


def print_stats(sizes):
    """Print overall totals and the per-bucket distribution table."""
    total_files = len(sizes)
    total_size = sum(sizes)
    avg_size = total_size / total_files if total_files else 0

    print(f"\nTotal files: {total_files}")
    print(f"Total size: {human_readable_size(total_size)}")
    print(f"Average file size: {human_readable_size(avg_size)}")

    print("\nSize distribution:")
    print(f"{'Size Range':15} | {'Files':10} | {'% Files':7} | {'Size':10} | {'% Size':8}")
    print("-" * 60)
    for label, cond in buckets:
        bucket_sizes = [s for s in sizes if cond(s)]
        bucket_total_files = len(bucket_sizes)
        bucket_total_size = sum(bucket_sizes)
        bucket_file_percent = (bucket_total_files / total_files * 100) if total_files else 0
        bucket_size_percent = (bucket_total_size / total_size * 100) if total_size else 0
        print(f"{label:15} | {bucket_total_files:10} | {bucket_file_percent:6.2f}% | "
              f"{human_readable_size(bucket_total_size):10} | {bucket_size_percent:6.2f}%")


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    start_time = time.time()
    sizes = get_file_sizes(directory)
    end_time = time.time()
    print(f"Time taken: {end_time - start_time:.4f} seconds")
    print_stats(sizes)