How To Split Up Larger Files Into Smaller Files In Linux

This article is about how to break up a big file into a number of smaller files in a size of your choice, and later re-assemble them into the original file. It describes step-by-step how to split the file, upload the pieces to a server, and then re-assemble them into the original file, verifying everything along the way.

This idea was originally forced upon me because I had to upload a large video (probably around 4 GB) but I had a very slow and unreliable internet connection. Specifically, I was in Manila, The Philippines which generally has slow and unreliable internet.

It can be extremely frustrating to upload a large file, only to find out the upload failed after several hours, or that it got throttled and will take many more hours to finish. So, in certain situations, it can be better to upload a file in little pieces (and then re-assemble the little pieces into the larger file) instead of uploading the whole file at once.

Step 1: Split Up The Larger File

Ubuntu and most other Linux distros should come with the command called ‘split’ because it is part of the GNU coreutils.

The general form of the command looks like the following
split --bytes=size_of_subfiles file_to_split_up what_to_start_names_of_subfiles

For example, I have a video file called 20181030_142025.mp4 which happens to be 767 MB. To split up the file into 100MB file sizes (with the final file taking up the remainder in size), I would do something like the following:

split --bytes=100M 20181030_142025.mp4 video_split_up

It would then create the following split up files named as follows (it goes in alphabetical order)

video_split_upaa
video_split_upab
video_split_upac
video_split_upad
video_split_upae
video_split_upaf
video_split_upag
video_split_upah

The files video_split_upaa through video_split_upag would all be 100M in size, but video_split_upah would be 67M in size since it takes up the remainder of the original file. So in total, they add up to 767 MB (the size of the original file).
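
If you want to see the naming and the remainder behaviour without a large video on hand, you can try the same command on a small throwaway file. The following is only a sketch: the file name sample.bin, the prefix sample_part_, and the scaled-down sizes (767 KiB split into 100 KiB pieces) are my own assumptions, chosen so the numbers mirror the example above.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a 767 KiB test file (hypothetical name) filled with random bytes.
head -c 785408 /dev/urandom > sample.bin    # 767 * 1024 = 785408 bytes

# Split into 100 KiB pieces; for split, the K suffix means 1024 bytes.
split --bytes=100K sample.bin sample_part_

# Show every piece with its size in bytes.
wc -c sample_part_*

piece_count=$(ls sample_part_* | wc -l)
last_size=$(wc -c < sample_part_ah)
echo "pieces: $piece_count, last piece: $last_size bytes"
```

One detail worth knowing: GNU split treats a plain M (or K) suffix as a power of 1024, so --bytes=100M means 100 x 1024 x 1024 bytes, while 100MB means 100 x 1000 x 1000 bytes.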

Step 2: Create Files Containing The MD5 Checksums Of Each Of The Split Up Files, And Another File With The MD5 Checksum Of The Original File

Though this step is optional, I highly recommend it. After splitting up the larger file into smaller parts, create a file with the md5 checksums of all of the split up files you made, and also make another md5 checksum file of the original file you wanted to split up, so later when you transfer the files, you can be (almost) 100% sure that the files were transferred correctly.

In order for you to re-assemble the split up files later into the original file, all of the split up files have to be transferred perfectly, so that is why md5 checksum files are so useful.

You can create the md5 checksums of all of the split up files with the following command:

md5sum video_split_upa* >> split_up_files.md5

Now for a description of the above command, step-by-step.
md5sum is a command that will create or check md5 checksums of files. After the md5sum command itself, you tell the command which file or files to calculate the md5 checksums of.

md5sum video_split_upa* means to calculate the md5 checksum of all files starting with video_split_upa.

Finally, the ‘>>’ part means to append all of the checksums to the file called split_up_files.md5. Normally, md5sum will just print to standard output in the terminal. But, instead, I am redirecting that output to a file.

So the command will create a file called split_up_files.md5 (if it does not already exist) containing the md5 checksums of all of those files.
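
One thing to watch with ‘>>’: because it appends, running the command a second time adds a second copy of every line to the file. The sketch below compares appending with overwriting (a single ‘>’). The demo_part_ file names and their contents are made up for the illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny stand-in pieces (hypothetical names and contents).
printf 'first piece\n'  > demo_part_aa
printf 'second piece\n' > demo_part_ab

# Appending twice leaves two copies of every checksum line.
md5sum demo_part_a* >> appended.md5
md5sum demo_part_a* >> appended.md5
appended_lines=$(wc -l < appended.md5)

# Overwriting twice leaves exactly one line per piece.
md5sum demo_part_a* > overwritten.md5
md5sum demo_part_a* > overwritten.md5
overwritten_lines=$(wc -l < overwritten.md5)

echo "appended: $appended_lines lines, overwritten: $overwritten_lines lines"
```

Duplicate lines do not break the check, since each one is simply verified again, but if you re-run the command after re-splitting a file, a single ‘>’ avoids keeping stale checksums around.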

Just for my case, inside of split_up_files.md5, I have the following content which was generated by the above command:

74d319528b7b0b604a01a22a4984b730 video_split_upaa
a7f2f2fa26bcd728bf96af37fb9b0144 video_split_upab
273f4595d664f8146e565846116df010 video_split_upac
39d0b2e27c014a62812e2d58811d9430 video_split_upad
02449fdcdeb1a2be6378606d68cf0761 video_split_upae
7a399832c055f9159c7e627bb6925da2 video_split_upaf
2109726ef2336afe92d80175bc161107 video_split_upag
328815ea41ced632e91dafbef12ceafb video_split_upah

Basically, those long strings are the calculated md5 checksums followed by 2 spaces, and then the corresponding file name.

You can verify that the files still match the checksums recorded in an md5 file. You use the same command, md5sum, and just add the -c flag to check against the md5 file.

For example, to check the file I created above, I would do
md5sum -c split_up_files.md5

I then get the following output:

video_split_upaa: OK
video_split_upab: OK
video_split_upac: OK
video_split_upad: OK
video_split_upae: OK
video_split_upaf: OK
video_split_upag: OK
video_split_upah: OK

You ideally will get OK for each one. If one of the checksums is wrong, you will get an error like the following

video_split_upac: FAILED

I got the failure because in the .md5 file, I changed 273f4595d664f8146e565846116df010 video_split_upac to 273f4595d664f8146e565846116df016 video_split_upac

I changed the last 0 to a 6. Since the checksum calculated from the file no longer matches the one recorded in the .md5 file, md5sum reports a failure.

So, md5 checksums are incredibly useful, and can save you lots of hassles and time making sure that you have transferred files 100% in full and uncorrupted.
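
The failure above came from editing the .md5 file by hand. In a real transfer it is the file itself that changes, and the result is the same. The following sketch simulates that with tiny made-up files (the demo_part_ names are hypothetical) and also shows that md5sum -c sets a non-zero exit status when any check fails, which makes it usable in scripts.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'piece one\n' > demo_part_aa
printf 'piece two\n' > demo_part_ab
md5sum demo_part_a* > demo_parts.md5

# Simulate a bad transfer: change one piece after the checksums were recorded.
printf 'corrupted\n' > demo_part_ab

# md5sum -c exits with a non-zero status if any file fails the check.
if md5sum -c demo_parts.md5; then
    check_result="all pieces OK"
else
    check_result="at least one piece FAILED"
fi
echo "$check_result"
```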

Additionally, I recommend creating a separate md5 checksum file just for the original file (though you could just append to the other md5 checksum file).

To do this, you would just do something like the following:

md5sum 20181030_142025.mp4 > original_file.md5

And then to check that md5sum file, I would do:
md5sum -c original_file.md5
If you get OK as the output, you know that you can continue.

Step 3: Upload The Split Up Files And MD5 Checksum Files To A Server Or Website

In this step, use whatever method you want to transfer the files. Upload the files to your website, upload them to a storage site like Google Drive and then download them to your server, or whatever you like.

The whole point of splitting up the files was so that you wouldn’t have to upload the original file all at once. So, just upload the split up files that we made with the split command, and also remember to upload the other 2 md5 files so we can verify that the files were transferred correctly.

Make sure that in addition to the file containing the md5 checksums of all of the split up files, you also upload the file with the md5 checksum of the original file, so that when we later re-assemble the split up files, we can check that the result is identical to the original.
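
Since the whole point is coping with an unreliable connection, it can help to upload the pieces one at a time and retry any piece that fails. The following is only a sketch of that idea. The upload_piece function is a hypothetical stand-in that copies into a local folder called fake_server; you would replace its body with whatever transfer command you really use. The file names and the limit of 3 attempts are also my assumptions.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir pieces fake_server

printf 'piece one\n' > pieces/demo_part_aa
printf 'piece two\n' > pieces/demo_part_ab

# Hypothetical stand-in for the real transfer. Replace the body with
# whatever you actually use (scp, rsync, a web upload tool, and so on).
upload_piece() {
    cp -- "$1" fake_server/
}

max_attempts=3
uploaded=0
for piece in pieces/demo_part_a*; do
    attempt=1
    # Keep trying this piece until it succeeds or the attempts run out.
    until upload_piece "$piece"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up on $piece after $attempt attempts" >&2
            break
        fi
        attempt=$((attempt + 1))
        sleep 1
    done
    [ -e "fake_server/$(basename "$piece")" ] && uploaded=$((uploaded + 1))
done
echo "uploaded $uploaded pieces"
```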

Step 4: Verify Split Up Files Were Transferred Correctly

With this step, we will be using the md5 checksum files we created earlier, along with the md5sum command, to verify that all of the split up files were transferred correctly.

So, with the examples I created above, use the following command (modify as you like):

md5sum -c split_up_files.md5

Ideally, you should get an OK for all of the outputs. If one or more of the outputs is FAILED, you will need to re-transfer the file or files that failed in the upload process.
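
With many pieces, it can be handy to print only the names that need to be re-transferred. The following is a sketch under a few assumptions: the demo_part_ files are made up, and LC_ALL=C is set so that md5sum prints its messages in English for awk to match.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'piece one\n'   > demo_part_aa
printf 'piece two\n'   > demo_part_ab
printf 'piece three\n' > demo_part_ac
md5sum demo_part_a* > demo_parts.md5

# Simulate problems: one piece corrupted, one piece missing entirely.
printf 'garbage\n' > demo_part_ab
rm demo_part_ac

# Keep only the names whose status starts with FAILED.
# Missing files are reported as "FAILED open or read", so they are caught too.
failed_pieces=$(LC_ALL=C md5sum -c demo_parts.md5 2>/dev/null |
    awk -F': ' '$2 ~ /^FAILED/ { print $1 }')

echo "Pieces to re-transfer:"
echo "$failed_pieces"
```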

Step 5: Re-assemble the original file from the split up files

Now, we will re-assemble all of the smaller files that we created earlier, and we will create the original file from them. (If you are just testing this command on your own computer, make sure you keep the original file in a separate folder, because the command below will overwrite any file that has the same name.)

To do this, we use the cat command, and re-direct the output to a file.

The general form of the command looks like the following:
cat form_of_prefix_of_split_up_files* > original_file_name

But in our case with the file names created above, it would be like the following:
cat video_split_upa* > 20181030_142025.mp4

Then, finally, we need to verify that the original file was re-created correctly. To do that, we will use the md5 checksum file we created earlier.

md5sum -c original_file.md5
You know that you did it correctly, if you get output like the following:
20181030_142025.mp4: OK

20181030_142025.mp4, if you remember, was the original file I wanted to transfer in the first place and had broken up into parts.
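
The reason the cat command puts the pieces back in the right order is that the shell expands the * wildcard into a sorted list of names, and the suffixes that split generates (aa, ab, ac, ...) sort in the same order the pieces were cut. If you want to rehearse the whole round trip locally first, the following sketch does it with a small random file. The sender and receiver folders, sample.bin, and the sample_part_ prefix are assumptions of mine, and a plain cp stands in for the real upload.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir sender receiver

# Sender side: make a test file, record its checksum, split it.
head -c 250000 /dev/urandom > sender/sample.bin
( cd sender &&
  md5sum sample.bin > original_file.md5 &&
  split --bytes=100K sample.bin sample_part_ &&
  md5sum sample_part_* > split_up_files.md5 )

# "Transfer": a local copy stands in for the real upload. The original
# sample.bin is deliberately not copied.
cp sender/sample_part_* sender/*.md5 receiver/

# Receiver side: check the pieces, join them, check the result.
( cd receiver &&
  md5sum -c split_up_files.md5 &&
  cat sample_part_* > sample.bin &&
  md5sum -c original_file.md5 )
round_trip_status=$?

# cmp is silent and exits 0 when the two files are byte-for-byte identical.
cmp sender/sample.bin receiver/sample.bin
cmp_status=$?
echo "round trip exit status: $round_trip_status, cmp exit status: $cmp_status"
```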

Now, since you know that the original file has been re-assembled correctly, you can choose to delete all of the split up files on the server, since they are no longer necessary.

You could do that with a command like the following on the server (or just delete them manually):
rm video_split_upa*
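
If you want to be extra careful, you can tie the deletion to the check, so that the pieces are only removed when the re-assembled file verifies. The following is a sketch with small stand-in files (all of the names are hypothetical):

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in files: a "re-assembled" file, its checksum, and two leftover pieces.
printf 'piece one\npiece two\n' > sample.bin
md5sum sample.bin > original_file.md5
printf 'piece one\n' > sample_part_aa
printf 'piece two\n' > sample_part_ab

# rm only runs if the check succeeds (exit status 0).
md5sum -c original_file.md5 && rm -- sample_part_a*

echo "left in directory:"
ls
```

If the check fails, the && stops the rm from running, so the pieces are still there for another attempt.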
