BUAA 2023 Spring OS

Prologue

In this lab, we finally come to File System. For MOS, a micro-kernel OS, file system serves as a process in user space, and all file operations are done via IPC between user process and this file system process. As we introduce file system to our OS, we have to deal with disk. Thus there is another part of code to handle memory and disk.

Tips for Windows Players

If you do labs on Windows with Visual Studio and WSL, you may encounter some subtle errors. Expand this part if your seek for solutions. 🙃

Visual Studio Tips

You may encounter an error saying that DT_DIR is undefined. It may not be a problem when you compile the project, but IntelliSense will show this as an error. Expand it if you have such

The reason is that, though with WSL, our code is on Windows, and we use MinGW (Minimal GNU for Windows). This macro is defined in standard dirent.h header file, but surrounded with a #if macro. Of course, _WIN32 is not defined since our OS will be compiled on Linux. But _BSD_SOURCE is not defined neither! 😵‍💫

// dirent.h
#if defined(_BSD_SOURCE) || defined(_WIN32)
#define DT_REG   	_A_NORMAL
#define DT_DIR   	_A_SUBDIR
#endif

So, the solution is that, in the CppProperties.json, add defines entry in configurations with _BSD_SOURCE declared. The rest of the properties are omitted. This works only as a reminder to Visual Studio IntelliSense, so won’t have actual effect on our compilation and run.

{
  "configurations": [
    {
      "defines": [ "_BSD_SOURCE" ] // enable DT_DIR
    }
  ]
}

If you don’t know what is CppProperties.json is, please refer to my previous post.

Configure-VS-for-Non-MSVC-Projects

Warning

For Windows, the EOL (End of line) symbol is CRLF (\r\n), while for Linux, it is LF (\n). In this lab, you have to pay attention to this subtle difference, for it may cause slight errors. For example, the string in our code is “hello\n”, but in the file, it is “hello\r\n” if you accept the default CRLF on Windows. So, when doing string compare, these two strings will not match each other, while they should.

Files you should especially pay attention to are tests/lab5_2/newmotd and tests/lab5_2/motd.

You can also refer to this post to change default EOL when saving a file: Encodings and line endings.

By printing the length of the strings. 🤪

1. Disk Access

The purpose of the File System is to make memory persistent. But our RAM is not persistent. So before we formally introduce File System, we have to get pass hard disk.

1.1 Device

Disk, the same as all other devices, are accessed though the register it provides, and these registers are mapped to our kernel memory. So we can manipulate them by read and write certain memory range.

In include/mmu.h, we can see the memory layout of MOS. And you can find that devices are mapped to kseg1.

I think now it is a good time for us to review memory accessing process.

You can see that kseg1 is the only segment that doesn’t go though cache. Ya know, you just cannot cache the input or output of keyboard, or any other devices.

1.2 Read & Write Device Segment

Now that we have to communicate with device by read and write kseg1, it is time to write such functions to do this job.

Just like when we access kernel address using KADDR and PADDR, here, we also need such conversion between device address and its correspondent in kseg1. It can be really simple like this.
1
#define KSEG1ADDR(pa) ((pa) + KSEG1)

As we know, such operations are done via system call. So here is the system call. va is the address of the value we want to write into or read out of the device located at address pa. And len is the length of data we want.

1
2
3

// kern/syscall_all.c
int sys_write_dev(u_int va, u_int pa, u_int len);
int sys_read_dev(u_int va, u_int pa, u_int len);

For MOS, we have the following there devices, and they are fixed to certain physical addresses. The detailed info for them are defined in corresponding header files.

Device	Start Address	Length	Header File
Console	0x10000000	0x20	`include/drivers/dev_cons.h`
IDE Disk	0x13000000	0x4200	`include/drivers/dev_disk.h`
RTC	0x15000000	0x200	`include/drivers/dev_rtc.h`

sys_write/read_dev

Before we read and write, we must validate address first.

Validate Address

First, since File System runs in user space, the va is limited to user space only. So we can simply check it by verify if it is in user space range.

static inline int is_illegal_va_range(u_long va, u_int len)
{
    if (len == 0)
        return 0;

    return va + len < va || va < UTEMP || va + len > UTOP;
}

Then, it is the physical address of device. Here I added a auxiliary range check function. For device address, we must make sure it is within one of the device memory range mentioned above.

static inline int in_range(u_long base, u_int range, u_long begin, u_int len)
{
    return (base <= begin) && (begin + len <= base + range);
}

static inline int is_illegal_dev_pa_range(u_long pa, u_int len)
{
    if (len == 0)
        return 0;
    if (pa + len < pa)
        return 1;
    if (!(in_range(0x10000000, 0x20,   pa, len) ||
          in_range(0x13000000, 0x4200, pa, len) ||
          in_range(0x15000000, 0x200,  pa, len)))
    {
        return 1;
    }

    return 0;
}

After address validation, we can read and write device as we will. Just use memcpy for memory exchange.

int sys_write_dev(u_int va, u_int pa, u_int len)
{
    if (is_illegal_va_range(va, len) || is_illegal_dev_pa_range(pa, len))
        return -E_INVAL;
    memcpy((void*)KSEG1ADDR(pa), (const void*)va, len);
    return 0;
}

int sys_read_dev(u_int va, u_int pa, u_int len)
{
    if (is_illegal_va_range(va, len) || is_illegal_dev_pa_range(pa, len))
        return -E_INVAL;
    memcpy((void*)va, (const void*)KSEG1ADDR(pa), len);
    return 0;
}

Just notice that pa is the physical address, and we have to add kseg1 offset to it.

1.3 IDE Read & Write

With sys_read_dev and sys_write_dev, we can communicate with device by read and write corresponding memory range. So it is time to really read and write data from disk. IDE (Integrated Drive Electronics) is a old-fashioned disk interface.

1.3.1 IDE Interface

Disk, you know, that structure. A disk is divided in to many sectors, and we can only get one, and only one sector by one read or write operation. The address information about our IDE disk in MOS is as follows. The base address of IDE disk is 0x13000000, defined by macro DEV_DISK_ADDRESS. All offset below are based on this.

Offset	Macro	Description	Data Bytes
0x0000	`DEV_DISK_OFFSET`	read/write offset to the 0 address of disk	4
0x0008	`DEV_DISK_OFFSET_HIGH32`	high 32 bits of read/write offset	4
0x0010	`DEV_DISK_ID`	the ID of the disk to operate	4
0x0020	`DEV_DISK_START_OPERATION`	set the disk to start read/write operation	4
0x0030	`DEV_DISK_STATUS`	return status of disk	4
0x4000	`DEV_DISK_BUFFER`	buffer with the size of a sector	512

Disk is large, so 4 bytes of 4G memory range may be not enough. So you can use high 32 bit to manipulate on larger memory.

Often, we may have more than one disk, so we should specify which disk to operate. But here in MOS, we only have one, whose ID is, of course, 0.

DEV_DISK_START_OPERATION indicates the following operation is a read or write. It is either DEV_DISK_OPERATION_READ or DEV_DISK_OPERATION_WRITE.

DEV_DISK_STATUS will store the return value of the disk operation. Just read the value from this memory. 0 indicates error.

DEV_DISK_BUFFER, well, it is literally the buffer that store the content of a sector, whether it will be written into or loaded from the disk.

1.3.2 IDE Operation

As we’re familiar with IDE Interface, we can now read and write data through it. Notice that, whenever we read or write, we do this one sector a time. Here is the declaration of read and write operation. You can see that we do these sector by sector.

1
2
3

// fs/serv.h
void ide_read(u_int diskno, u_int secno, void* dst, u_int nsecs);
void ide_write(u_int diskno, u_int secno, void* src, u_int nsecs);

You may notice that these are put in serv.h, which is the header file for File System process. Well, this perhaps because that disk operation is needed for File System.

To set or read registers or buffer, we can simple use the functions we wrote in 1.2. So it can be relatively easy.

ide_read/write

The detailed process of read/write IDE disk is that, set disk ID and offset first, then begin read or write, and at last, get the return value. Just notice that, the order of start operation and prepare buffer memory is different. Since we have to fill buffer before we write it into the disk, and we can only get the valid memory after we read it from the disk.

void ide_read(u_int diskno, u_int secno, void* dst, u_int nsecs)
{
    u_int begin = secno * BY2SECT;
    u_int end = begin + nsecs * BY2SECT;
    u_int read_flag = DEV_DISK_OPERATION_READ;
    u_int ret;
    for (u_int offset = 0; begin + offset < end; offset += BY2SECT)
    {
        u_int dev_offset = begin + offset;
        panic_on(syscall_write_dev(&diskno,
                                   DEV_DISK_ADDRESS + DEV_DISK_ID,
                                   sizeof(diskno)));
        panic_on(syscall_write_dev(&dev_offset,
                                   DEV_DISK_ADDRESS + DEV_DISK_OFFSET,
                                   sizeof(dev_offset)));
        // read start
        panic_on(syscall_write_dev(&read_flag,
                                   DEV_DISK_ADDRESS + DEV_DISK_START_OPERATION,
                                   sizeof(read_flag)));
        panic_on(syscall_read_dev(dst + offset,
                                  DEV_DISK_ADDRESS + DEV_DISK_BUFFER,
                                  BY2SECT));
        // read end
        panic_on(syscall_read_dev(&ret,
                                  DEV_DISK_ADDRESS + DEV_DISK_STATUS,
                                  sizeof(ret)));
        if (ret == 0)   // failed
            user_panic("ide_read failed");
    }
}

void ide_write(u_int diskno, u_int secno, void* src, u_int nsecs)
{
    u_int begin = secno * BY2SECT;
    u_int end = begin + nsecs * BY2SECT;
    u_int write_flag = DEV_DISK_OPERATION_WRITE;
    u_int ret;
    for (u_int offset = 0; begin + offset < end; offset += BY2SECT)
    {
        u_int dev_offset = begin + offset;
        panic_on(syscall_write_dev(&diskno,
                                   DEV_DISK_ADDRESS + DEV_DISK_ID,
                                   sizeof(diskno)));
        panic_on(syscall_write_dev(&dev_offset,
                                   DEV_DISK_ADDRESS + DEV_DISK_OFFSET,
                                   sizeof(dev_offset)));
        // write begin
        panic_on(syscall_write_dev(src + offset,
                                   DEV_DISK_ADDRESS + DEV_DISK_BUFFER,
                                   BY2SECT));
        panic_on(syscall_write_dev(&write_flag,
                                   DEV_DISK_ADDRESS + DEV_DISK_START_OPERATION,
                                   sizeof(write_flag)));
        // write end
        panic_on(syscall_read_dev(&ret,
                                  DEV_DISK_ADDRESS + DEV_DISK_STATUS,
                                  sizeof(ret)));
        if (ret == 0)	// failed
            user_panic("ide_write failed");
    }
}

Well, we cannot handle disk hardware error, so we simply panic on any error.

Tips: Here, we should use user_panic, rather than panic. Since the latter one is for kernel.

2. File System

2.1 Disk Layout

The basic unit in disk is sector, as we already know. But a sector is only of 512 bytes, which is a little small, thus will make too many of them. So the File System merge multiple sectors into one block as the basic unit for Operating System. Usually, one block is of the same size as page.

There are some macros in MOS for such conversions.

// user/include/fs.h
#define BY2BLK BY2PG
#define BIT2BLK (BY2BLK * 8)
// fs/serv.h
#define BY2SECT 512                 // bytes of a sector
#define SECT2BLK (BY2BLK / BY2SECT) // how many sectors in a block (8)

And it is easy to imagine the layout of those blocks. Later, when we read or write disk, we always read or write a complete block, i.e. read or write multiple sectors in a row.

Block 0 stores the bootloader, and partition table of the disk. Super block stores the key information about the disk - the magic number, number of blocks and root directory. By reading this, our File System can know the general information about the blocks.

// user/include/fs.h
struct Super
{
	uint32_t s_magic;   // Magic number: FS_MAGIC
	uint32_t s_nblocks; // Total number of blocks on disk
	struct File s_root; // Root directory node
};

And the following several blocks are bitmap blocks. The number of it depends on the size of the disk. If there are N blocks, then there must be enough bitmap blocks to hold these N bits. You can get it by a simple division, just make sure there are enough bits.

1 2	nbitblock = (NBLOCK + BIT2BLK - 1) / BIT2BLK; nbitblock = NBLOCK / BIT2BLK + !!(NBLOCK % BIT2BLK); // Or this

2.2 Block Cache in Memory

In MOS, there is a wired design, I think, that map disk blocks to File System process’s memory. And here is how this is done.

As you can see, we mapped disk to memory block by block. Now you can see why I say wired, because it will limit our disk size to 1GB only. 😢

There are two macros for this mapping that indicate the address range for such cache.
1
2
3
// fs/serv.h
#define DISKMAP 0x10000000 // cache map start address
#define DISKMAX 0x40000000 // maximum disk size

Such mapping is quite easy, as it is one-to-one. So we just add a offset to block ID (bno/blockno in code) to convert it into corresponding virtual address. So we have this function.

1	void* diskaddr(u_int blockno);

diskaddr

Just remember to validate address range.

void* diskaddr(u_int blockno)
{
    u_long va = (DISKMAP + blockno * BY2BLK);
    panic_on(va < DISKMAP);
    panic_on(va >= DISKMAP + DISKMAX);
    return (void*)va;
}

Things become easier as block and page are of the same size. When we cache a block, we simple get its virtual address in memory via diskaddr function, and then allocate a page to store it.

2.3 Block Management

The fundamental work for File System is to manage all these blocks. As we know, it uses bitmap to record the use of blocks, so the basic mechanism is clear and set corresponding bits. And of course, allocate and free blocks when necessary.

2.3.1 Block Initialization

All these are actually done in File System process. So let’s have a brief look at it. Here is its main function.

// fs/serv.c
int main()
{
    user_assert(sizeof(struct File) == BY2FILE);
    debugf("FS is running\n");

    serve_init();
    fs_init();
    serve();

    return 0;
}

We can see that, it first initialize it self, then initialize File System, and at last, serve as File System. Here, we just need to know fs_init.

// fs/serv/c
void fs_init(void)
{
    read_super();
    check_write_block();
    read_bitmap();
}

Still, is separated into small tasks. First one is to read the Super block to get block info. Then, I think, is a self diagnose, read it as you will. And the last one is to read the bitmap block to get the current disk usage.

read_bitmap

Related functions will be introduced later. 😉

// fs/fs.c
uint32_t* bitmap;
void read_bitmap(void)
{
    void* blk = NULL;

    // Step 1: Calculate the number of the bitmap blocks, and read them into memory.
    u_int nbitmap = super->s_nblocks / BIT2BLK + !!(super->s_nblocks % BIT2BLK);
    for (u_int i = 0; i < nbitmap; i++)
        read_block(i + 2, blk, 0);
    bitmap = diskaddr(2);

    // Step 2: Make sure the reserved and root blocks are marked in-use.
    user_assert(!block_is_free(0));
    user_assert(!block_is_free(1));

    // Step 3: Make sure all bitmap blocks are marked in-use.
    for (u_int i = 0; i < nbitmap; i++)
        user_assert(!block_is_free(i + 2));

    debugf("read_bitmap is good\n");
}

2.3.2 Block Mapping

As mentioned above, we cache blocks in memory, so we have to maintain such mapping. It involves these two functions below. Mapping block is actually allocating a page for the block, and link it virtual address to the page.

1
2
3

// fs/serv.h
int map_block(u_int blockno);
void unmap_block(u_int blockno);

Actually, in Lab 5, unmap_block is not declared in fs/serv.h. They are in tests/lab5_3/mix_check.c. Err… And in our File System, we don’t yet use it. Perhaps not the time?

Before we handle these functions, let’s take a look at some auxiliary functions.

// fs/serv.c
void* block_is_mapped(u_int blockno); // return mapped va if is mapped
int block_is_free(u_int blockno);
int block_is_dirty(u_int blockno);

In bitmap, 1 represents free state.

block_is_xxx

// fs/serv.c
void* block_is_mapped(u_int blockno)
{
    void* va = diskaddr(blockno);
    if (va_is_mapped(va))
        return va;
    return NULL;
}

int block_is_free(u_int blockno)
{
    if (super == 0 || blockno >= super->s_nblocks)
        return 0;
    if (bitmap[blockno / 32] & (1 << (blockno % 32)))
        return 1;
    return 0;
}

int block_is_dirty(u_int blockno)
{
    void* va = diskaddr(blockno);
    return va_is_mapped(va) && va_is_dirty(va);
}

bitmap is a u_int array, one unit in bitmap is 4 Bytes, 32 bits. One u_int can store 32 bits, so that’s what 32 means here.

Actually the mapping status of block can be presented by that of the virtual address it corresponds to.

int va_is_mapped(void* va)
{
	return (vpd[PDX(va)] & PTE_V) && (vpt[VPN(va)] & PTE_V);
}

int va_is_dirty(void* va)
{
	return vpt[VPN(va)] & PTE_DIRTY;
}

With these auxiliary functions, it would be easy to map blocks.

map_block

int map_block(u_int blockno)
{
    // Step 1: If the block is already mapped in cache, return 0.
    if (block_is_mapped(blockno))
        return 0;

    // Step 2: Alloc a page in permission 'PTE_D' via syscall.
    // Hint: syscall_mem_alloc will link va to the new page automatically.
    return syscall_mem_alloc(0, diskaddr(blockno), PTE_D);
}

void unmap_block(u_int blockno)
{
    // Step 1: Get the mapped address of the cache page of this block
    void* va = block_is_mapped(blockno);
    if (!va)    // not mapped
        return;

    // Step 2: If this block is used (not free) and dirty in cache, write it back
    // to the disk first.
    if (!block_is_free(blockno) && block_is_dirty(blockno))
        write_block(blockno);

    // Step 3: Unmap the virtual address via syscall.
    syscall_mem_unmap(0, va);

    // Make sure block is unmapped.
    user_assert(!block_is_mapped(blockno));
}

2.3.3 Block Allocation

Then, we got to handle block allocation. Here are the declaration of related functions. Just allocate and free, huh?

1
2
3

// fs/serv.h
int alloc_block(void);
void free_block(u_int blockno)

The same as unmap_block, free_block is also declared in tests/lab5_3/mix_check.c with the same reason I guessed.

To allocate a block, we have to traverse all blocks to find the first free block, because we want to make use of limited number of blocks. Then, we just modify the bitmap block to mark it allocated.

Important! As we modified the bitmap block, we got to write it back to the disk to make the change persistent.

There are two steps in alloc_block. First, we should allocate a block, just allocate one by marking it allocated. Then, we use map_block to map it to its virtual address.

alloc_block

int alloc_block(void)
{
    int r, bno;
    // Step 1: find a free block.
    if ((r = alloc_block_num()) < 0)  // failed
        return r;
    bno = r;

    // Step 2: map this block into memory.
    if ((r = map_block(bno)) < 0)     // failed
    {
        free_block(bno);  // exception safe rollback
        return r;
    }

    // Step 3: return the allocated block number.
    return bno;
}

// Real allocation and return the block id.
int alloc_block_num(void)
{
    /*
     * Walk through this bitmap, find a free one and mark it as used, then sync
     * this block to IDE disk from memory.
     */
    for (int blockno = 3; blockno < super->s_nblocks; blockno++)
    {
        if (bitmap[blockno / 32] & (1 << (blockno % 32)))
        {
            bitmap[blockno / 32] &= ~(1 << (blockno % 32));
            /*
             * Write the affected bitmap block to disk. (blockno / BIT2BLK)
             * is the bitmap block order, 2 is the reserved two blocks.
             */
            write_block(blockno / BIT2BLK + 2);
            return blockno;
        }
    }

    return -E_NO_DISK;  // no free blocks
}

free_block is relatively simpler. Since here we cache blocks, so there’s no need to free the page. We can reuse the page next time.

free_block

void free_block(u_int blockno)
{
    // Step 1: If 'blockno' is invalid, return.
    if ((blockno == 0) || (blockno >= super->s_nblocks))
        return;

    // Step 2: Set the flag bit of 'blockno' in 'bitmap'.
    bitmap[blockno / 32] |= (1 << (blockno % 32));
}

2.3.4 Read & Write Block

Finally, we just need to read and write block to finish this part.

1
2
3

// fs/fs.c
int read_block(u_int blockno, void** blk, u_int* isnew);
void write_block(u_int blockno);

read/write_block

// fs/fs.c
int read_block(u_int blockno, void** blk, u_int* isnew)
{
    // Step 1: validate blockno. Make file the block to read is within the disk.
    if (super && blockno >= super->s_nblocks)
        user_panic("reading non-existent block %08x\n", blockno);

    // Step 2: validate this block is used, not free.
    if (bitmap && block_is_free(blockno))
        user_panic("reading free block %08x\n", blockno);

    // Step 3: transform block number to corresponding virtual address.
    void* va = diskaddr(blockno);

    // Step 4: read disk and set *isnew.
    if (block_is_mapped(blockno))
    {
        // the block is already in memory
        if (isnew)
            *isnew = 0;
    }
    else
    {
        // the block is not in memory, and will be cached the first time
        if (isnew)
            *isnew = 1;
        map_block(blockno);	// map (cache) block into memory
        ide_read(0, blockno * SECT2BLK, va, SECT2BLK);
    }

    // Step 5: if blk != NULL, assign 'va' to '*blk'.
    if (blk)
        *blk = va;

    return 0;
}

void write_block(u_int blockno)
{
    // Step 1: detect if this block is mapped, if not, can't write it's data to disk.
    if (!block_is_mapped(blockno))
        user_panic("write unmapped block %08x", blockno);

    // Step2: write data to IDE disk. (using ide_write, and the diskno is 0)
    ide_write(0, blockno * SECT2BLK, diskaddr(blockno), SECT2BLK);
}

2.4 File Layout

2.4.1 File & Block

Although block is fundamental for File System, users cannot accept such low level concept. So we use files to cover all these ugly stuffs. Here is a good figure to show the mechanism of how file covers block.

Since the blocks are stored as type struct Block, we can simple access them by subscription, so in struct File, we can only record the index of the block it has. File::f_indirect uses a block if necessary to store more direct links. We can get the number of blocks of a file using its size. Let’s assume there is a File* file.
1
int nblk = file->f_size / BY2BLK;
Then we could have this conversion from file block id to disk block id. To make it easier, we simply omitted the first 10 (NDIRECT) links in the indirect block to make file block id continuous. Thus we could have such code.
1
2
3
4
if (i < NDIRECT)
     bno = file->f_direct[i];
else
     bno = ((uint32_t*)blocks[file->f_indirect])[i];
Notice: These two are just pseudo-code.

And here is the actual declaration of file.

struct File

This is in fs/fs.h.

// Maximum size of a filename (a single path component), including null.
#define MAXNAMELEN 128

// Maximum size of a complete pathname, including null.
#define MAXPATHLEN 1024

// Number of (direct) block pointers in a File descriptor.
#define NDIRECT 10
#define NINDIRECT (BY2BLK / 4)

// Maximum file size on disk.
#define MAXFILESIZE (NINDIRECT * BY2BLK)

#define BY2FILE 256  // size of struct File

struct File
{
    char f_name[MAXNAMELEN]; // filename
    uint32_t f_size;         // file size in bytes
    uint32_t f_type;         // file type
    uint32_t f_direct[NDIRECT];
    uint32_t f_indirect;
    
    // the pointer to the directory where this file is in, valid only in memory.
    struct File* f_dir;
    char f_pad[BY2FILE - MAXNAMELEN - (3 + NDIRECT) * 4 - sizeof(void*)];
} __attribute__((aligned(4), packed));

#define FILE2BLK (BY2BLK / sizeof(struct File))  // 16

We can see that, there are 1024 (NINDIRECT) links available, which means a file can have at most 1024 blocks. Thus the maximum file size our MOS can handle is 4M.

Notice: It seems, our filename is the absolute path name. 😧

ℹ If you want to know how exactly we get disk block id from file block id, hold your horse! I’ll talk about them in 2.4.4. 🙂

2.4.2 File Descriptor

In Linux, when we open a file using UNIX style open, we will get a file descriptor representing the file we opened. The same goes with MOS. Let’s see its definition.

// user/include/fd.h
struct Fd
{
    u_int fd_dev_id;
    u_int fd_offset;
    u_int fd_omode;
};

struct Filefd
{
    struct Fd f_fd;
    u_int f_fileid;
    struct File f_file;
};

struct Fd is the basic description of a file, it indicate the file is a regular file, a console or a pipe, and some basic information. Then, struct Filefd is the specific descriptor for file.

We arrange an area in the memory for file descriptor and file content, and they can be accessed via a simple conversion. Here is the layout of file descriptor and content in a process’ memory.

Important! Now we’ve had this layout in our mind. Remember, the address in file descriptor table is the same as the address of struct Fd and struct Filefd. So… Did you notice that? The first member in struct Filefd is just struct Fd, which means we can simply apply force conversion between them, or get them directly from the virtual address.

In MOS, we support at most 32 (MAXFD) file descriptors for one process. We want to make file descriptor table to take up a whole page directory, so we just make it a 4 MB segment. We assign 4 KB for each struct Filefd, and each of them corresponds to a 4 MB size segment as the content of the file. The conversion can be simple.

// user/lib/fd.h
#define MAXFD 32
#define FILEBASE 0x60000000
#define FDTABLE  (FILEBASE - PDMAP)
#define INDEX2FD(i)   (FDTABLE  + (i) * BY2PG)
#define INDEX2DATA(i) (FILEBASE + (i) * PDMAP)
// user/lib/fd.c
void* fd2data(struct Fd* fd) { return (void*)INDEX2DATA(fd2num(fd)); }
int   fd2num(struct Fd* fd)  { return ((u_int)fd - FDTABLE) / BY2PG; }
int   num2fd(int fd)         { return FDTABLE + fd * BY2PG; }

So when we open a file, we first prepare the file descriptor, then map the file to its 4 MB segment.

Previously, we know that MOS can only handle 4 MB file. So this is enough.

2.4.3 Allocate & Free File Descriptor

Well, the arrangement of these two function is not that pleasing. Perhaps due the the use of them. fd_alloc is used by more functions when opening different devices, while close can handle all devices.

// user/include/fd.h
int fd_alloc(struct Fd** fd);
// user/lib/fd.c
void fd_close(struct Fd* fd);

When we allocate a file descriptor, we only find the first empty 4 KB segment, and return its virtual address. We do not allocate page for it yet! The page allocation will be done later when initialize the descriptor by File System process. So when we close a file descriptor, we simple un-map the page from the virtual address.

fd_alloc/close

int fd_alloc(struct Fd** fd)
{
    for (u_int fdno = 0; fdno < MAXFD - 1; fdno++)
    {
        u_int va = INDEX2FD(fdno);
        if ((vpd[VPD(va)] & PTE_V) == 0)
        {
            // Later, a new page directory entry will be
            // created at the same time.
            *fd = (struct Fd*)va;
            return 0;
        }
        if ((vpt[VPN(va)] & PTE_V) == 0)
        {
            *fd = (struct Fd*)va;
            return 0;
        }
    }

    return -E_MAX_OPEN;
}

void fd_close(struct Fd* fd)
{
    syscall_mem_unmap(0, fd);
}

One more thing, when we want to look up a file descriptor, we use the id of the file descriptor (what we get from fd2num). We cannot find the file descriptor immediately after fd_alloc, since then, no actual page is allocated. We can only get it after File System initialized it for us.

fd_lookup

int fd_lookup(int fdnum, struct Fd** fd)
{
    if (fdnum >= MAXFD)
        return -E_INVAL;

    u_int va = INDEX2FD(fdnum);
    if ((vpt[VPN(va)] & PTE_V) != 0)  // fd is used
    {
        *fd = (struct Fd*)va;
        return 0;
    }

    return -E_INVAL;
}

2.4.4 Traverse File Blocks

We can use file_block_walk to get the corresponding disk block id of the given file block id. This one is quite similar to its peer pgdir_walk. And we wrap it to do the conversion from file block id to disk block id as file_map_block.

1
2
3

// fs/fs.c
int file_map_block(struct File* f, u_int filebno, u_int* diskbno, u_int alloc)
int file_block_walk(struct File* f, u_int filebno, uint32_t** ppdiskbno, u_int alloc);

file_map_block

// fs/fs.c
int file_map_block(struct File* f, u_int filebno, u_int* diskbno, u_int alloc)
{
	int r;
	// I think this pointer is redundant. We can simply use
	// uint32_t bno instead.
	uint32_t* ptr;

	// Step 1: find the pointer for the target block.
	if ((r = file_block_walk(f, filebno, &ptr, alloc)) < 0)
		return r;

	// Step 2: if the block not exists, and create is set, alloc one.
	if (*ptr == 0)
	{
		if (alloc == 0)
			return -E_NOT_FOUND;

		if ((r = alloc_block()) < 0)
			return r;

		*ptr = r;
	}

	// Step 3: set the pointer to the block in *diskbno and return 0.
	*diskbno = *ptr;

	return 0;
}

file_block_walk

int file_block_walk(struct File* f, u_int filebno, uint32_t** ppdiskbno, u_int alloc)
{
    int r;
    uint32_t* ptr; // ptr is the address of diskbno, so *ptr is diskbno
    uint32_t* blk;

    if (filebno < NDIRECT)
    {
        // Step 1: if the target block is corresponded to a direct pointer, just return the
        // disk block number.
        ptr = &f->f_direct[filebno];
    }
    else if (filebno < NINDIRECT)
    {
        // Step 2: if the target block is corresponded to the indirect block, but there's no
        //  indirect block and `alloc` is set, create the indirect block.
        if (f->f_indirect == 0)
        {
            if (alloc == 0)
                return -E_NOT_FOUND;
            if ((r = alloc_block()) < 0)
                return r;
            f->f_indirect = r;
        }
        // Step 3: read the new indirect block to memory.
        if ((r = read_block(f->f_indirect, (void**)&blk, 0)) < 0)
            return r;
        ptr = blk + filebno;
    }
    else
        return -E_INVAL;

    // Step 4: store the result into *ppdiskbno, and return 0.
    *ppdiskbno = ptr;

    return 0;
}

I just don’t understand why here we pass diskbno as a double-pointer, which, obviously, can be done via a simple pointer. ☹️

Similarly, we have file_get_block to directly get the block content from its file block id.

1 2	// fs/serv.h int file_get_block(struct File f, u_int blockno, void *pblk);

file_get_block

// fs/fs.c
int file_get_block(struct File* f, u_int filebno, void** blk)
{
	int r;
	u_int diskbno;
	u_int isnew;

	// Step 1: find the disk block number is `f` using `file_map_block`.
	if ((r = file_map_block(f, filebno, &diskbno, 1)) < 0)
		return r;

	// Step 2: read the data in this disk to blk.
	if ((r = read_block(diskbno, blk, &isnew)) < 0)
		return r;

	return 0;
}

2.5 File Operations

Finally, we should provide interface for user processes to manipulate files. All our file operation are handed to File System process via IPC, and File System will do all stuffs for us and deliver necessary data back to us. Here I just list some of them.

// user/include/lib.h
int fsipc_open(const char* path, u_int omode, struct Fd* fd);
int fsipc_map(u_int fileid, u_int offset, void* dstva);
int fsipc_close(u_int fileid);
int fsipc_remove(const char* path);

For more information on how File System process handle these requests, please refer to may another post.

File System IPC

File operations are provided to users by library functions, such as open and close, read and write.

2.5.1 Device Interface

After we open a file, or to be more general, open a device, we can just do many operations just using device, instead of by requesting File System. We have three types of devices, and each can be represented by a Device structure. Are we getting it? We are using function pointers to achieve polymorphism.

struct Dev
{
    int dev_id;
    char* dev_name;
    int (*dev_read) (struct Fd*, void*, u_int, u_int);
    int (*dev_write)(struct Fd*, const void*, u_int, u_int);
    int (*dev_close)(struct Fd*);
    int (*dev_stat) (struct Fd*, struct Stat*);
    int (*dev_seek) (struct Fd*, u_int);
};

Then we have these three devices.

1
2
3

extern struct Dev devcons;
extern struct Dev devfile;
extern struct Dev devpipe;

Then, let’s see how devfile is initialized. So the problem now is how these concrete functions are implemented.

struct Dev devfile = {
    .dev_id    = 'f',
    .dev_name  = "file",
    .dev_read  = file_read,
    .dev_write = file_write,
    .dev_close = file_close,
    .dev_stat  = file_stat,
};

file_read reads n bytes from the given file from given offset.

file_read

static int file_read(struct Fd* fd, void* buf, u_int n, u_int offset)
{
    struct Filefd* f = (struct Filefd*)fd;
    u_int size = f->f_file.f_size;
    
    // Avoid reading past the end of file.
    if (offset > size)
        return 0;
    if (offset + n > size)
        n = size - offset;
    memcpy(buf, (char*)fd2data(fd) + offset, n);

    return n;
}

file_write is similar to its read brother.

file_write

Generally, there is not much difference, but for write, we have to enlarge the file when necessary.

static int file_write(struct Fd* fd, const void* buf, u_int n, u_int offset)
{
    int r;

    struct Filefd* f = (struct Filefd*)fd;
    u_int tot = offset + n;

    // Don't write more than the maximum file size.
    if (tot > MAXFILESIZE)
        return -E_NO_DISK;

    // Increase the file's size if necessary
    if (tot > f->f_file.f_size)
    {
        if ((r = ftruncate(fd2num(fd), tot)) < 0)
            return r;
    }

    // Write the data
    memcpy((char*)fd2data(fd) + offset, buf, n);

    return n;   // return bytes written
}

int ftruncate(int fdnum, u_int size)
{
    if (size > MAXFILESIZE)
        return -E_NO_DISK;

    int r;
    struct Fd* fd;
    if ((r = fd_lookup(fdnum, &fd)) < 0)
        return r;

    if (fd->fd_dev_id != devfile.dev_id)
        return -E_INVAL;

    struct Filefd* f = (struct Filefd*)fd;
    u_int fileid = f->f_fileid;
    u_int oldsize = f->f_file.f_size;
    f->f_file.f_size = size;

    if ((r = fsipc_set_size(fileid, size)) < 0)
        return r;

    void* va = fd2data(fd);

    // Map any new pages needed if extending the file
    for (u_int i = ROUND(oldsize, BY2PG); i < ROUND(size, BY2PG); i += BY2PG)
    {
        if ((r = fsipc_map(fileid, i, va + i)) < 0)
        {
            fsipc_set_size(fileid, oldsize);
            return r;
        }
    }

    // Unmap pages if truncating the file
    for (u_int i = ROUND(size, BY2PG); i < ROUND(oldsize, BY2PG); i += BY2PG)
    {
        if ((r = syscall_mem_unmap(0, (void*)(va + i))) < 0)
            user_panic("ftruncate: syscall_mem_unmap %08x: %e", va + i, r);
    }

    return 0;
}

When we close a file, we first should mark it dirty to save any changes back to the disk. Then, we just un-map its corresponding virtual address.

file_close

int file_close(struct Fd* fd)
{
    int r;

    struct Filefd* ffd = (struct Filefd*)fd;
    u_int fileid = ffd->f_fileid;
    u_int size = ffd->f_file.f_size;

    // Set the start address storing the file's content.
    void* va = fd2data(fd);

    // Tell the file server the dirty page.
    for (u_int i = 0; i < size; i += BY2PG)
        fsipc_dirty(fileid, i);

    // Request the file server to close the file with fsipc.
    if ((r = fsipc_close(fileid)) < 0)
    {
        debugf("cannot close the file\n");
        return r;
    }

    // Unmap the content of file, release memory.
    if (size == 0)
        return 0;

    for (u_int i = 0; i < size; i += BY2PG)
    {
        if ((r = syscall_mem_unmap(0, (void*)(va + i))) < 0)
        {
            debugf("cannont unmap the file.\n");
            return r;
        }
    }

    return 0;
}

Well, there’s not much to say about file_stat, it just get the statistics of a file.

file_stat

static int file_stat(struct Fd* fd, struct Stat* st)
{
    struct Filefd* f = (struct Filefd*)fd;

    strcpy(st->st_name, f->f_file.f_name);
    st->st_size = f->f_file.f_size;
    st->st_isdir = f->f_file.f_type == FTYPE_DIR;

    return 0;
}

2.5.2 Library Interface

We cannot let users manipulate device directly, so we hide those details and provide some library interface for them. For the functions mentioned in 2.5.1, we have corresponding library functions for each of them.

// user/include/lib.h
int close(int fd);
int read(int fd, void *buf, u_int nbytes);
int write(int fd, const void *buf, u_int nbytes);
int stat(const char* path, struct Stat* stat);

int close(int fdnum)
{
    int r;
    
    struct Fd* fd;
    if ((r = fd_lookup(fdnum, &fd)) < 0)
        return r;
    struct Dev* dev;
    if ((r = dev_lookup(fd->fd_dev_id, &dev)) < 0)
        return r;

    r = (*dev->dev_close)(fd);
    fd_close(fd);

    return r;
}

ℹ This is a little special, please refer to File System IPC for more information. 😙

read

int read(int fdnum, void* buf, u_int n)
{
    int r;

    struct Fd* fd;
    if ((r = fd_lookup(fdnum, &fd)) < 0)
        return r;
    struct Dev* dev;
    if ((r = dev_lookup(fd->fd_dev_id, &dev)) < 0)
        return r;

    if (fd->fd_omode == O_WRONLY)
        return -E_INVAL;
    r = dev->dev_read(fd, buf, n, fd->fd_offset);
    
    if (r > 0)
        fd->fd_offset += r;

    return r;
}

// read n bytes not at once
int readn(int fdnum, void* buf, u_int n)
{
    int m, tot;

    for (tot = 0; tot < n; tot += m)
    {
        m = read(fdnum, (char*)buf + tot, n - tot);
        if (m < 0)
            return m;
        if (m == 0)
            break;
    }

    return tot;
}

write

int write(int fdnum, const void* buf, u_int n)
{
    int r;
    
    struct Fd* fd;
    if ((r = fd_lookup(fdnum, &fd)) < 0)
        return r;
    struct Dev* dev;
    if ((r = dev_lookup(fd->fd_dev_id, &dev)) < 0)
        return r;

    if ((fd->fd_omode & O_ACCMODE) == O_RDONLY)
        return -E_INVAL;

    r = dev->dev_write(fd, buf, n, fd->fd_offset);
    if (r > 0)
        fd->fd_offset += r;

    return r;
}

stat

int stat(const char* path, struct Stat* stat)
{
    int fd, r;

    if ((fd = open(path, O_RDONLY)) < 0)
        return fd;
    r = fstat(fd, stat);
    close(fd);
    
    return r;
}

int fstat(int fdnum, struct Stat* stat)
{
    int r;
    struct Dev* dev = NULL;
    struct Fd* fd;

    if ((r = fd_lookup(fdnum, &fd)) < 0 || (r = dev_lookup(fd->fd_dev_id, &dev)) < 0)
    {
        return r;
    }

    stat->st_name[0] = 0;
    stat->st_size = 0;
    stat->st_isdir = 0;
    stat->st_dev = dev;
    
    return (*dev->dev_stat)(fd, stat);
}

2.5.3 Open & Close a File

Above is operations after we open a file. But, we have to open it first. So… ? 🫤 Let’s open it. The opening process can be divided into several steps.

Allocate a page for file description. As stated above, we simple get a candidate virtual address and do nothing to it. So if we do this again, we’ll get the same virtual address.
Initialize the virtual address by requesting File System. File System will open the file and store the file descriptor in the virtual address we got in the first step. Of course, it will allocate a page to the address first. 🤪
Then, as we have the file descriptor allocate pages for the content of the file and map them to the 4 MB segment.
At last, return the file descriptor number to the caller.

open

int open(const char* path, int mode)
{
    int r;

    // Step 1: Alloc a new 'Fd' using 'fd_alloc' in fd.c.
    struct Fd* fd;
    if ((r = fd_alloc(&fd)) < 0)
        return r;

    // Step 2: Prepare the 'fd' using 'fsipc_open' in fsipc.c.
    if ((r = fsipc_open(path, mode, fd)) < 0)
        return r;

    // Step 3: Set 'va' to the address of the page where the 'fd''s data is cached.
    char* va = fd2data(fd);
    struct Filefd* ffd = (struct Filefd*)fd;
    u_int fileid = ffd->f_fileid;
    u_int size = ffd->f_file.f_size;

    // Step 4: Alloc pages and map the file content using 'fsipc_map'.
    for (int i = 0; i < size; i += BY2PG)
    {
        if ((r = fsipc_map(fileid, i, va + i)) < 0)
            return r;
    }

    // Step 5: Return the number of file descriptor using 'fd2num'.
    return fd2num(fd);
}

Compared to opening a file, it would be much easier to close it.

int close(int fdnum)
{
    int r;
    struct Dev* dev = NULL;
    struct Fd* fd;

    if ((r = fd_lookup(fdnum, &fd)) < 0 || (r = dev_lookup(fd->fd_dev_id, &dev)) < 0)
    {
        return r;
    }

    r = (*dev->dev_close)(fd);
    fd_close(fd);

    return r;
}