Deep learning-based streamflow models such as long short-term memory (LSTM) networks show great promise as tools for improving streamflow prediction in places with little to no preexisting streamflow data. When evaluated on large samples of basins (more than 500), LSTM models perform very well and often outperform established process-based models like the National Water Model (NWM), but they tend to underperform in arid and semi-arid environments. However, these large-sample studies do not evaluate the models on basins smaller than 5 square kilometers and generally use only a few average performance metrics to compare prediction accuracy. In this study, I compare the performance of daily LSTM and NWM models in six mountainous, semi-arid basins in the Reynolds Creek Experimental Watershed (southwest Idaho, United States). The compared models include finetuned and continental US (CONUS)-trained LSTM models as well as three editions of the NWM. The study basins range in size from 0.1 to 240 square kilometers and span an elevation gradient of 1,145 meters. To capture the strengths and weaknesses of the individual models, model performance is evaluated using twelve process-related hydrologic signatures. In this study area, all the models perform poorly on the smallest watersheds, and the NWM outperforms the CONUS-trained LSTM model. Finetuning the LSTM with basins similar in climate, region, and location was evaluated as a way of improving LSTM performance in this challenging environment. Only finetuning with the local basin (the Reynolds Creek outlet) was effective at improving LSTM performance. Overall, the hydrologic signature metrics indicate that the models have varying predictive strengths and that no single model can be labeled the all-around best. This research highlights promising research directions in model evaluation, LSTM finetuning, and hydrological modeling in small watersheds.